Hackathon Showcase

Goblin

Goblin enables plan-based A/B testing. On ~6,000 FrancophonIA samples, claude-sonnet scored 59.78% versus 55.05% for haiku, at about 21.7x the cost.

4 members

Project Description

Challenge Note:

We are submitting to the AI Product Innovation challenge not for the use case itself, but for the AI orchestration model that Goblin is built on, which allows asynchronous graph execution.

Overview:

Goblin is a purpose-built AI-powered compliance platform that helps small and mid-sized companies meet the EU’s Digital Services Act (DSA) requirements without building costly in-house compliance operations tools.

Goblin automates hate speech detection, generates output for audit-ready reports, and continuously optimizes performance by benchmarking multiple AI models through A/B testing. This reduces manual workloads for Trust & Safety teams and frees Engineering to focus on core product development.

Functionality:

Identified a critical gap in tooling for AI comparison, model deployment, and seamless integration, specifically for hate-speech identification and moderation.
Built a working engine using an acyclic graph approach with near-plain-English configuration, similar in spirit to NGINX Unit but simpler and more flexible for complex workflows.
Executed a real-world test case: evaluated multiple AI models for hate speech detection using the same prompts, generating measurable, differentiated results.
Supports building and testing an arbitrary number of models with varied prompts, collecting datasets, and routing outputs to user-defined endpoints, all while maintaining a production-ready workflow.
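
To illustrate the asynchronous graph execution described above, here is a minimal sketch in Python using asyncio. It is not Goblin's actual engine or configuration format: the plan layout (step name mapped to dependencies and an async function), run_plan, and the toy keyword "models" are all hypothetical stand-ins. The sketch assumes the plan is acyclic and that every dependency names an existing step.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

# Hypothetical plan format: step name -> (dependency names, async function).
Step = Callable[[Dict[str, object]], Awaitable[object]]
Plan = Dict[str, Tuple[List[str], Step]]


async def run_plan(plan: Plan) -> Dict[str, object]:
    """Run each step once; a step starts as soon as its dependencies finish."""
    tasks: Dict[str, asyncio.Task] = {}

    async def run_step(name: str) -> object:
        deps, func = plan[name]
        # Awaiting upstream tasks lets independent branches run concurrently.
        inputs = {dep: await tasks[dep] for dep in deps}
        return await func(inputs)

    # All tasks are created before any of them runs, because this loop
    # never yields to the event loop. The plan is assumed to be acyclic.
    for name in plan:
        tasks[name] = asyncio.create_task(run_step(name))
    await asyncio.gather(*tasks.values())
    return {name: task.result() for name, task in tasks.items()}


# Toy steps standing in for data loading and two models under comparison.
async def load(_inputs):
    return ["you are great", "I hate you"]


async def model_a(inputs):
    return ["hate" in text for text in inputs["load"]]


async def model_b(inputs):
    return [False for _ in inputs["load"]]


async def compare(inputs):
    a, b = inputs["model_a"], inputs["model_b"]
    # Fraction of samples on which the two models agree.
    return sum(x == y for x, y in zip(a, b)) / len(a)


PLAN: Plan = {
    "load": ([], load),
    "model_a": (["load"], model_a),
    "model_b": (["load"], model_b),
    "compare": (["model_a", "model_b"], compare),
}

results = asyncio.run(run_plan(PLAN))
print(results["compare"])
```

In this sketch the two model steps depend only on the load step, so they run concurrently, and the compare step waits for both.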

Customer Validation:

Natalie Bridgers, Streamplace: “A product like you are describing would be exceptionally helpful for AI moderation tools present at stream.place.”
Joel Kaiser, First Rule: “This scripting engine concept would be exceptionally helpful for our more advanced model fine tuning; we could easily see us using this as a production service immediately.”
Casey O’Malley, CB Insights: “The tool you’re describing to me would be useful for my job function as a Data Scientist.” (He even joined the team for the hackathon after the original pitch!)

Design & UX:

Simple, intuitive run-time workflow enables users to configure compliance pipelines without heavy technical overhead.
Users can use their programming language of choice.
Designed for a seamless demo: upload data → run A/B model comparison → receive output that can be quickly adapted into a compliance-ready report.

Feasibility:

Strong personal experience: team members have worked directly on compliance-heavy operations and understand the pain points.
Comparable companies (e.g., Statsig) show billion-dollar acquisition potential for related AI/optimization platforms.
The compliance angle addresses an immediate high-stakes market need, while the underlying engine can generalize to other workflows.
For our use case, A/B testing showed a 21.7x difference in cost between models without a correspondingly large difference in performance.
For companies with high UGC volume, this translates into hundreds of thousands of dollars in cost savings per month.

Team Execution:

Built a composable system supporting local models (BOW, HuggingFace+, LightGBM) and API models (Claude, OpenAI) with secure credential management.
Abstracts datasets, features, and algorithms into plug-and-play components. Integrates with the Goblin execution engine for A/B testing and automated evaluation pipelines across Berkeley and custom datasets.
Built a fully generalizable runtime environment for executing AI workflows, including generative AI and traditional ML.
Executed on a real-world use case and found an opportunity for our ideal customer profile (ICP) to save hundreds of thousands of dollars.
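
The plug-and-play idea above can be sketched as a shared classifier interface that an A/B harness scores on identical labeled samples. This is an illustrative sketch only, not the team's code: Classifier, ABResult, ab_test, the toy models, and the per-call costs are hypothetical, and accuracy is assumed as the metric.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: a classifier maps a text to True (flagged) or False.
Classifier = Callable[[str], bool]


@dataclass
class ABResult:
    name: str
    accuracy: float
    cost: float  # total cost in whatever unit cost_per_call uses


def ab_test(
    models: Dict[str, Tuple[Classifier, float]],
    dataset: Sequence[Tuple[str, bool]],
) -> List[ABResult]:
    """Score every model on the same labeled samples and total its cost."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    results = []
    for name, (classify, cost_per_call) in models.items():
        correct = sum(classify(text) == label for text, label in dataset)
        results.append(
            ABResult(name, correct / len(dataset), cost_per_call * len(dataset))
        )
    return results


# Toy data and stand-in models; a real run would wrap local or API models.
dataset = [
    ("I hate you", True),
    ("nice stream", False),
    ("you people are vermin", True),
    ("good game", False),
]
models = {
    "keyword": (lambda text: "hate" in text.lower(), 1.0),
    "always_false": (lambda text: False, 0.5),
}

results = ab_test(models, dataset)
for result in results:
    print(result)
```

Because every model sees the same samples, accuracy and cost can be compared directly; swapping in a different model only requires another entry in the models dictionary.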

Results:

Using Goblin as a runtime for A/B testing, we compared the Anthropic models Sonnet and Haiku. The performance difference between the smaller and larger model was only 3.5%, while the cost differed by a factor of 21.7. Over millions of instances of UGC per day, that is a six-figure saving month over month, which makes the comparison well worth running.
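
As a rough sketch of how a 21.7x cost ratio turns into monthly savings, the arithmetic can be written out as below. Only the 21.7 ratio comes from the results above; the daily volume and the per-item cost of the cheaper model are hypothetical placeholders, not measured values.

```python
def monthly_savings(
    items_per_day: float,
    cheap_cost_per_item: float,
    cost_ratio: float,
    days: int = 30,
) -> float:
    """Savings from using the cheaper model instead of the costlier one."""
    cheap_total = items_per_day * days * cheap_cost_per_item
    expensive_total = cheap_total * cost_ratio
    return expensive_total - cheap_total


COST_RATIO = 21.7  # reported cost ratio between the two models

# A cost ratio of r corresponds to a percentage increase of (r - 1) * 100.
percent_increase = (COST_RATIO - 1) * 100

# Hypothetical inputs: 1,000,000 items per day, 0.0002 dollars per item.
savings = monthly_savings(1_000_000, 0.0002, COST_RATIO)

print(f"Cost increase: {percent_increase:,.0f}%")
print(f"Monthly savings: ${savings:,.0f}")
```

Under these assumed inputs the cheaper model costs about 6,000 dollars per month and the savings come to about 124,200 dollars per month; real figures depend on actual volume and token pricing.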

Prior Work

See the GitHub history if curious. Some simple project setup, well within what you would find acceptable, was done on Friday, such as instantiating a uv Python project, but nothing more than that. Willy B also has real production experience with this exact product use case and had conceived the idea of a plan of scripts and an arbitrary execution manager, but had not begun building them.

All real work began in earnest on the second day of the hackathon, after proper planning and aligning our use case to product-market fit.