Model benchmarking, A/B testing, prompt evaluation, and quality metrics. Make data-driven decisions about which model and prompt work best for your use case.
```
# Run a comparative benchmark
POST /v1/benchmarks
{
  "suite": "translation-quality",
  "models": ["gpt-4o", "claude-sonnet-4-20250514"],
  "metrics": ["bleu", "rouge_l", "exact_match"],
  "dataset": "ds_translation_en_fr"
}

# Response — per-model scores
{
  "status": "completed",
  "results": {
    "gpt-4o": { "bleu": 0.82, "rouge_l": 0.89 },
    "claude-sonnet-4-20250514": { "bleu": 0.87, "rouge_l": 0.91 }
  },
  "winner": "claude-sonnet-4-20250514"
}
```
Everything you need to evaluate AI models systematically and make confident decisions.
Organize test cases into reusable suites with inputs, expected outputs, and metadata for structured evaluation.
Run the same suite against multiple models simultaneously. Compare quality, latency, token usage, and cost side by side.
Compare prompt variants on the same model with statistical confidence. Get win rates and confidence intervals.
Built-in BLEU, ROUGE-L, exact match, contains, regex match, latency, cost, and token efficiency scoring.
Import and export evaluation datasets in JSONL or CSV. Tag entries for targeted evaluation runs.
Detailed per-model summaries with average metrics, latency percentiles, cost breakdown, and individual case results.
Create structured test suites with inputs, expected outputs, and the metrics you care about. Reuse them across benchmarks and models.
```
# Create a test suite
POST /v1/suites
{
  "name": "Translation Quality",
  "testCases": [
    {
      "input": "Translate to French: Hello world",
      "expected": "Bonjour le monde",
      "tags": ["greeting"]
    },
    {
      "input": "Translate to French: Good morning",
      "expected": "Bonjour",
      "tags": ["greeting"]
    }
  ],
  "metrics": ["bleu", "rouge_l", "exact_match"]
}
```
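For suites with many cases, the payload can be assembled programmatically before posting it. A minimal sketch — the `/v1/suites` request shape above is the only assumed contract, and the `build_suite` helper is hypothetical, not part of any SDK:

```python
import json

def build_suite(name, pairs, metrics, tags=None):
    """Assemble a /v1/suites payload from (input, expected) pairs."""
    return {
        "name": name,
        "testCases": [
            {"input": i, "expected": e, "tags": tags or []}
            for i, e in pairs
        ],
        "metrics": metrics,
    }

suite = build_suite(
    "Translation Quality",
    [("Translate to French: Hello world", "Bonjour le monde"),
     ("Translate to French: Good morning", "Bonjour")],
    ["bleu", "rouge_l", "exact_match"],
    tags=["greeting"],
)
body = json.dumps(suite, indent=2)  # ready to POST to /v1/suites
```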
Find the best prompt for your use case. Compare multiple variants on the same model and dataset with statistical confidence.
```
# A/B test two prompt strategies
POST /v1/ab-tests
{
  "model": "gpt-4o",
  "dataset": "ds_summarization",
  "variants": [
    { "name": "concise", "system": "Summarize in 1 sentence." },
    { "name": "detailed", "system": "Summarize with key points." }
  ],
  "metrics": ["rouge_l", "bleu"]
}

# => winner: "detailed" (67% win rate, p<0.05)
```
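A p-value like the one above can come from a paired sign test over per-case wins. The sketch below shows that statistic with an exact two-sided binomial computation — an illustration of the idea, not necessarily the exact test the API runs:

```python
from math import comb

def sign_test_p(wins_a, wins_b):
    """Two-sided exact binomial (sign) test: probability of a result at
    least this lopsided under H0 that each variant wins a given case
    with p = 0.5. Tied cases are dropped before calling this."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # tail probability for counts >= k, doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# e.g. "detailed" wins 67 of 100 non-tied cases (67% win rate)
p = sign_test_p(33, 67)
```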
Get comprehensive reports with per-model scores, latency percentiles, cost analysis, and individual test case results.
```
# Get benchmark report
GET /v1/reports/bench_abc123
{
  "models": {
    "gpt-4o": {
      "bleu": 0.82, "rouge_l": 0.89,
      "latency_p50": 420, "latency_p95": 890,
      "cost_usd": 0.034,
      "tokens": { "input": 12400, "output": 6200 }
    },
    "claude-sonnet-4-20250514": {
      "bleu": 0.87, "rouge_l": 0.91,
      "latency_p50": 380, "latency_p95": 720,
      "cost_usd": 0.029,
      "tokens": { "input": 12400, "output": 5800 }
    }
  }
}
```
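Report JSON like the above is easy to post-process — for example, ranking models by quality per dollar. A sketch using field names from the sample response (the quality-per-cost ratio is an arbitrary choice for illustration, not an API feature):

```python
# subset of a /v1/reports response, as shown above
report = {
    "models": {
        "gpt-4o": {"rouge_l": 0.89, "cost_usd": 0.034},
        "claude-sonnet-4-20250514": {"rouge_l": 0.91, "cost_usd": 0.029},
    }
}

def rank_by_value(report, quality_metric="rouge_l"):
    """Sort models by quality per dollar, best first."""
    return sorted(
        report["models"].items(),
        key=lambda kv: kv[1][quality_metric] / kv[1]["cost_usd"],
        reverse=True,
    )

best_model, best_scores = rank_by_value(report)[0]
```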
See how Koder AI Eval stacks up against popular evaluation platforms.
| Feature | Koder AI Eval | Promptfoo | Braintrust | Humanloop | LangSmith |
|---|---|---|---|---|---|
| Multi-model benchmarks | ✓ | ✓ | ✓ | Partial | ✓ |
| A/B testing with statistics | ✓ | Partial | ✓ | ✓ | — |
| Built-in quality metrics (BLEU, ROUGE-L) | ✓ | ✓ | Partial | — | — |
| Dataset management (JSONL/CSV) | ✓ | ✓ | ✓ | ✓ | ✓ |
| Latency & cost tracking | ✓ | Partial | ✓ | ✓ | ✓ |
| Custom scoring functions | ✓ | ✓ | ✓ | ✓ | Partial |
| Self-hosted / open source | ✓ | ✓ | — | — | — |
| REST API | ✓ | CLI only | ✓ | ✓ | ✓ |
| Any LLM provider | ✓ | ✓ | ✓ | Partial | Partial |
Any model accessible through the Koder AI Gateway, including OpenAI GPT-4o, Anthropic Claude, Google Gemini, Meta Llama, Mistral, and any custom model endpoint.
Yes. The metric registry is extensible. You can register custom scoring functions including regex matching, JSON schema validation, and semantic similarity via embeddings.
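The registry pattern behind extensible metrics can be sketched as follows. This illustrates the concept only — the `register` decorator and scorer signature are assumptions, not the actual Koder AI Eval registration API:

```python
import re

METRICS = {}

def register(name):
    """Decorator: register a scorer (output, expected) -> float in [0, 1]."""
    def wrap(fn):
        METRICS[name] = fn
        return fn
    return wrap

@register("exact_match")
def exact_match(output, expected):
    return float(output.strip() == expected.strip())

@register("regex_match")
def regex_match(output, pattern):
    # here "expected" is interpreted as a regular expression
    return float(re.search(pattern, output) is not None)

score = METRICS["exact_match"]("Bonjour le monde", "Bonjour le monde")
```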
You define two or more prompt variants (a system prompt plus a user prompt template), pick a model and dataset, then run the test. Each variant is evaluated against every dataset entry, and results include win rates and confidence intervals.
JSONL and CSV. Each entry has an input, expected output, optional context, and tags. You can import and export in both formats via the REST API.
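The JSONL layout is one JSON object per line with the fields listed above. A round-trip sketch in Python — field names follow this page's examples:

```python
import io
import json

entries = [
    {"input": "Translate to French: Hello world",
     "expected": "Bonjour le monde", "tags": ["greeting"]},
    {"input": "Translate to French: Good morning",
     "expected": "Bonjour", "tags": ["greeting"]},
]

# export: one JSON object per line
buf = io.StringIO()
for e in entries:
    buf.write(json.dumps(e) + "\n")

# import: parse each non-empty line back into a dict
parsed = [json.loads(line) for line in buf.getvalue().splitlines() if line]
```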
Yes. The benchmark runner uses a configurable worker pool with concurrency limits to avoid overwhelming the AI gateway. Failed requests are retried with exponential backoff.
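A concurrency-limited runner with exponential-backoff retries can be sketched with the standard library. This illustrates the described behavior, not the service's internal implementation; the `call_model` stub stands in for a real gateway request:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def with_retries(fn, attempts=3, base_delay=0.01):
    """Retry fn on failure, sleeping base, 2*base, 4*base, ... between tries."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)

calls = {"n": 0}
def call_model():
    # stub: fail twice with a transient error, then succeed
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient gateway error")
    return "ok"

# the worker pool caps concurrent requests against the gateway
with ThreadPoolExecutor(max_workers=4) as pool:
    result = pool.submit(with_retries, call_model).result()
```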
Built-in metrics include exact_match, contains, BLEU (n-gram overlap), ROUGE-L (longest common subsequence), regex_match, latency (ms), cost (USD), and token_efficiency (quality per token).
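The simpler metrics have straightforward semantics, shown below as toy implementations of the intent behind exact_match, contains, and token_efficiency (BLEU and ROUGE-L need real n-gram and longest-common-subsequence machinery and are omitted):

```python
def exact_match(output, expected):
    # 1.0 if the trimmed strings are identical, else 0.0
    return float(output.strip() == expected.strip())

def contains(output, expected):
    # 1.0 if the expected text appears anywhere in the output
    return float(expected in output)

def token_efficiency(quality, output_tokens):
    """Quality per output token: rewards models that score well tersely."""
    return quality / output_tokens if output_tokens else 0.0
```

Using the report numbers above, `token_efficiency(0.91, 5800)` beats `token_efficiency(0.89, 6200)` — the higher-quality model is also the more token-efficient one in that run.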
Benchmark models, test prompts, and ship with confidence — backed by real data.