Data-Driven AI Decisions

Meet AI Eval

Model benchmarking, A/B testing, prompt evaluation, and quality metrics. Make data-driven decisions about which model and prompt work best for your use case.

# Run a comparative benchmark
POST /v1/benchmarks
{
  "suite": "translation-quality",
  "models": ["gpt-4o", "claude-sonnet-4-20250514"],
  "metrics": ["bleu", "rouge_l", "exact_match"],
  "dataset": "ds_translation_en_fr"
}

# Response — per-model scores
{
  "status": "completed",
  "results": {
    "gpt-4o":          { "bleu": 0.82, "rouge_l": 0.89 },
    "claude-sonnet-4": { "bleu": 0.87, "rouge_l": 0.91 }
  },
  "winner": "claude-sonnet-4"
}
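The winner field above can be reproduced client-side from the per-model scores. A minimal sketch, assuming all metrics are higher-is-better and equally weighted (the actual tie-breaking rule used by the API is not documented here):

```python
# Benchmark response scores, as returned above.
results = {
    "gpt-4o":          {"bleu": 0.82, "rouge_l": 0.89},
    "claude-sonnet-4": {"bleu": 0.87, "rouge_l": 0.91},
}

def pick_winner(results):
    # Average each model's metric scores and take the highest.
    # Assumption: every metric is higher-is-better and weighted equally.
    return max(results, key=lambda m: sum(results[m].values()) / len(results[m]))

print(pick_winner(results))  # claude-sonnet-4
```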

Features

Everything you need to evaluate AI models systematically and make confident decisions.

📝
Test Suites

Organize test cases into reusable suites with inputs, expected outputs, and metadata for structured evaluation.

📈
Benchmarks

Run the same suite against multiple models simultaneously. Compare quality, latency, token usage, and cost side by side.

🧪
A/B Testing

Compare prompt variants on the same model with statistical confidence. Get win rates and confidence intervals.

🎯
Metrics

Built-in BLEU, ROUGE-L, exact match, contains, regex match, latency, cost, and token efficiency scoring.

🗃
Datasets

Import and export evaluation datasets in JSONL or CSV. Tag entries for targeted evaluation runs.

📊
Reports

Detailed per-model summaries with average metrics, latency percentiles, cost breakdown, and individual case results.

Define Test Suites

Create structured test suites with inputs, expected outputs, and the metrics you care about. Reuse them across benchmarks and models.

  • Multiple test cases per suite
  • Expected outputs for automated scoring
  • Tags and metadata for filtering
  • Import from JSONL or CSV datasets
# Create a test suite
POST /v1/suites
{
  "name": "Translation Quality",
  "testCases": [
    {
      "input": "Translate to French: Hello world",
      "expected": "Bonjour le monde",
      "tags": ["greeting"]
    },
    {
      "input": "Translate to French: Good morning",
      "expected": "Bonjour",
      "tags": ["greeting"]
    }
  ],
  "metrics": ["bleu", "rouge_l", "exact_match"]
}
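Once a suite exists, automated scoring compares model outputs against the expected values. A sketch of exact-match scoring over the suite above (the normalization rule and scoring loop are assumptions; the API may normalize differently):

```python
# Test cases mirror the suite created above.
test_cases = [
    {"input": "Translate to French: Hello world", "expected": "Bonjour le monde"},
    {"input": "Translate to French: Good morning", "expected": "Bonjour"},
]
# Hypothetical model outputs, one per test case.
outputs = ["Bonjour le monde", "Bon matin"]

def exact_match(expected, actual):
    # Case- and surrounding-whitespace-insensitive string equality.
    return float(expected.strip().lower() == actual.strip().lower())

scores = [exact_match(tc["expected"], out) for tc, out in zip(test_cases, outputs)]
print(sum(scores) / len(scores))  # 0.5
```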

A/B Test Prompts

Find the best prompt for your use case. Compare multiple variants on the same model and dataset with statistical confidence.

  • Test 2+ prompt variants simultaneously
  • Each variant has its own system + user prompt
  • Win rates and confidence intervals
  • Per-variant latency and cost breakdown
# A/B test two prompt strategies
POST /v1/ab-tests
{
  "model": "gpt-4o",
  "dataset": "ds_summarization",
  "variants": [
    {
      "name": "concise",
      "system": "Summarize in 1 sentence."
    },
    {
      "name": "detailed",
      "system": "Summarize with key points."
    }
  ],
  "metrics": ["rouge_l", "bleu"]
}
# => winner: "detailed" (67% win rate, p<0.05)
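The win rate and confidence interval in the result comment can be illustrated with per-entry win/loss outcomes. This sketch uses a normal-approximation (Wald) interval; the statistical test the API actually runs is not specified here, and the 20-of-30 outcomes are invented for illustration:

```python
import math

# Hypothetical per-entry outcomes: True where "detailed" beat "concise".
wins = [True] * 20 + [False] * 10

def win_rate_ci(wins, z=1.96):
    """Win rate with a normal-approximation 95% confidence interval."""
    n = len(wins)
    p = sum(wins) / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p, (max(0.0, p - half), min(1.0, p + half))

rate, (lo, hi) = win_rate_ci(wins)
print(round(rate, 2))  # 0.67
```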

Detailed Reports

Get comprehensive reports with per-model scores, latency percentiles, cost analysis, and individual test case results.

  • Aggregate scores per model
  • P50, P95, P99 latency percentiles
  • Cost breakdown by model and tokens
  • Export results as JSON or CSV
# Get benchmark report
GET /v1/reports/bench_abc123

{
  "models": {
    "gpt-4o": {
      "bleu": 0.82, "rouge_l": 0.89,
      "latency_p50": 420, "latency_p95": 890,
      "cost_usd": 0.034,
      "tokens": { "input": 12400, "output": 6200 }
    },
    "claude-sonnet-4": {
      "bleu": 0.87, "rouge_l": 0.91,
      "latency_p50": 380, "latency_p95": 720,
      "cost_usd": 0.029,
      "tokens": { "input": 12400, "output": 5800 }
    }
  }
}
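The latency percentiles in the report can be computed from raw per-request latencies. A sketch using the nearest-rank method (the report's internal percentile definition may differ; the sample latencies are invented):

```python
import math

# Hypothetical per-request latencies in milliseconds, sorted ascending.
latencies = sorted([380, 395, 410, 420, 450, 510, 640, 720, 850, 890])

def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest value covering p% of samples."""
    k = max(0, math.ceil(p / 100 * len(sorted_values)) - 1)
    return sorted_values[k]

p50, p95, p99 = (percentile(latencies, p) for p in (50, 95, 99))
print(p50)  # 450
```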

How It Compares

See how Koder AI Eval stacks up against popular evaluation platforms.

[Feature comparison table: Koder AI Eval vs. Promptfoo, Braintrust, Humanloop, and LangSmith, covering multi-model benchmarks, A/B testing with statistics, built-in quality metrics (BLEU, ROUGE-L), dataset management (JSONL/CSV), latency & cost tracking, custom scoring functions, self-hosted / open source availability, REST API (vs. CLI only), and support for any LLM provider.]

Frequently Asked Questions

Which models can I evaluate?

Any model accessible through the Koder AI Gateway, including OpenAI GPT-4o, Anthropic Claude, Google Gemini, Meta Llama, Mistral, and any custom model endpoint.

Can I define custom metrics?

Yes. The metric registry is extensible. You can register custom scoring functions including regex matching, JSON schema validation, and semantic similarity via embeddings.
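An extensible metric registry can be sketched as a name-to-function mapping. The register/score interface below is an assumption for illustration, not the product's actual API:

```python
import re

# Hypothetical registry: metric name -> scoring fn (expected, actual) -> float.
METRICS = {}

def register_metric(name):
    """Decorator that registers a scoring function under `name`."""
    def wrap(fn):
        METRICS[name] = fn
        return fn
    return wrap

@register_metric("regex_match")
def regex_match(expected, actual):
    # `expected` is treated as a regex pattern applied to the model output.
    return 1.0 if re.search(expected, actual) else 0.0

score = METRICS["regex_match"](r"\bBonjour\b", "Bonjour le monde")
print(score)  # 1.0
```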

How does A/B testing work?

You define 2 or more prompt variants (system prompt + user prompt template), pick a model and dataset, then run the test. Each variant is evaluated against every dataset entry, and results include win rates and confidence intervals.

What dataset formats are supported?

JSONL and CSV. Each entry has an input, expected output, optional context, and tags. You can import and export in both formats via the REST API.

Do benchmarks run concurrently?

Yes. The benchmark runner uses a configurable worker pool with concurrency limits to avoid overwhelming the AI gateway. Failed requests are retried with exponential backoff.
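A bounded worker pool with retry and exponential backoff can be sketched as follows. This is an illustrative pattern, not the runner's actual implementation; the worker function and case list are stand-ins:

```python
import concurrent.futures
import time

def with_retries(fn, attempts=3, base_delay=0.01):
    """Call fn, retrying failures with exponential backoff between attempts."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 0.01s, 0.02s, ...

def run_benchmark(cases, worker, max_workers=4):
    # Bounded thread pool: at most max_workers requests in flight at once.
    # map() preserves input order in its results.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda c: with_retries(lambda: worker(c)), cases))

results = run_benchmark(range(8), lambda c: c * 2)
print(results)  # [0, 2, 4, 6, 8, 10, 12, 14]
```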

Which metrics are built in?

Built-in metrics include exact_match, contains, BLEU (n-gram overlap), ROUGE-L (longest common subsequence), regex_match, latency (ms), cost (USD), and token_efficiency (quality per token).
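ROUGE-L, the longest-common-subsequence metric mentioned above, is small enough to sketch in full. This version tokenizes on whitespace and reports the F1 score with equal weight on precision and recall; the built-in implementation may tokenize or weight differently:

```python
def lcs_len(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(expected, actual):
    """ROUGE-L F1 over whitespace tokens (precision/recall weighted equally)."""
    ref, hyp = expected.split(), actual.split()
    lcs = lcs_len(ref, hyp)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(hyp), lcs / len(ref)
    return 2 * p * r / (p + r)

print(rouge_l("Bonjour le monde", "Bonjour le monde"))  # 1.0
```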

Make every AI decision count

Benchmark models, test prompts, and ship with confidence — backed by real data.