Model benchmarking, A/B testing, prompt evaluation, and quality metrics. Make data-driven decisions about which model and prompt work best for your use case.
```
# Run a comparative benchmark
POST /v1/benchmarks
{
  "suite": "translation-quality",
  "models": ["gpt-4o", "claude-sonnet-4-20250514"],
  "metrics": ["bleu", "rouge_l", "exact_match"],
  "dataset": "ds_translation_en_fr"
}

# Response — per-model scores
{
  "status": "completed",
  "results": {
    "gpt-4o": { "bleu": 0.82, "rouge_l": 0.89 },
    "claude-sonnet-4-20250514": { "bleu": 0.87, "rouge_l": 0.91 }
  },
  "winner": "claude-sonnet-4-20250514"
}
```
Everything you need to evaluate AI models systematically and make confident decisions.
Organize test cases into reusable suites with inputs, expected outputs, and metadata for structured evaluation.
Run the same suite against multiple models simultaneously. Compare quality, latency, token usage, and cost side by side.
Compare prompt variants on the same model with statistical confidence. Get win rates and confidence intervals.
Built-in BLEU, ROUGE-L, exact match, contains, regex match, latency, cost, and token efficiency scoring.
Import and export evaluation datasets in JSONL or CSV. Tag entries for targeted evaluation runs.
Detailed per-model summaries with average metrics, latency percentiles, cost breakdown, and individual case results.
Create structured test suites with inputs, expected outputs, and the metrics you care about. Reuse them across benchmarks and models.
```
# Create a test suite
POST /v1/suites
{
  "name": "Translation Quality",
  "testCases": [
    {
      "input": "Translate to French: Hello world",
      "expected": "Bonjour le monde",
      "tags": ["greeting"]
    },
    {
      "input": "Translate to French: Good morning",
      "expected": "Bonjour",
      "tags": ["greeting"]
    }
  ],
  "metrics": ["bleu", "rouge_l", "exact_match"]
}
```
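For suites with many cases, the payload can be assembled programmatically before posting it. A minimal sketch — the `/v1/suites` request shape above is the only assumed contract, and the `build_suite` helper is hypothetical, not part of any SDK:

```python
import json

def build_suite(name, pairs, metrics, tags=None):
    """Assemble a /v1/suites payload from (input, expected) pairs."""
    return {
        "name": name,
        "testCases": [
            {"input": i, "expected": e, "tags": tags or []}
            for i, e in pairs
        ],
        "metrics": metrics,
    }

suite = build_suite(
    "Translation Quality",
    [("Translate to French: Hello world", "Bonjour le monde"),
     ("Translate to French: Good morning", "Bonjour")],
    ["bleu", "rouge_l", "exact_match"],
    tags=["greeting"],
)
body = json.dumps(suite, indent=2)  # ready to POST to /v1/suites
```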
Find the best prompt for your use case. Compare multiple variants on the same model and dataset with statistical confidence.
```
# A/B test two prompt strategies
POST /v1/ab-tests
{
  "model": "gpt-4o",
  "dataset": "ds_summarization",
  "variants": [
    { "name": "concise", "system": "Summarize in 1 sentence." },
    { "name": "detailed", "system": "Summarize with key points." }
  ],
  "metrics": ["rouge_l", "bleu"]
}

# => winner: "detailed" (67% win rate, p<0.05)
```
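A p-value like the one above can come from a paired sign test over per-case wins. The sketch below shows that statistic with an exact two-sided binomial computation — an illustration of the idea, not necessarily the exact test the API runs:

```python
from math import comb

def sign_test_p(wins_a, wins_b):
    """Two-sided exact binomial (sign) test: probability of a result at
    least this lopsided under H0 that each variant wins a given case
    with p = 0.5. Tied cases are dropped before calling this."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # tail probability for counts >= k, doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# e.g. "detailed" wins 67 of 100 non-tied cases (67% win rate)
p = sign_test_p(33, 67)
```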
Get comprehensive reports with per-model scores, latency percentiles, cost analysis, and individual test case results.
```
# Get benchmark report
GET /v1/reports/bench_abc123
{
  "models": {
    "gpt-4o": {
      "bleu": 0.82, "rouge_l": 0.89,
      "latency_p50": 420, "latency_p95": 890,
      "cost_usd": 0.034,
      "tokens": { "input": 12400, "output": 6200 }
    },
    "claude-sonnet-4-20250514": {
      "bleu": 0.87, "rouge_l": 0.91,
      "latency_p50": 380, "latency_p95": 720,
      "cost_usd": 0.029,
      "tokens": { "input": 12400, "output": 5800 }
    }
  }
}
```
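Report JSON like the above is easy to post-process — for example, ranking models by quality per dollar. A sketch using field names from the sample response (the quality-per-cost ratio is an arbitrary choice for illustration, not an API feature):

```python
# subset of a /v1/reports response, as shown above
report = {
    "models": {
        "gpt-4o": {"rouge_l": 0.89, "cost_usd": 0.034},
        "claude-sonnet-4-20250514": {"rouge_l": 0.91, "cost_usd": 0.029},
    }
}

def rank_by_value(report, quality_metric="rouge_l"):
    """Sort models by quality per dollar, best first."""
    return sorted(
        report["models"].items(),
        key=lambda kv: kv[1][quality_metric] / kv[1]["cost_usd"],
        reverse=True,
    )

best_model, best_scores = rank_by_value(report)[0]
```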
See how Koder AI Eval stacks up against popular evaluation platforms.
| Feature | Koder AI Eval | Promptfoo | Braintrust | Humanloop | LangSmith |
|---|---|---|---|---|---|
| Multi-model benchmarks | ✓ | ✓ | ✓ | Partial | ✓ |
| A/B testing with statistics | ✓ | Partial | ✓ | ✓ | — |
| Built-in quality metrics (BLEU, ROUGE-L) | ✓ | ✓ | Partial | — | — |
| Dataset management (JSONL/CSV) | ✓ | ✓ | ✓ | ✓ | ✓ |
| Latency & cost tracking | ✓ | Partial | ✓ | ✓ | ✓ |
| Custom scoring functions | ✓ | ✓ | ✓ | ✓ | Partial |
| Self-hosted / open source | ✓ | ✓ | — | — | — |
| REST API | ✓ | CLI only | ✓ | ✓ | ✓ |
| Any LLM provider | ✓ | ✓ | ✓ | Partial | Partial |
Any model accessible through the Koder AI Gateway, including OpenAI GPT-4o, Anthropic Claude, Google Gemini, Meta Llama, Mistral, and any custom model endpoint.
Yes. The metric registry is extensible. You can register custom scoring functions including regex matching, JSON schema validation, and semantic similarity via embeddings.
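The registry pattern behind extensible metrics can be sketched as follows. This illustrates the concept only — the `register` decorator and scorer signature are assumptions, not the actual Koder AI Eval registration API:

```python
import re

METRICS = {}

def register(name):
    """Decorator: register a scorer (output, expected) -> float in [0, 1]."""
    def wrap(fn):
        METRICS[name] = fn
        return fn
    return wrap

@register("exact_match")
def exact_match(output, expected):
    return float(output.strip() == expected.strip())

@register("regex_match")
def regex_match(output, pattern):
    # here "expected" is interpreted as a regular expression
    return float(re.search(pattern, output) is not None)

score = METRICS["exact_match"]("Bonjour le monde", "Bonjour le monde")
```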
You define two or more prompt variants (a system prompt plus a user prompt template), pick a model and dataset, then run the test. Each variant is evaluated against every dataset entry, and results include win rates and confidence intervals.
JSONL and CSV. Each entry has an input, expected output, optional context, and tags. You can import and export in both formats via the REST API.
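The JSONL layout is one JSON object per line with the fields listed above. A round-trip sketch in Python — field names follow this page's examples:

```python
import io
import json

entries = [
    {"input": "Translate to French: Hello world",
     "expected": "Bonjour le monde", "tags": ["greeting"]},
    {"input": "Translate to French: Good morning",
     "expected": "Bonjour", "tags": ["greeting"]},
]

# export: one JSON object per line
buf = io.StringIO()
for e in entries:
    buf.write(json.dumps(e) + "\n")

# import: parse each non-empty line back into a dict
parsed = [json.loads(line) for line in buf.getvalue().splitlines() if line]
```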
Yes. The benchmark runner uses a configurable worker pool with concurrency limits to avoid overwhelming the AI gateway. Failed requests are retried with exponential backoff.
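A concurrency-limited runner with exponential-backoff retries can be sketched with the standard library. This illustrates the described behavior, not the service's internal implementation; the `call_model` stub stands in for a real gateway request:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def with_retries(fn, attempts=3, base_delay=0.01):
    """Retry fn on failure, sleeping base, 2*base, 4*base, ... between tries."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)

calls = {"n": 0}
def call_model():
    # stub: fail twice with a transient error, then succeed
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient gateway error")
    return "ok"

# the worker pool caps concurrent requests against the gateway
with ThreadPoolExecutor(max_workers=4) as pool:
    result = pool.submit(with_retries, call_model).result()
```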
Built-in metrics include exact_match, contains, BLEU (n-gram overlap), ROUGE-L (longest common subsequence), regex_match, latency (ms), cost (USD), and token_efficiency (quality per token).
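The simpler metrics have straightforward semantics, shown below as toy implementations of the intent behind exact_match, contains, and token_efficiency (BLEU and ROUGE-L need real n-gram and longest-common-subsequence machinery and are omitted):

```python
def exact_match(output, expected):
    # 1.0 if the trimmed strings are identical, else 0.0
    return float(output.strip() == expected.strip())

def contains(output, expected):
    # 1.0 if the expected text appears anywhere in the output
    return float(expected in output)

def token_efficiency(quality, output_tokens):
    """Quality per output token: rewards models that score well tersely."""
    return quality / output_tokens if output_tokens else 0.0
```

Using the report numbers above, `token_efficiency(0.91, 5800)` beats `token_efficiency(0.89, 6200)` — the higher-quality model is also the more token-efficient one in that run.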
Benchmark models, test prompts, and ship with confidence — backed by real data.