🤖 AI Explained
Emerging area 5 min read

Building an Eval Dataset

Learn to treat eval datasets as engineering artifacts: how to seed them, label them, version them, and keep them representative of real production traffic.

Layer 1: Surface

An eval dataset is not a list of examples you write once and forget. It is an engineering artifact that requires the same care as production code: versioning, review, drift management, and regular maintenance.

Most teams discover this the hard way. They build an eval set from the inputs they used to develop the system, ship to production, and find that users ask questions their eval set never anticipated. The eval passes at 94%; the real system fails 30% of real user queries. The eval set lied to them.

The problem is not that the team was careless. It is that they treated eval dataset construction as an afterthought rather than a first-class engineering task. Getting this right requires deliberate choices about where examples come from, how they are labeled, what categories they cover, and how the dataset evolves over time.

Why it matters

Your eval dataset is the ground truth your system is measured against. A bad eval set gives false confidence. It can mask critical failures, mis-prioritize engineering work, and let regressions ship to production undetected. The cost of a poor eval set is paid by your users.

Production Gotcha

Common Gotcha: Eval datasets built before launch become stale within weeks: production queries diverge from anticipated inputs faster than expected. Allocate 10% of eval maintenance time to refreshing the dataset with production query samples on a regular cadence. A dataset that accurately represented your users six months ago may now be systematically missing the query types that are causing the most failures today.

The root cause is treating the eval set as a one-time artifact. The fix is a documented refresh process that runs on a schedule: not just when something breaks.


Layer 2: Guided

Seed strategies

from dataclasses import dataclass, field
from enum import Enum

class SeedStrategy(Enum):
    HAND_CRAFTED = "hand_crafted"         # Written by engineers/domain experts
    PRODUCTION_SAMPLE = "production"      # Sampled from real user traffic
    ADVERSARIAL = "adversarial"           # Designed to probe boundaries
    LLM_GENERATED = "llm_generated"      # AI-generated, human-reviewed

@dataclass
class EvalCase:
    id: str
    input: str
    expected: str | None      # None for reference-free cases
    category: str             # query taxonomy category
    seed_strategy: SeedStrategy
    reviewer: str | None      # human who reviewed/approved this case
    created_at: str
    tags: list[str] = field(default_factory=list)
    notes: str = ""

Hand-crafted cases are the starting point. Domain experts write representative inputs covering the system’s intended use cases. These are the cases you know the system should handle. They are fast to create and easy to label, but they systematically miss the long tail.

Production samples are the most valuable source once the system is live. Sample 1–5% of real traffic, review for PII, and add cases that reveal new failure modes or represent high-traffic patterns.

Adversarial cases probe boundaries: empty inputs, extremely long inputs, inputs in unexpected languages, injection attempts, edge cases near the boundary of the specification. These require deliberate construction: they will not appear naturally.

LLM-generated cases can accelerate volume, especially for paraphrase generation. Generate variations on seed cases, then have a human reviewer verify correctness before adding them to the dataset.

def generate_paraphrases(seed_case: EvalCase, n: int = 5) -> list[str]:
    """Generate paraphrase variations of a seed input for data augmentation."""
    response = llm.chat(
        model="balanced",
        messages=[{
            "role": "user",
            "content": (
                f"Generate {n} paraphrased versions of this input that preserve "
                f"the intent but vary the phrasing, formality, and structure.\n\n"
                f"Input: {seed_case.input}\n\n"
                f"Output a JSON array of {n} strings."
            )
        }]
    )
    import json
    try:
        return json.loads(response.text)
    except Exception:
        return []

# Always review LLM-generated cases before adding to the dataset
def review_generated_cases(generated: list[str], seed: EvalCase) -> list[EvalCase]:
    """Human review step — returns only cases approved for the dataset."""
    approved = []
    for i, text in enumerate(generated):
        # In a real pipeline, this routes to a review UI
        print(f"  [{i+1}/{len(generated)}] {text}")
        decision = input("    Add to dataset? (y/n): ").strip().lower()
        if decision == "y":
            approved.append(EvalCase(
                id=f"{seed.id}-gen{i+1}",
                input=text,
                expected=seed.expected,  # inherit expected from seed
                category=seed.category,
                seed_strategy=SeedStrategy.LLM_GENERATED,
                reviewer="human",
                created_at="2026-03-24",
                tags=seed.tags + ["generated"],
            ))
    return approved

Coverage requirements

Before writing a single case, map your input space:

# Coverage taxonomy — define this before writing cases
COVERAGE_REQUIREMENTS = {
    "categories": [
        # Each category should have at least 10–20 cases
        "simple_lookup",          # direct factual question, one correct answer
        "comparison_request",     # compare two or more items
        "multi_step_reasoning",   # requires combining multiple pieces of information
        "instruction_following",  # do X, then Y, formatted as Z
        "out_of_scope",           # query the system should gracefully decline
        "ambiguous_intent",       # query that needs clarification
    ],
    "edge_cases": [
        "empty_input",
        "very_long_input",        # over 2000 characters
        "multilingual_input",     # non-English queries if your system handles them
        "special_characters",     # Unicode, code snippets, markdown
        "injection_attempt",      # adversarial prompt injection
        "policy_boundary",        # queries near the edge of acceptable content
    ],
    "minimum_per_category": 15,
    "minimum_edge_cases": 5,
    "total_minimum": 100,          # below this, statistical significance is unreliable
}

def compute_coverage_report(dataset: list[EvalCase]) -> dict:
    """Check how well the dataset covers the required taxonomy."""
    category_counts: dict[str, int] = {}
    for case in dataset:
        category_counts[case.category] = category_counts.get(case.category, 0) + 1

    gaps = []
    for category in COVERAGE_REQUIREMENTS["categories"]:
        count = category_counts.get(category, 0)
        minimum = COVERAGE_REQUIREMENTS["minimum_per_category"]
        if count < minimum:
            gaps.append({
                "category": category,
                "have": count,
                "need": minimum - count,
            })

    return {
        "total_cases": len(dataset),
        "category_counts": category_counts,
        "gaps": gaps,
        "coverage_complete": len(gaps) == 0 and len(dataset) >= COVERAGE_REQUIREMENTS["total_minimum"],
    }

Labeling strategies

How you generate expected outputs determines how trustworthy your eval scores are:

class LabelStrategy(Enum):
    HUMAN_ANNOTATION = "human"         # Human writes the expected output
    PROGRAMMATIC = "programmatic"      # Regex, schema, code execution
    LLM_GENERATED_REVIEWED = "llm_reviewed"  # LLM generates, human confirms
    CRITERIA_ONLY = "criteria"         # No expected output; judge checks criteria

# Programmatic labels — highest fidelity for structured output
def label_structured_output(case_input: str, output: str) -> bool:
    """For cases where the output must be valid JSON with required fields."""
    import json
    try:
        parsed = json.loads(output)
        required_fields = ["category", "confidence", "response"]
        return all(f in parsed for f in required_fields)
    except Exception:
        return False

# LLM-generated reference answers — use with caution
def generate_reference_answer(question: str) -> str:
    """Generate a candidate reference answer using a strong model."""
    response = llm.chat(
        model="frontier",
        messages=[{
            "role": "user",
            "content": (
                f"Provide a correct, concise reference answer to this question "
                f"for use in an evaluation dataset. Be factually precise.\n\n"
                f"Question: {question}"
            )
        }]
    )
    return response.text
    # IMPORTANT: This output must be reviewed by a human before use as ground truth.
    # LLM-generated reference answers can contain subtle errors or biases.

Rule of thumb: use programmatic labels for structured outputs; human annotation for subjective quality; LLM-generated labels only for reference answers where a human reviewer will confirm correctness before the case is added.

Dataset versioning and drift

@dataclass
class EvalDatasetVersion:
    version: str              # e.g. "2026-Q1-v3"
    cases: list[EvalCase]
    created_at: str
    change_summary: str
    baseline_score: float | None = None   # System score at time of creation

def create_new_version(
    previous: EvalDatasetVersion,
    new_cases: list[EvalCase],
    removed_case_ids: list[str],
    change_summary: str,
) -> EvalDatasetVersion:
    existing = [c for c in previous.cases if c.id not in removed_case_ids]
    return EvalDatasetVersion(
        version=bump_version(previous.version),
        cases=existing + new_cases,
        created_at="2026-03-24",
        change_summary=change_summary,
        baseline_score=None,  # Will be set after running the eval suite
    )

# Refresh cadence guidance
REFRESH_TRIGGERS = [
    "Weekly: sample 50 production queries, add novel failure modes to dataset",
    "Monthly: full coverage review — are all taxonomy categories still represented?",
    "On major model upgrade: verify all cases still have correct expected outputs",
    "On product change: add cases for new features or changed behavior",
    "After incident: add the failing query immediately",
]

Anti-patterns to avoid

Happy-path only: An eval set with no adversarial cases, no edge cases, and no out-of-scope queries will consistently report high accuracy while providing zero protection against the failures that matter most.

Eval set overfitting: If you iterate on your system prompt until it passes all eval cases, and the eval cases are the same inputs you used during development, your score is inflated. The system has been optimized for those exact inputs. Hold out a test partition that is never used for development.

def split_eval_dataset(
    dataset: list[EvalCase],
    dev_ratio: float = 0.7,
    test_ratio: float = 0.3,
) -> tuple[list[EvalCase], list[EvalCase]]:
    """
    dev: used during development to iterate on prompts and catch regressions.
    test: held out completely; only used for final evaluation before release.
    Never optimize the system against the test partition.
    """
    import random
    shuffled = random.sample(dataset, len(dataset))
    split_idx = int(len(shuffled) * dev_ratio)
    return shuffled[:split_idx], shuffled[split_idx:]

Layer 3: Deep Dive

Minimum viable eval set sizing

Statistical confidence is a hard constraint on eval set design. A small eval set produces noisy results where sampling variation can obscure real regressions.

Eval set sizeDetectable regression (95% confidence)Practical use
20 casesOnly detects regressions of 20%+Too small for anything meaningful
50 casesDetects regressions of about 12%+Minimum for simple classification tasks
100 casesDetects regressions of about 8%+Workable for most features
200 casesDetects regressions of about 6%+Recommended for production gate
500+ casesDetects regressions of about 4%+Required for high-stakes deployment

The rule of thumb: if your baseline accuracy is 90% and you need to detect a 5-point regression to 85%, you need at least 200 cases before that drop is statistically distinguishable from sampling noise.

Handling PII in production samples

Production query samples are the most valuable source of eval cases, but they frequently contain personally identifiable information. A structured scrubbing policy is mandatory before any production data enters the eval dataset.

import re

def scrub_pii(text: str) -> str:
    """
    Minimal PII scrubbing for eval dataset construction.
    In production, use a dedicated PII detection service.
    """
    # Email addresses
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b',
                  '[EMAIL]', text)
    # Phone numbers (US format)
    text = re.sub(r'\b(\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b',
                  '[PHONE]', text)
    # Social security numbers
    text = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', text)
    # Credit card numbers
    text = re.sub(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b', '[CC]', text)
    return text

def sample_production_queries(
    query_log: list[dict],
    sample_size: int = 50,
    scrub: bool = True,
) -> list[str]:
    import random
    sample = random.sample(query_log, min(sample_size, len(query_log)))
    queries = [q["text"] for q in sample]
    if scrub:
        queries = [scrub_pii(q) for q in queries]
    return queries

Distribution representativeness

An eval dataset should mirror the distribution of production queries: not just cover all categories, but cover them in proportion to how often they occur. If 60% of your production queries are simple lookups and your eval set is 60% complex multi-step cases, your reported accuracy will not predict production performance.

Track the distribution of your eval set alongside your query taxonomy:

def check_distribution_alignment(
    eval_dataset: list[EvalCase],
    production_distribution: dict[str, float],  # category -> fraction
) -> dict:
    eval_counts: dict[str, int] = {}
    for case in eval_dataset:
        eval_counts[case.category] = eval_counts.get(case.category, 0) + 1

    total = len(eval_dataset)
    eval_distribution = {k: v / total for k, v in eval_counts.items()}

    misaligned = []
    for category, prod_fraction in production_distribution.items():
        eval_fraction = eval_distribution.get(category, 0)
        deviation = abs(eval_fraction - prod_fraction)
        if deviation > 0.1:  # more than 10 percentage points off
            misaligned.append({
                "category": category,
                "production": prod_fraction,
                "eval": eval_fraction,
                "deviation": deviation,
            })

    return {"misaligned_categories": misaligned, "aligned": len(misaligned) == 0}

Further reading

✏ Suggest an edit on GitHub

Building an Eval Dataset: Check your understanding

Q1

A team launches their LLM product, builds a 200-case eval set from inputs they used during development, and achieves 93% accuracy. Two weeks post-launch, users report that a common query pattern is failing consistently. The team checks their eval set and finds no cases covering this pattern. What caused the gap?

Q2

A team wants to use an LLM to generate reference answers for their eval dataset at scale. They plan to use these LLM-generated answers directly as ground truth without human review. What is the risk?

Q3

A team's eval set has 200 cases distributed evenly across 10 categories (20 per category). Their production query distribution is 60% simple lookups, 15% comparisons, and 25% distributed across 8 other categories. An engineer proposes keeping the even distribution because it gives 'fair' coverage of all categories. What is the problem with this reasoning?

Q4

A team iteratively improves their system prompt until it passes 95% of their eval cases. They then promote to production and observe poor quality on real traffic. What anti-pattern does this describe?

Q5

A production LLM system has been running for 6 months. The team has been regularly adding production query samples to their eval set. A new engineer suggests removing old hand-crafted cases that haven't failed in 6 months to keep the eval set lean. Should they remove these cases?