🤖 AI Explained
Emerging area 5 min read

What Makes LLM Evaluation Hard

Learn why LLM eval is structurally different from traditional ML testing, what the three axes of eval design are, and how to build a mental model for the rest of the track.

Layer 1: Surface

Traditional software is deterministic. A function that converts Fahrenheit to Celsius gives the same answer every time. You can write a unit test that either passes or fails. LLM outputs are not deterministic, and even when they are consistent, “correctness” is often a matter of degree.

This structural difference means the entire evaluation toolbox from classical software engineering needs to be rebuilt. You cannot assertEqual a summary. You cannot unit-test a chatbot response. You need a different discipline.

LLM evaluation is hard for three specific reasons:

No single ground truth. For most tasks, summarization, question answering, creative generation, there is no one right answer. Multiple outputs can all be correct. A traditional pass/fail comparison against a reference answer will give misleading results.

Open-ended output space. A classification model can output one of N labels. An LLM can output arbitrary text. The number of possible failure modes is effectively unbounded, which means you cannot enumerate them all up front.

Context-dependent quality. An output that is excellent for a technical audience may be terrible for a general consumer. An answer that is correct in English may be wrong if your user’s query was in another language. Quality is not a property of the output alone: it depends on who is reading it and why.

Why it matters

Without a clear eval strategy, you are shipping blind. Prompt changes, model upgrades, RAG index updates, and tool schema changes all affect output quality without changing any code in an obvious way. The only way to know if a change made things better or worse is to measure it systematically.

Production Gotcha

Common Gotcha: A single aggregate accuracy number hides the distribution of failures, a system that's 95% correct may fail systematically on a specific query category that matters most to your users. Always segment eval results by query type, user segment, or input length before reporting a headline number. An 80% average with 0% on your highest-value query category is not an 80% system, it is a broken system for your most important users.

This happens because aggregate metrics are easy to report and hard to argue with. Segmentation requires more work but is the only way to surface the failure patterns that matter.


Layer 2: Guided

The three axes of eval design

Every eval decision you make sits on one of three axes:

1. What to measure

DimensionWhat it capturesHow you measure it
CorrectnessDoes the output contain the right information?Reference comparison, LLM judge
SafetyDoes the output avoid harmful content?Policy classifier, human review
CostHow many tokens does this feature consume?Token counter, cost attribution
LatencyHow long does the user wait?Timing instrumentation
User satisfactionDo users find the output useful?Thumbs up/down, session signals

Do not try to capture all five at once when you are starting out. Correctness and safety are table stakes; add cost and latency once those are stable.

2. How to measure it

from dataclasses import dataclass
from enum import Enum
from typing import Callable

class EvalMethod(Enum):
    EXACT_MATCH = "exact_match"       # output == expected; works for classification
    PROGRAMMATIC = "programmatic"     # schema check, regex, code execution
    LLM_JUDGE = "llm_judge"          # strong model grades the output
    HUMAN = "human"                   # human annotator reviews

@dataclass
class EvalCase:
    id: str
    input: str
    expected: str | None       # None for reference-free evals
    method: EvalMethod
    rubric: str | None = None  # used for LLM_JUDGE cases

def run_eval_case(case: EvalCase, system_fn: Callable[[str], str]) -> dict:
    output = system_fn(case.input)

    if case.method == EvalMethod.EXACT_MATCH:
        passed = output.strip().lower() == (case.expected or "").strip().lower()
        return {"id": case.id, "passed": passed, "output": output}

    if case.method == EvalMethod.PROGRAMMATIC:
        # Caller provides the rubric as a callable check function name
        # For this example, we check JSON validity
        try:
            import json
            json.loads(output)
            return {"id": case.id, "passed": True, "output": output}
        except Exception:
            return {"id": case.id, "passed": False, "output": output}

    if case.method == EvalMethod.LLM_JUDGE:
        score = judge_output(case.input, output, case.rubric or "")
        return {"id": case.id, "passed": score >= 3, "score": score, "output": output}

    raise ValueError(f"Method {case.method} requires external review")

def judge_output(user_input: str, output: str, rubric: str) -> int:
    """Returns a score from 1 to 5 using a strong model as judge."""
    response = llm.chat(
        model="frontier",
        messages=[{
            "role": "user",
            "content": (
                f"Rate this response 1–5 based on the rubric.\n\n"
                f"Rubric: {rubric}\n\n"
                f"User input: {user_input}\n\n"
                f"Response: {output}\n\n"
                f"Output only an integer 1–5."
            )
        }]
    )
    try:
        return int(response.text.strip())
    except ValueError:
        return 1

3. When to measure it

# The three eval windows — each catches different failure classes

EVAL_TIMING = {
    "offline_pre_deploy": {
        "when": "Before every merge to main",
        "what": "Regression suite — known-good cases, edge cases, red-team cases",
        "latency_budget": "Under 10 minutes for a full CI gate",
        "catches": "Regressions from prompt changes, model upgrades, schema changes",
    },
    "offline_experiment": {
        "when": "During development, before opening a PR",
        "what": "Full eval suite including quality metrics",
        "latency_budget": "Under 30 minutes; run in background",
        "catches": "Quality regressions before they become PRs",
    },
    "online_production": {
        "when": "Continuously, on live traffic",
        "what": "Sample-and-judge pipeline, user feedback signals, cost metrics",
        "latency_budget": "Async — does not block requests",
        "catches": "Distribution shift, long-tail failures, model drift over time",
    },
}

Reference-based vs reference-free evaluation

Reference-based eval compares the system output against a known-good answer. It is high-precision when the reference is correct, but requires maintaining a labelled dataset. It also penalizes paraphrases: a correct answer phrased differently from the reference may score poorly.

Reference-free eval judges the output without a reference. An LLM judge, given only the input and the output, can assess whether the output is factually plausible, complete, and appropriately toned: without needing a gold answer. This scales better but is less precise.

def reference_based_score(output: str, reference: str) -> float:
    """Simple token-overlap score (F1). Better metrics exist — see module 6.3."""
    output_tokens = set(output.lower().split())
    reference_tokens = set(reference.lower().split())
    if not reference_tokens:
        return 0.0
    overlap = output_tokens & reference_tokens
    precision = len(overlap) / len(output_tokens) if output_tokens else 0
    recall = len(overlap) / len(reference_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def reference_free_score(user_input: str, output: str) -> int:
    """LLM judge without a reference answer — scores 1–5."""
    response = llm.chat(
        model="frontier",
        messages=[{
            "role": "user",
            "content": (
                f"The user asked: {user_input}\n\n"
                f"The system responded: {output}\n\n"
                f"Rate the response 1 (very poor) to 5 (excellent) on accuracy and helpfulness. "
                f"Output only an integer."
            )
        }]
    )
    try:
        return int(response.text.strip())
    except ValueError:
        return 1

A brief taxonomy of eval types

Eval typePurposeWhen to run
Unit evalTest one component (prompt, retriever, parser) in isolationDuring development
Integration evalTest the full pipeline end-to-endBefore every deploy
Regression evalEnsure a change didn’t break previously passing casesOn every PR
A/B evalCompare two versions on the same inputsWhen promoting a change
Red-team evalDiscover failure modes via adversarial inputsBefore major releases

The rest of this track covers each in depth. This module has given you the vocabulary and the mental model. The key insight to carry forward: there is no single right answer for how to evaluate an LLM system: the method, timing, and coverage must be designed for the specific system and the specific quality dimensions that matter to your users.


Layer 3: Deep Dive

Why aggregate metrics are dangerous

Consider a system that handles five query categories with the following correctness rates:

CategoryCorrectnessQuery share
General FAQ99%50%
Product lookup98%30%
Billing inquiry95%15%
Account closure20%4%
Escalation routing10%1%

Weighted aggregate: approximately 95%: which looks excellent. But the two lowest-performing categories are precisely the highest-stakes ones: account closure and escalation routing. A user who needs to close their account is failing 80% of the time. No aggregate metric reveals this.

The discipline of segmented evaluation requires:

  1. Taxonomy-first thinking: Before writing a single eval case, categorize your input space. What are the distinct query types? What are the distinct user segments? What are the distinct output dimensions you care about?

  2. Per-category baselines: Set a threshold not just for the aggregate but for each category. If account closure drops below 80%, that is a blocking failure regardless of the aggregate.

  3. Category drift monitoring: Track the distribution of queries over time. A category that was rare at launch may become dominant as user behavior changes.

The eval coverage problem

You cannot enumerate all failure modes before deployment. This is not a limitation of your process: it is a structural property of open-ended generation. The space of possible inputs is infinite; the space of possible failure modes is unbounded.

What you can do:

  • Cover known categories exhaustively: the cases you can anticipate should be well-represented
  • Sample from the long tail: production queries surface failure modes that development testing never will
  • Use adversarial construction: systematically probe boundaries (empty inputs, very long inputs, multilingual inputs, injected instructions)
  • Track uncovered failures: when something fails in production, add it to the eval set immediately

The eval coverage problem does not have a solution; it has a practice. The practice is continuous: add cases, measure coverage, add more cases.

Eval design decisions table

DecisionOptionsWhen to use each
Ground truth sourceHuman annotationWhen quality is subjective and stakes are high
LLM-generated + human reviewWhen volume is high and you want speed with oversight
Programmatic (schema, regex)When output has a well-defined structure
Judge modelSame model being evaluatedAvoid: self-favouritism bias
Stronger model of same familyAcceptable for quality tasks
Different model familyPreferred for safety and bias auditing
Scoring scaleBinary pass/failClassification, structured output tasks
1–5 LikertQuality dimensions where degree matters
Pairwise comparisonWhen absolute scales are hard to calibrate

Further reading

✏ Suggest an edit on GitHub

What Makes LLM Evaluation Hard: Check your understanding

Q1

A team builds an LLM-powered support chatbot and reports 95% accuracy on their eval set. A product manager asks for a breakdown by query type. The team discovers that account closure queries, which represent 4% of volume but are the highest-stakes category, score only 18% correct. What does this reveal about the reported 95% accuracy?

Q2

A team is evaluating whether to use exact match, LLM-as-judge, or semantic similarity for their new FAQ chatbot. The chatbot answers open-ended customer questions in natural language. Which method is most appropriate for measuring correctness?

Q3

A team runs their offline eval suite before every deploy. They notice that failures in production often involve query types not present in the eval set. What structural property of LLM evaluation does this illustrate?

Q4

A team needs to evaluate their LLM-based code generation feature. The model is supposed to produce Python functions that pass a given set of test cases. Which eval method is most appropriate?

Q5

Which two eval types are complementary rather than substitutes: neither alone is sufficient for a production system?