What Makes LLM Evaluation Hard: AI Explained

Layer 1: Surface

Traditional software is deterministic. A function that converts Fahrenheit to Celsius gives the same answer every time. You can write a unit test that either passes or fails. LLM outputs are not deterministic, and even when they are consistent, “correctness” is often a matter of degree.

This structural difference means the entire evaluation toolbox from classical software engineering needs to be rebuilt. You cannot assertEqual a summary. You cannot unit-test a chatbot response. You need a different discipline.

LLM evaluation is hard for three specific reasons:

No single ground truth. For most tasks, summarization, question answering, creative generation, there is no one right answer. Multiple outputs can all be correct. A traditional pass/fail comparison against a reference answer will give misleading results.

Open-ended output space. A classification model can output one of N labels. An LLM can output arbitrary text. The number of possible failure modes is effectively unbounded, which means you cannot enumerate them all up front.

Context-dependent quality. An output that is excellent for a technical audience may be terrible for a general consumer. An answer that is correct in English may be wrong if your user’s query was in another language. Quality is not a property of the output alone: it depends on who is reading it and why.

Why it matters

Without a clear eval strategy, you are shipping blind. Prompt changes, model upgrades, RAG index updates, and tool schema changes all affect output quality without changing any code in an obvious way. The only way to know if a change made things better or worse is to measure it systematically.

Production Gotcha

Common Gotcha: A single aggregate accuracy number hides the distribution of failures, a system that's 95% correct may fail systematically on a specific query category that matters most to your users. Always segment eval results by query type, user segment, or input length before reporting a headline number. An 80% average with 0% on your highest-value query category is not an 80% system, it is a broken system for your most important users.

This happens because aggregate metrics are easy to report and hard to argue with. Segmentation requires more work but is the only way to surface the failure patterns that matter.

Layer 2: Guided

The three axes of eval design

Every eval decision you make sits on one of three axes:

1. What to measure

Dimension	What it captures	How you measure it
Correctness	Does the output contain the right information?	Reference comparison, LLM judge
Safety	Does the output avoid harmful content?	Policy classifier, human review
Cost	How many tokens does this feature consume?	Token counter, cost attribution
Latency	How long does the user wait?	Timing instrumentation
User satisfaction	Do users find the output useful?	Thumbs up/down, session signals

Do not try to capture all five at once when you are starting out. Correctness and safety are table stakes; add cost and latency once those are stable.

2. How to measure it

from dataclasses import dataclass
from enum import Enum
from typing import Callable

class EvalMethod(Enum):
    EXACT_MATCH = "exact_match"       # output == expected; works for classification
    PROGRAMMATIC = "programmatic"     # schema check, regex, code execution
    LLM_JUDGE = "llm_judge"          # strong model grades the output
    HUMAN = "human"                   # human annotator reviews

@dataclass
class EvalCase:
    id: str
    input: str
    expected: str | None       # None for reference-free evals
    method: EvalMethod
    rubric: str | None = None  # used for LLM_JUDGE cases

def run_eval_case(case: EvalCase, system_fn: Callable[[str], str]) -> dict:
    output = system_fn(case.input)

    if case.method == EvalMethod.EXACT_MATCH:
        passed = output.strip().lower() == (case.expected or "").strip().lower()
        return {"id": case.id, "passed": passed, "output": output}

    if case.method == EvalMethod.PROGRAMMATIC:
        # Caller provides the rubric as a callable check function name
        # For this example, we check JSON validity
        try:
            import json
            json.loads(output)
            return {"id": case.id, "passed": True, "output": output}
        except Exception:
            return {"id": case.id, "passed": False, "output": output}

    if case.method == EvalMethod.LLM_JUDGE:
        score = judge_output(case.input, output, case.rubric or "")
        return {"id": case.id, "passed": score >= 3, "score": score, "output": output}

    raise ValueError(f"Method {case.method} requires external review")

def judge_output(user_input: str, output: str, rubric: str) -> int:
    """Returns a score from 1 to 5 using a strong model as judge."""
    response = llm.chat(
        model="frontier",
        messages=[{
            "role": "user",
            "content": (
                f"Rate this response 1–5 based on the rubric.\n\n"
                f"Rubric: {rubric}\n\n"
                f"User input: {user_input}\n\n"
                f"Response: {output}\n\n"
                f"Output only an integer 1–5."
            )
        }]
    )
    try:
        return int(response.text.strip())
    except ValueError:
        return 1

3. When to measure it

# The three eval windows — each catches different failure classes

EVAL_TIMING = {
    "offline_pre_deploy": {
        "when": "Before every merge to main",
        "what": "Regression suite — known-good cases, edge cases, red-team cases",
        "latency_budget": "Under 10 minutes for a full CI gate",
        "catches": "Regressions from prompt changes, model upgrades, schema changes",
    },
    "offline_experiment": {
        "when": "During development, before opening a PR",
        "what": "Full eval suite including quality metrics",
        "latency_budget": "Under 30 minutes; run in background",
        "catches": "Quality regressions before they become PRs",
    },
    "online_production": {
        "when": "Continuously, on live traffic",
        "what": "Sample-and-judge pipeline, user feedback signals, cost metrics",
        "latency_budget": "Async — does not block requests",
        "catches": "Distribution shift, long-tail failures, model drift over time",
    },
}

Reference-based vs reference-free evaluation

Reference-based eval compares the system output against a known-good answer. It is high-precision when the reference is correct, but requires maintaining a labelled dataset. It also penalizes paraphrases: a correct answer phrased differently from the reference may score poorly.

Reference-free eval judges the output without a reference. An LLM judge, given only the input and the output, can assess whether the output is factually plausible, complete, and appropriately toned: without needing a gold answer. This scales better but is less precise.

def reference_based_score(output: str, reference: str) -> float:
    """Simple token-overlap score (F1). Better metrics exist — see module 6.3."""
    output_tokens = set(output.lower().split())
    reference_tokens = set(reference.lower().split())
    if not reference_tokens:
        return 0.0
    overlap = output_tokens & reference_tokens
    precision = len(overlap) / len(output_tokens) if output_tokens else 0
    recall = len(overlap) / len(reference_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def reference_free_score(user_input: str, output: str) -> int:
    """LLM judge without a reference answer — scores 1–5."""
    response = llm.chat(
        model="frontier",
        messages=[{
            "role": "user",
            "content": (
                f"The user asked: {user_input}\n\n"
                f"The system responded: {output}\n\n"
                f"Rate the response 1 (very poor) to 5 (excellent) on accuracy and helpfulness. "
                f"Output only an integer."
            )
        }]
    )
    try:
        return int(response.text.strip())
    except ValueError:
        return 1

A brief taxonomy of eval types

Eval type	Purpose	When to run
Unit eval	Test one component (prompt, retriever, parser) in isolation	During development
Integration eval	Test the full pipeline end-to-end	Before every deploy
Regression eval	Ensure a change didn’t break previously passing cases	On every PR
A/B eval	Compare two versions on the same inputs	When promoting a change
Red-team eval	Discover failure modes via adversarial inputs	Before major releases

The rest of this track covers each in depth. This module has given you the vocabulary and the mental model. The key insight to carry forward: there is no single right answer for how to evaluate an LLM system: the method, timing, and coverage must be designed for the specific system and the specific quality dimensions that matter to your users.

Layer 3: Deep Dive

Why aggregate metrics are dangerous

Consider a system that handles five query categories with the following correctness rates:

Category	Correctness	Query share
General FAQ	99%	50%
Product lookup	98%	30%
Billing inquiry	95%	15%
Account closure	20%	4%
Escalation routing	10%	1%

Weighted aggregate: approximately 95%: which looks excellent. But the two lowest-performing categories are precisely the highest-stakes ones: account closure and escalation routing. A user who needs to close their account is failing 80% of the time. No aggregate metric reveals this.

The discipline of segmented evaluation requires:

Taxonomy-first thinking: Before writing a single eval case, categorize your input space. What are the distinct query types? What are the distinct user segments? What are the distinct output dimensions you care about?
Per-category baselines: Set a threshold not just for the aggregate but for each category. If account closure drops below 80%, that is a blocking failure regardless of the aggregate.
Category drift monitoring: Track the distribution of queries over time. A category that was rare at launch may become dominant as user behavior changes.

The eval coverage problem

You cannot enumerate all failure modes before deployment. This is not a limitation of your process: it is a structural property of open-ended generation. The space of possible inputs is infinite; the space of possible failure modes is unbounded.

What you can do:

Cover known categories exhaustively: the cases you can anticipate should be well-represented
Sample from the long tail: production queries surface failure modes that development testing never will
Use adversarial construction: systematically probe boundaries (empty inputs, very long inputs, multilingual inputs, injected instructions)
Track uncovered failures: when something fails in production, add it to the eval set immediately

The eval coverage problem does not have a solution; it has a practice. The practice is continuous: add cases, measure coverage, add more cases.

Eval design decisions table

Decision	Options	When to use each
Ground truth source	Human annotation	When quality is subjective and stakes are high
	LLM-generated + human review	When volume is high and you want speed with oversight
	Programmatic (schema, regex)	When output has a well-defined structure
Judge model	Same model being evaluated	Avoid: self-favouritism bias
	Stronger model of same family	Acceptable for quality tasks
	Different model family	Preferred for safety and bias auditing
Scoring scale	Binary pass/fail	Classification, structured output tasks
	1–5 Likert	Quality dimensions where degree matters
	Pairwise comparison	When absolute scales are hard to calibrate

What Makes LLM Evaluation Hard