Layer 1: Surface
Traditional software is deterministic. A function that converts Fahrenheit to Celsius gives the same answer every time. You can write a unit test that either passes or fails. LLM outputs are not deterministic, and even when they are consistent, “correctness” is often a matter of degree.
This structural difference means the entire evaluation toolbox from classical software engineering needs to be rebuilt. You cannot assertEqual a summary. You cannot unit-test a chatbot response. You need a different discipline.
LLM evaluation is hard for three specific reasons:
No single ground truth. For most tasks (summarization, question answering, creative generation), there is no single right answer. Multiple outputs can all be correct. A traditional pass/fail comparison against a reference answer will give misleading results.
Open-ended output space. A classification model can output one of N labels. An LLM can output arbitrary text. The number of possible failure modes is effectively unbounded, which means you cannot enumerate them all up front.
Context-dependent quality. An output that is excellent for a technical audience may be terrible for a general consumer. An answer that is correct in English may be wrong if your user’s query was in another language. Quality is not a property of the output alone: it depends on who is reading it and why.
Why it matters
Without a clear eval strategy, you are shipping blind. Prompt changes, model upgrades, RAG index updates, and tool schema changes all affect output quality without changing any code in an obvious way. The only way to know if a change made things better or worse is to measure it systematically.
Production Gotcha
A single aggregate accuracy number hides the distribution of failures: a system that is 95% correct overall may fail systematically on the specific query category that matters most to your users. Always segment eval results by query type, user segment, or input length before reporting a headline number. An 80% average with 0% on your highest-value query category is not an 80% system; it is a broken system for your most important users.
This happens because aggregate metrics are easy to report and hard to argue with. Segmentation requires more work but is the only way to surface the failure patterns that matter.
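To make the contrast concrete, here is a minimal sketch of segmenting pass/fail results by category before reporting a headline number. The result-dict shape (`category`, `passed`) is a hypothetical schema for illustration, not a fixed format:

```python
from collections import defaultdict

def segmented_accuracy(results: list[dict]) -> dict[str, float]:
    """Group pass/fail results by category and report per-category accuracy."""
    by_category: dict[str, list[bool]] = defaultdict(list)
    for r in results:
        by_category[r["category"]].append(r["passed"])
    return {cat: sum(v) / len(v) for cat, v in by_category.items()}

# A run where the aggregate looks fine but one category fails completely
results = (
    [{"category": "faq", "passed": True}] * 95
    + [{"category": "faq", "passed": False}] * 5
    + [{"category": "account_closure", "passed": False}] * 5
)
overall = sum(r["passed"] for r in results) / len(results)  # roughly 0.90
per_cat = segmented_accuracy(results)
# per_cat["account_closure"] is 0.0 — invisible in the aggregate
```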
Layer 2: Guided
The three axes of eval design
Every eval decision you make sits on one of three axes:
1. What to measure
| Dimension | What it captures | How you measure it |
|---|---|---|
| Correctness | Does the output contain the right information? | Reference comparison, LLM judge |
| Safety | Does the output avoid harmful content? | Policy classifier, human review |
| Cost | How many tokens does this feature consume? | Token counter, cost attribution |
| Latency | How long does the user wait? | Timing instrumentation |
| User satisfaction | Do users find the output useful? | Thumbs up/down, session signals |
Do not try to capture all five at once when you are starting out. Correctness and safety are table stakes; add cost and latency once those are stable.
2. How to measure it
```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable
import json


class EvalMethod(Enum):
    EXACT_MATCH = "exact_match"    # output == expected; works for classification
    PROGRAMMATIC = "programmatic"  # schema check, regex, code execution
    LLM_JUDGE = "llm_judge"        # strong model grades the output
    HUMAN = "human"                # human annotator reviews


@dataclass
class EvalCase:
    id: str
    input: str
    expected: str | None           # None for reference-free evals
    method: EvalMethod
    rubric: str | None = None      # used for LLM_JUDGE cases


def run_eval_case(case: EvalCase, system_fn: Callable[[str], str]) -> dict:
    output = system_fn(case.input)
    if case.method == EvalMethod.EXACT_MATCH:
        passed = output.strip().lower() == (case.expected or "").strip().lower()
        return {"id": case.id, "passed": passed, "output": output}
    if case.method == EvalMethod.PROGRAMMATIC:
        # In practice the caller supplies a check function;
        # as an example, this checks that the output is valid JSON
        try:
            json.loads(output)
            return {"id": case.id, "passed": True, "output": output}
        except json.JSONDecodeError:
            return {"id": case.id, "passed": False, "output": output}
    if case.method == EvalMethod.LLM_JUDGE:
        score = judge_output(case.input, output, case.rubric or "")
        return {"id": case.id, "passed": score >= 3, "score": score, "output": output}
    raise ValueError(f"Method {case.method} requires external review")


def judge_output(user_input: str, output: str, rubric: str) -> int:
    """Returns a score from 1 to 5 using a strong model as judge."""
    response = llm.chat(
        model="frontier",
        messages=[{
            "role": "user",
            "content": (
                f"Rate this response 1–5 based on the rubric.\n\n"
                f"Rubric: {rubric}\n\n"
                f"User input: {user_input}\n\n"
                f"Response: {output}\n\n"
                f"Output only an integer 1–5."
            ),
        }],
    )
    try:
        return int(response.text.strip())
    except ValueError:
        return 1  # treat unparseable judge output as the lowest score
```
3. When to measure it
```python
# The three eval windows — each catches different failure classes
EVAL_TIMING = {
    "offline_pre_deploy": {
        "when": "Before every merge to main",
        "what": "Regression suite — known-good cases, edge cases, red-team cases",
        "latency_budget": "Under 10 minutes for a full CI gate",
        "catches": "Regressions from prompt changes, model upgrades, schema changes",
    },
    "offline_experiment": {
        "when": "During development, before opening a PR",
        "what": "Full eval suite including quality metrics",
        "latency_budget": "Under 30 minutes; run in background",
        "catches": "Quality regressions before they become PRs",
    },
    "online_production": {
        "when": "Continuously, on live traffic",
        "what": "Sample-and-judge pipeline, user feedback signals, cost metrics",
        "latency_budget": "Async — does not block requests",
        "catches": "Distribution shift, long-tail failures, model drift over time",
    },
}
```
Reference-based vs reference-free evaluation
Reference-based eval compares the system output against a known-good answer. It is high-precision when the reference is correct, but requires maintaining a labelled dataset. It also penalizes paraphrases: a correct answer phrased differently from the reference may score poorly.
Reference-free eval judges the output without a reference. An LLM judge, given only the input and the output, can assess whether the output is factually plausible, complete, and appropriately toned, without needing a gold answer. This scales better but is less precise.
```python
def reference_based_score(output: str, reference: str) -> float:
    """Simple token-overlap score (F1). Better metrics exist — see module 6.3."""
    output_tokens = set(output.lower().split())
    reference_tokens = set(reference.lower().split())
    if not reference_tokens:
        return 0.0
    overlap = output_tokens & reference_tokens
    precision = len(overlap) / len(output_tokens) if output_tokens else 0.0
    recall = len(overlap) / len(reference_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```
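The paraphrase penalty mentioned above is easy to demonstrate. This self-contained sketch re-declares the same token-overlap F1 logic and scores a correct paraphrase against an exact match; the example sentences are illustrative:

```python
def token_f1(output: str, reference: str) -> float:
    """Same token-overlap F1 logic as reference_based_score above."""
    out, ref = set(output.lower().split()), set(reference.lower().split())
    if not ref:
        return 0.0
    overlap = out & ref
    p = len(overlap) / len(out) if out else 0.0
    r = len(overlap) / len(ref)
    return 2 * p * r / (p + r) if p + r else 0.0

reference = "water boils at 100 degrees celsius"
exact = token_f1("water boils at 100 degrees celsius", reference)       # 1.0
paraphrase = token_f1("the boiling point of water is 100 c", reference)
# the paraphrase is factually correct but shares few tokens,
# so it scores under 0.3 — a false negative for reference-based eval
```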
```python
def reference_free_score(user_input: str, output: str) -> int:
    """LLM judge without a reference answer — scores 1–5."""
    response = llm.chat(
        model="frontier",
        messages=[{
            "role": "user",
            "content": (
                f"The user asked: {user_input}\n\n"
                f"The system responded: {output}\n\n"
                f"Rate the response 1 (very poor) to 5 (excellent) on accuracy and helpfulness. "
                f"Output only an integer."
            ),
        }],
    )
    try:
        return int(response.text.strip())
    except ValueError:
        return 1
```
A brief taxonomy of eval types
| Eval type | Purpose | When to run |
|---|---|---|
| Unit eval | Test one component (prompt, retriever, parser) in isolation | During development |
| Integration eval | Test the full pipeline end-to-end | Before every deploy |
| Regression eval | Ensure a change didn’t break previously passing cases | On every PR |
| A/B eval | Compare two versions on the same inputs | When promoting a change |
| Red-team eval | Discover failure modes via adversarial inputs | Before major releases |
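A regression eval from the table reduces to one comparison: which previously passing cases now fail. A minimal sketch, assuming each run is stored as a mapping of case id to pass/fail (a hypothetical layout):

```python
def regression_failures(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Case ids that passed in the baseline run but fail in the current run."""
    return [cid for cid, passed in baseline.items()
            if passed and not current.get(cid, False)]

baseline = {"case-1": True, "case-2": True, "case-3": False}
current = {"case-1": True, "case-2": False, "case-3": True}
# regression_failures(baseline, current) returns ["case-2"];
# a CI gate would block the merge on any non-empty result
```

Note that `case-3` going from fail to pass is an improvement, not a regression, so it does not block.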
The rest of this track covers each in depth. This module has given you the vocabulary and the mental model. The key insight to carry forward: there is no single right answer for how to evaluate an LLM system. The method, timing, and coverage must be designed for the specific system and the specific quality dimensions that matter to your users.
Layer 3: Deep Dive
Why aggregate metrics are dangerous
Consider a system that handles five query categories with the following correctness rates:
| Category | Correctness | Query share |
|---|---|---|
| General FAQ | 99% | 50% |
| Product lookup | 98% | 30% |
| Billing inquiry | 95% | 15% |
| Account closure | 20% | 4% |
| Escalation routing | 10% | 1% |
Weighted aggregate: approximately 94%, which looks excellent. But the two lowest-performing categories are precisely the highest-stakes ones: account closure and escalation routing. A user who needs to close their account is failing 80% of the time. No aggregate metric reveals this.
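The arithmetic is worth checking directly. The figures below are copied from the table above; the category keys are illustrative names:

```python
categories = {
    # category: (correctness, query share), from the table above
    "general_faq": (0.99, 0.50),
    "product_lookup": (0.98, 0.30),
    "billing_inquiry": (0.95, 0.15),
    "account_closure": (0.20, 0.04),
    "escalation_routing": (0.10, 0.01),
}
aggregate = sum(score * share for score, share in categories.values())
# aggregate == 0.9405 — a "94% system" on paper
worst = min(categories, key=lambda k: categories[k][0])
# worst == "escalation_routing", at 10% correctness
```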
The discipline of segmented evaluation requires:
- Taxonomy-first thinking: Before writing a single eval case, categorize your input space. What are the distinct query types? What are the distinct user segments? What are the distinct output dimensions you care about?
- Per-category baselines: Set a threshold not just for the aggregate but for each category. If account closure drops below 80%, that is a blocking failure regardless of the aggregate.
- Category drift monitoring: Track the distribution of queries over time. A category that was rare at launch may become dominant as user behavior changes.
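Per-category baselines can be enforced with a few lines. A minimal sketch; the threshold values and the 0.90 default are illustrative, not recommendations:

```python
def blocking_failures(per_category: dict[str, float],
                      thresholds: dict[str, float],
                      default_threshold: float = 0.90) -> list[str]:
    """Categories scoring below their own threshold — each one blocks
    a deploy regardless of how good the aggregate looks."""
    return [cat for cat, score in per_category.items()
            if score < thresholds.get(cat, default_threshold)]

per_category = {"general_faq": 0.99, "account_closure": 0.20}
thresholds = {"general_faq": 0.95, "account_closure": 0.80}
# blocking_failures(per_category, thresholds) returns ["account_closure"]
```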
The eval coverage problem
You cannot enumerate all failure modes before deployment. This is not a limitation of your process: it is a structural property of open-ended generation. The space of possible inputs is infinite; the space of possible failure modes is unbounded.
What you can do:
- Cover known categories exhaustively: the cases you can anticipate should be well-represented
- Sample from the long tail: production queries surface failure modes that development testing never will
- Use adversarial construction: systematically probe boundaries (empty inputs, very long inputs, multilingual inputs, injected instructions)
- Track uncovered failures: when something fails in production, add it to the eval set immediately
The eval coverage problem does not have a solution; it has a practice. The practice is continuous: add cases, measure coverage, add more cases.
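The "track uncovered failures" step is worth automating so that every triaged production failure becomes a permanent regression case. A sketch, assuming a JSONL eval file with a hypothetical record layout; the expected answer stays empty until a human labels it:

```python
import json
import pathlib
import tempfile

def add_failure_to_eval_set(path: pathlib.Path, input_text: str, notes: str) -> None:
    """Append a production failure to a JSONL eval file (hypothetical layout)."""
    case = {"input": input_text, "expected": None, "source": "production", "notes": notes}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")

# usage: called from the triage workflow when a failure is confirmed
eval_file = pathlib.Path(tempfile.mkdtemp()) / "regressions.jsonl"
add_failure_to_eval_set(eval_file, "close my acct pls", "router sent query to FAQ flow")
```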
Eval design decisions table
| Decision | Options | When to use each |
|---|---|---|
| Ground truth source | Human annotation | When quality is subjective and stakes are high |
| | LLM-generated + human review | When volume is high and you want speed with oversight |
| | Programmatic (schema, regex) | When output has a well-defined structure |
| Judge model | Same model being evaluated | Avoid: self-favouritism bias |
| | Stronger model of same family | Acceptable for quality tasks |
| | Different model family | Preferred for safety and bias auditing |
| Scoring scale | Binary pass/fail | Classification, structured output tasks |
| | 1–5 Likert | Quality dimensions where degree matters |
| | Pairwise comparison | When absolute scales are hard to calibrate |
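Pairwise comparisons still need aggregation before they are reportable. A minimal win-rate sketch; counting ties as half a win is one common convention, not the only one:

```python
def win_rate(outcomes: list[str]) -> float:
    """Fraction of pairwise comparisons won by candidate A.
    Each outcome is 'A', 'B', or 'tie'; ties count as half a win."""
    wins = sum(1.0 if o == "A" else 0.5 if o == "tie" else 0.0 for o in outcomes)
    return wins / len(outcomes)

# win_rate(["A", "A", "B", "tie"]) -> 0.625
```

As with accuracy, a single win rate hides per-category differences; segment pairwise results the same way.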
Further reading
- A Survey on Evaluation of Large Language Models; Chang et al., 2023. Comprehensive taxonomy of LLM eval approaches, metrics, and benchmarks; the three-axis framing in this module is consistent with their categorical breakdown.
- Holistic Evaluation of Language Models (HELM); Stanford CRFM. Multi-dimensional evaluation framework; the per-scenario breakdown is the practical implementation of segmented evaluation.
- Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference; Chiang et al., 2024. Pairwise comparison at scale; shows why aggregate win-rates hide per-category differences.