Layer 1: Surface
A knowledge distillation pipeline uses a large, capable model (the teacher) to generate examples that a smaller model (the student) then trains on. Done well, the student learns to match or approximate the teacher’s performance on a specific task at a fraction of the inference cost.
This is distinct from using synthetic data for evaluation (covered in track 6). Training data problems are more dangerous: they bake failure modes into the model weights permanently. A bad eval dataset gives you a wrong number. A bad training dataset gives you a broken model.
The core pipeline:
```
Real-world inputs → Teacher model → Synthetic outputs (labels, reasoning, answers)
                                        ↓
                                Quality filtering
                                        ↓
                             Student model training
                                        ↓
                     Evaluation against held-out real examples
```
When synthetic data genuinely helps:
| Scenario | Why it works |
|---|---|
| You need 10,000 labelled examples and have 200 real ones | Augments scarce real data |
| You want a specialised small model for a specific task | Teacher knowledge transferred to smaller student |
| Labelling real data is expensive (legal, medical review) | LLM generation is cheaper than expert annotation |
| You need diverse edge cases | LLMs can generate systematic variations |
When it hurts:
| Scenario | Risk |
|---|---|
| Teacher and student are from the same model family | Circular reinforcement — student learns teacher’s biases |
| No quality filtering | Hallucinations and errors go into training weights |
| No evaluation against real held-out data | Mode collapse is invisible until production |
| Teacher has known gaps on your domain | Student inherits and amplifies those gaps |
Production Gotcha: Synthetic data generated by the same model family you’re training on creates circular reinforcement — the student learns the teacher’s biases and failure modes. Always validate synthetic training data against a held-out set of real examples and monitor for mode collapse.
Layer 2: Guided
Building a distillation pipeline with quality filtering
The following builds a supervised fine-tuning (SFT) dataset by using a teacher model to generate answers, filtering low-quality outputs, and preparing the result for student training.
```python
import json
import re

from anthropic import Anthropic

client = Anthropic()

# A real-world set of questions from your domain (these must be real, not synthetic)
seed_questions = [
    "What happens when a Kubernetes pod exceeds its memory limit?",
    "Explain the difference between a ClusterIP and a NodePort service.",
    "What does kubectl rollout undo do and when should you use it?",
]


def generate_teacher_answer(question: str) -> dict:
    """Generate a detailed answer from the teacher model."""
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "Answer the following technical question clearly and accurately. "
                "If you are uncertain about any detail, say so explicitly.\n\n"
                f"Question: {question}"
            ),
        }],
    )
    return {
        "question": question,
        "answer": response.content[0].text,
        "model": "claude-opus-4-5",
    }


def quality_filter(example: dict) -> bool:
    """
    Filter out examples that signal low quality.
    In production, supplement with an LLM-as-judge scoring step.
    """
    answer = example["answer"]

    # Reject if the teacher expresses uncertainty — these make bad training examples
    uncertainty_markers = [
        "i'm not sure",
        "i don't know",
        "i'm uncertain",
        "i cannot confirm",
        "as of my knowledge cutoff",
    ]
    if any(marker in answer.lower() for marker in uncertainty_markers):
        return False

    # Reject very short answers — likely incomplete
    if len(answer.split()) < 50:
        return False

    # Reject if the answer doesn't address the question
    # (simple heuristic: check for question-keyword overlap)
    question_keywords = set(re.findall(r'\b\w{5,}\b', example["question"].lower()))
    answer_keywords = set(re.findall(r'\b\w{5,}\b', answer.lower()))
    overlap = len(question_keywords & answer_keywords) / max(len(question_keywords), 1)
    if overlap < 0.2:
        return False

    return True


def format_for_sft(example: dict) -> dict:
    """Format into the standard chat SFT format expected by most training libraries."""
    return {
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful Kubernetes expert. Answer accurately and clearly.",
            },
            {"role": "user", "content": example["question"]},
            {"role": "assistant", "content": example["answer"]},
        ]
    }


# Generate and filter
raw_examples = [generate_teacher_answer(q) for q in seed_questions]
filtered_examples = [e for e in raw_examples if quality_filter(e)]
training_examples = [format_for_sft(e) for e in filtered_examples]

print(f"Generated: {len(raw_examples)}, After filtering: {len(filtered_examples)}")

# Save as JSONL for training
with open("training_data.jsonl", "w") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")
```
Adding LLM-as-judge quality scoring
Simple heuristic filtering misses subtler failures. Add a scoring step that uses a separate model as judge:
```python
def score_example(example: dict, judge_client: Anthropic) -> float:
    """Score an example 0.0–1.0 using a judge model."""
    judge_prompt = f"""Rate the quality of this answer on a scale of 0.0 to 1.0.
Question: {example['question']}
Answer: {example['answer']}
Score on:
- Accuracy (0.4 weight): Is the answer factually correct?
- Completeness (0.3 weight): Does it fully address the question?
- Clarity (0.3 weight): Is it clearly written without unnecessary caveats?
Return only a JSON object: {{"score": 0.0, "reason": "brief reason"}}"""

    response = judge_client.messages.create(
        model="claude-opus-4-5",
        max_tokens=200,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    # Judges occasionally wrap JSON in markdown fences; strip them before parsing
    raw = response.content[0].text.strip()
    raw = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw)
    result = json.loads(raw)
    return result["score"]


# Use a different judge model than the teacher where possible.
# At minimum, use a different sampling run or a cross-model judge.
QUALITY_THRESHOLD = 0.75

scored_examples = []
for example in filtered_examples:
    score = score_example(example, client)
    if score >= QUALITY_THRESHOLD:
        scored_examples.append(example)
```
Validating against real held-out data
This is the step that most teams skip. Always hold out a set of real, human-verified examples and evaluate the student model against them after training:
```python
# held_out_real_examples: 50–200 real examples with human-verified answers.
# This set is never used for training or synthetic generation.

def evaluate_on_held_out(model, held_out_examples: list[dict]) -> dict:
    correct = 0
    for example in held_out_examples:
        student_answer = model.generate(example["question"])
        # matches_reference is your task-specific evaluation metric,
        # defined elsewhere: exact match for closed-form answers, an
        # LLM judge with a consistent rubric for open-ended ones.
        if matches_reference(student_answer, example["reference_answer"]):
            correct += 1
    return {
        "accuracy": correct / len(held_out_examples),
        "n": len(held_out_examples),
    }
```
If the student performs well on synthetic eval but poorly on held-out real examples, the synthetic data has diverged from the real distribution — a signal of mode collapse or circular reinforcement.
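A quick way to surface that divergence is to run the same evaluation on both sets and compare. A minimal sketch, assuming a trained `student_model`, a `synthetic_eval_examples` set, and the held-out real set from above (all hypothetical names here; the 0.15 gap threshold is an illustrative starting point, not a standard):

```python
# Evaluate against both a synthetic eval set and the real held-out set.
# student_model, synthetic_eval_examples, held_out_real_examples are
# assumed to exist from earlier steps.
synthetic_result = evaluate_on_held_out(student_model, synthetic_eval_examples)
real_result = evaluate_on_held_out(student_model, held_out_real_examples)

gap = synthetic_result["accuracy"] - real_result["accuracy"]
if gap > 0.15:  # illustrative threshold
    print(f"Warning: {gap:.0%} synthetic-vs-real accuracy gap; investigate "
          "mode collapse or circular reinforcement before shipping")
```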
Layer 3: Deep Dive
The theory of knowledge distillation
Knowledge distillation was formalised by Hinton et al. (2015) in the context of neural network compression, predating LLMs. The core idea: a student trained on a teacher’s soft outputs (probability distributions over the output space) learns more efficiently than a student trained on hard labels alone, because the soft outputs carry information about relationships between classes and the teacher’s uncertainty.
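In the original classification setting, the loss combines a temperature-softened KL term against the teacher's distribution with standard cross-entropy against the hard labels. A minimal PyTorch sketch of that loss (the temperature and alpha defaults are illustrative, not values from the paper's experiments):

```python
import torch
import torch.nn.functional as F

def distillation_loss(
    student_logits: torch.Tensor,
    teacher_logits: torch.Tensor,
    labels: torch.Tensor,
    temperature: float = 2.0,  # illustrative default
    alpha: float = 0.5,        # illustrative weighting
) -> torch.Tensor:
    """Hinton-style distillation: soft-target KL plus hard-label cross-entropy."""
    # Soften both distributions with the same temperature
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # KL between softened distributions, scaled by T^2 so soft-target
    # gradients stay comparable in magnitude as T changes (per the paper)
    soft_loss = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2
    # Standard cross-entropy against the hard labels
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

Note that SFT on teacher-generated text, as in Layer 2, is effectively hard-label distillation: the soft-target term requires logit-level access to the teacher, which hosted APIs typically do not expose.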
For LLMs, this translates to: training the student on the teacher's chain-of-thought reasoning, not just final answers, transfers more capability than final-answer distillation alone. The student learns the reasoning path, not just the output.
Why chain-of-thought distillation works:
When a teacher model generates step-by-step reasoning and the student trains on those steps, the student learns to decompose problems similarly. Magister et al. (2022) showed that training a 540M parameter student on GPT-3’s chain-of-thought traces yielded arithmetic reasoning performance matching models 20× larger — a significant jump from final-answer distillation alone.
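To apply this in the Layer 2 pipeline, change the teacher prompt to elicit the reasoning and keep the full trace as the training target. A sketch reusing the `client` from Layer 2; the prompt wording and the "Answer:" convention are assumptions, not a fixed format:

```python
def generate_cot_example(question: str) -> dict:
    """Ask the teacher for step-by-step reasoning, not just a final answer."""
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "Think through the following question step by step, "
                "then state your final answer on a line beginning 'Answer:'.\n\n"
                f"Question: {question}"
            ),
        }],
    )
    # Train the student on the full trace (reasoning plus answer),
    # not just the text after 'Answer:'
    return {"question": question, "target": response.content[0].text}
```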
Mode collapse in synthetic pipelines
Mode collapse refers to the student model converging on a narrow distribution of outputs — answering all variations of a question the same way, or losing the ability to generate diverse valid outputs that the teacher could produce.
The mechanism: synthetic data pipelines generate answers using sampling at some temperature. If the seed questions are narrow, the teacher’s outputs cluster around similar patterns. The student trains on this cluster, and its distribution collapses toward it. Subsequent generations (if the student is used as a new teacher) collapse further.
Detecting mode collapse:
```python
from collections import Counter
import statistics

def measure_output_diversity(model_outputs: list[str]) -> dict:
    """Measure diversity of model outputs as a proxy for mode collapse."""
    # Vocabulary diversity (unique tokens per response)
    vocab_sizes = [len(set(output.split())) for output in model_outputs]

    # Response length variance (low variance = mode collapse)
    lengths = [len(output.split()) for output in model_outputs]

    # N-gram repetition across outputs
    bigrams = []
    for output in model_outputs:
        words = output.split()
        bigrams.extend(zip(words, words[1:]))
    bigram_counts = Counter(bigrams)
    top_bigram_freq = bigram_counts.most_common(1)[0][1] / len(bigrams) if bigrams else 0

    return {
        "mean_vocab_size": statistics.mean(vocab_sizes),
        "length_std_dev": statistics.stdev(lengths) if len(lengths) > 1 else 0,
        "top_bigram_frequency": top_bigram_freq,
    }
```
A collapsing model shows: decreasing vocabulary size across responses, shrinking length variance, and increasing repetition of specific bigrams.
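In practice that means tracking the metrics across training rounds rather than reading a single snapshot. A usage sketch, assuming hypothetical `base_outputs` and `tuned_outputs` sampled from the pre- and post-training checkpoints on the same prompts (the thresholds are illustrative):

```python
# base_outputs / tuned_outputs: responses sampled from each checkpoint
# on an identical prompt set (hypothetical variables)
base = measure_output_diversity(base_outputs)
tuned = measure_output_diversity(tuned_outputs)

if tuned["length_std_dev"] < 0.5 * base["length_std_dev"]:
    print("Warning: response-length variance halved; possible mode collapse")
if tuned["top_bigram_frequency"] > 2 * base["top_bigram_frequency"]:
    print("Warning: bigram repetition doubled; outputs are converging")
```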
Named failure modes
Circular reinforcement. Training on outputs from the same model family as the student amplifies existing biases. The student learns not just the teacher’s competence but also its failure patterns — including systematic biases in how it frames certain topics, its characteristic errors on specific reasoning tasks, and its gaps on underrepresented domains. Use cross-family teachers (e.g., a Gemini teacher for a Llama student) wherever possible.
Hallucination injection. If quality filtering fails to remove examples where the teacher hallucinated, those hallucinations become training signal. The student learns to generate that incorrect content confidently, because the training label said it was correct. Hallucination in training data is more dangerous than in inference: at inference, you can catch it; in training, it propagates.
Distribution shift. Your seed questions define the training distribution. If seed questions are hand-crafted by engineers, they reflect what engineers think users ask — not what users actually ask. Collect seed questions from real user logs wherever possible.
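A sketch of what that looks like in the Layer 2 pipeline, assuming a hypothetical `user_queries.jsonl` log with a `query` field (adapt to your own logging format):

```python
import json
import random

# Sample seed questions from real user logs instead of hand-crafting them
# (the log path and "query" field are assumptions about your log schema)
with open("user_queries.jsonl") as f:
    logged_queries = [json.loads(line)["query"] for line in f]

seed_questions = random.sample(logged_queries, k=min(500, len(logged_queries)))
```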
Synthetic eval contamination. Teams that evaluate the student model only on synthetically generated eval sets get inflated scores. The student has been trained on synthetic data and the eval is also synthetic: of course it looks good. The only trustworthy evaluation is against real, held-out, human-verified examples.
Teacher degradation over time. Teacher models are updated by providers. An answer generated by a teacher model in January 2026 may differ from an answer generated by the “same” model in July 2026 if the provider deployed updates. Pin the teacher model version and record it in your dataset metadata.
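A minimal sketch of recording that provenance next to the exported dataset; the field names and the `PROMPT_TEMPLATE` constant are illustrative, not a standard schema:

```python
import datetime
import hashlib
import json

PROMPT_TEMPLATE = "Answer the following technical question clearly..."  # hypothetical

# Record provenance alongside the JSONL export so future runs can
# detect teacher drift
dataset_metadata = {
    "teacher_model": "claude-opus-4-5",  # pin the exact version you called
    "generated_at": datetime.date.today().isoformat(),
    "prompt_template_sha256": hashlib.sha256(PROMPT_TEMPLATE.encode()).hexdigest(),
    "quality_threshold": QUALITY_THRESHOLD,
}
with open("training_data.meta.json", "w") as f:
    json.dump(dataset_metadata, f, indent=2)
```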
Relationship to model collapse research
Shumailov et al. (2024) studied what happens when models are iteratively trained on each other’s outputs with no fresh real data — a scenario they call “model collapse.” They found that repeated self-distillation causes the model to lose the tails of the data distribution (rare but valid outputs), progressively converging to a narrower and less accurate model. The practical implication: synthetic data should augment real data, not replace it. A pipeline that stops collecting real examples and only generates synthetic ones will degrade over successive training cycles.
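One way to operationalise "augment, not replace" is to enforce a floor on the fraction of real data in every training mix. A sketch; the 30% floor is an illustrative choice, not a published recommendation:

```python
import random

def mix_dataset(real_examples: list, synthetic_examples: list,
                real_fraction: float = 0.3) -> list:
    """Cap synthetic volume so real data never falls below a fixed floor."""
    n_real = len(real_examples)
    # With n_real real examples, at most this many synthetic examples
    # keeps real data at >= real_fraction of the mix
    max_synthetic = int(n_real * (1 - real_fraction) / real_fraction)
    mixed = real_examples + random.sample(
        synthetic_examples, min(max_synthetic, len(synthetic_examples))
    )
    random.shuffle(mixed)
    return mixed
```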
Further reading
- Distilling the Knowledge in a Neural Network; Hinton et al., 2015. The foundational distillation paper — introduces soft targets and the temperature-scaled distillation loss.
- Teaching Small Language Models to Reason; Magister et al., 2022. Demonstrates that chain-of-thought distillation outperforms final-answer distillation at the same model scale.
- AI models collapse when trained on recursively generated data; Shumailov et al., 2024. Empirical evidence for model collapse in iterative self-distillation pipelines — the primary reference for understanding why real data cannot be fully replaced.
- Self-play Fine-tuning Converts Weak Language Models to Strong Language Models; Chen et al., 2024. An alternative approach — SPIN — where the student and teacher are the same model, with real data as the anchor. A useful counterpoint to pure teacher-student distillation.