🤖 AI Explained

Synthetic Data for Training & Distillation

You can use a large model to generate training data for a smaller one, but the pipeline has failure modes that are hard to detect and expensive to fix once they're baked into the weights. This module covers how to build a synthetic data pipeline without training those failure modes into your model.

Layer 1: Surface

A knowledge distillation pipeline uses a large, capable model (the teacher) to generate examples that a smaller model (the student) then trains on. Done well, the student learns to match or approximate the teacher’s performance on a specific task at a fraction of the inference cost.

This is distinct from using synthetic data for evaluation (covered in track 6). Training data problems are more dangerous: they bake failure modes into the model weights permanently. A bad eval dataset gives you a wrong number. A bad training dataset gives you a broken model.

The core pipeline:

Real-world inputs → Teacher model → Synthetic outputs (labels, reasoning, answers)
                              ↓
                              Quality filtering
                              ↓
                              Student model training
                              ↓
                              Evaluation against held-out real examples

When synthetic data genuinely helps:

| Scenario | Why it works |
| --- | --- |
| You need 10,000 labelled examples and have 200 real ones | Augments scarce real data |
| You want a specialised small model for a specific task | Teacher knowledge is transferred to a smaller student |
| Labelling real data is expensive (legal, medical review) | LLM generation is cheaper than expert annotation |
| You need diverse edge cases | LLMs can generate systematic variations |
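The edge-case row deserves a sketch: template expansion lets a teacher answer systematic variations of one seed question. A minimal illustration in plain Python (the template and slot values are hypothetical):

```python
from itertools import product

def expand_seed(question_template: str, slots: dict[str, list[str]]) -> list[str]:
    """Expand one seed question template into systematic variations.

    Each combination of slot values produces one candidate question,
    which a teacher model can then answer.
    """
    keys = list(slots)
    return [
        question_template.format(**dict(zip(keys, combo)))
        for combo in product(*(slots[k] for k in keys))
    ]

variants = expand_seed(
    "What happens when a Kubernetes {resource} exceeds its {limit} limit?",
    {"resource": ["pod", "container"], "limit": ["memory", "CPU", "ephemeral-storage"]},
)
# 2 resources × 3 limits → 6 variant questions
```

Each variant then goes through the same teacher-generation and quality-filtering steps as a hand-written seed.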

When it hurts:

| Scenario | Risk |
| --- | --- |
| Teacher and student are from the same model family | Circular reinforcement — student learns teacher’s biases |
| No quality filtering | Hallucinations and errors go into training weights |
| No evaluation against real held-out data | Mode collapse is invisible until production |
| Teacher has known gaps on your domain | Student inherits and amplifies those gaps |

Production Gotcha: Synthetic data generated by the same model family you’re training on creates circular reinforcement — the student learns the teacher’s biases and failure modes. Always validate synthetic training data against a held-out set of real examples and monitor for mode collapse.


Layer 2: Guided

Building a distillation pipeline with quality filtering

The following builds a supervised fine-tuning (SFT) dataset by using a teacher model to generate answers, filtering low-quality outputs, and preparing the result for student training.

import json
import re
from anthropic import Anthropic

client = Anthropic()

# A real-world set of questions from your domain (these must be real, not synthetic)
seed_questions = [
    "What happens when a Kubernetes pod exceeds its memory limit?",
    "Explain the difference between a ClusterIP and a NodePort service.",
    "What does kubectl rollout undo do and when should you use it?",
]

def generate_teacher_answer(question: str) -> dict:
    """Generate a detailed answer from the teacher model."""
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"Answer the following technical question clearly and accurately. "
                f"If you are uncertain about any detail, say so explicitly.\n\n"
                f"Question: {question}"
            )
        }]
    )
    return {
        "question": question,
        "answer": response.content[0].text,
        "model": "claude-opus-4-5",
    }

def quality_filter(example: dict) -> bool:
    """
    Filter out examples that signal low quality.
    In production, supplement with an LLM-as-judge scoring step.
    """
    answer = example["answer"]

    # Reject if the teacher expresses uncertainty — these make bad training examples
    uncertainty_markers = [
        "i'm not sure",
        "i don't know",
        "i'm uncertain",
        "i cannot confirm",
        "as of my knowledge cutoff",
    ]
    if any(marker in answer.lower() for marker in uncertainty_markers):
        return False

    # Reject very short answers — likely incomplete
    if len(answer.split()) < 50:
        return False

    # Reject if the answer doesn't address the question (simple heuristic: check for question keywords)
    question_keywords = set(re.findall(r'\b\w{5,}\b', example["question"].lower()))
    answer_keywords = set(re.findall(r'\b\w{5,}\b', answer.lower()))
    overlap = len(question_keywords & answer_keywords) / max(len(question_keywords), 1)
    if overlap < 0.2:
        return False

    return True

def format_for_sft(example: dict) -> dict:
    """Format into the standard chat SFT format expected by most training libraries."""
    return {
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful Kubernetes expert. Answer accurately and clearly."
            },
            {
                "role": "user",
                "content": example["question"]
            },
            {
                "role": "assistant",
                "content": example["answer"]
            }
        ]
    }

# Generate and filter
raw_examples = [generate_teacher_answer(q) for q in seed_questions]
filtered_examples = [e for e in raw_examples if quality_filter(e)]
training_examples = [format_for_sft(e) for e in filtered_examples]

print(f"Generated: {len(raw_examples)}, After filtering: {len(filtered_examples)}")

# Save as JSONL for training
with open("training_data.jsonl", "w") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")

Adding LLM-as-judge quality scoring

Simple heuristic filtering misses subtler failures. Add a scoring step that uses a separate model as judge:

def score_example(example: dict, judge_client: Anthropic) -> float:
    """Score an example 0.0–1.0 using a judge model."""
    judge_prompt = f"""Rate the quality of this answer on a scale of 0.0 to 1.0.

Question: {example['question']}

Answer: {example['answer']}

Score on:
- Accuracy (0.4 weight): Is the answer factually correct?
- Completeness (0.3 weight): Does it fully address the question?
- Clarity (0.3 weight): Is it clearly written without unnecessary caveats?

Return only a JSON object: {{"score": 0.0, "reason": "brief reason"}}"""

    response = judge_client.messages.create(
        model="claude-opus-4-5",
        max_tokens=200,
        messages=[{"role": "user", "content": judge_prompt}]
    )

    # The judge may wrap the JSON in prose; extract the first JSON object defensively
    match = re.search(r"\{.*\}", response.content[0].text, re.DOTALL)
    if not match:
        return 0.0
    return json.loads(match.group(0)).get("score", 0.0)

# Use a different judge model than the teacher where possible
# At minimum, use a different sampling run or a cross-model judge
QUALITY_THRESHOLD = 0.75
scored_examples = []
for example in filtered_examples:
    score = score_example(example, client)
    if score >= QUALITY_THRESHOLD:
        scored_examples.append(example)

Validating against real held-out data

This is the step that most teams skip. Always hold out a set of real, human-verified examples and evaluate the student model against them after training:

# held_out_real_examples: 50-200 real examples with human-verified answers
# This set is never used for training or synthetic generation

def evaluate_on_held_out(model, held_out_examples: list[dict]) -> dict:
    correct = 0
    for example in held_out_examples:
        student_answer = model.generate(example["question"])
        # Use your task-specific evaluation metric here
        # For exact match: simple comparison
        # For open-ended: LLM judge with consistent rubric
        if matches_reference(student_answer, example["reference_answer"]):
            correct += 1

    return {
        "accuracy": correct / len(held_out_examples),
        "n": len(held_out_examples),
    }

If the student performs well on synthetic eval but poorly on held-out real examples, the synthetic data has diverged from the real distribution — a signal of mode collapse or circular reinforcement.
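A minimal sketch of that check, assuming you already have accuracy numbers from both eval sets (the 0.10 gap threshold is an illustrative assumption; tune it for your task):

```python
def synthetic_real_gap(synthetic_accuracy: float, real_accuracy: float,
                       max_gap: float = 0.10) -> dict:
    """Flag divergence between synthetic and real held-out eval scores.

    A large gap suggests the synthetic data has drifted from the real
    distribution (mode collapse or circular reinforcement).
    """
    gap = synthetic_accuracy - real_accuracy
    return {"gap": round(gap, 3), "diverged": gap > max_gap}

report = synthetic_real_gap(0.91, 0.67)
print(report)  # a 0.24 gap is well past the threshold
```

Run this after every training cycle, not just the first one: circular reinforcement widens the gap gradually across rounds.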


Layer 3: Deep Dive

The theory of knowledge distillation

Knowledge distillation was formalised by Hinton et al. (2015) in the context of neural network compression, predating LLMs. The core idea: a student trained on a teacher’s soft outputs (probability distributions over the output space) learns more efficiently than a student trained on hard labels alone, because the soft outputs carry information about relationships between classes and the teacher’s uncertainty.

For LLMs, this translates as: training the student on the teacher’s chain-of-thought reasoning, not just final answers, transfers more capability than final-answer distillation alone. The student learns the reasoning path, not just the output.

Why chain-of-thought distillation works:

When a teacher model generates step-by-step reasoning and the student trains on those steps, the student learns to decompose problems similarly. Magister et al. (2022) showed that fine-tuning small student models on chain-of-thought traces from much larger teachers substantially improved their arithmetic reasoning compared with final-answer distillation alone.
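A sketch of what chain-of-thought distillation changes in practice: the teacher is prompted for step-by-step reasoning, and the full trace, not just the final answer, becomes the assistant turn. The structure mirrors the SFT chat format from Layer 2; the prompt wording and helper name are assumptions:

```python
def format_cot_for_sft(question: str, reasoning: str, final_answer: str) -> dict:
    """Build an SFT example whose training target includes the reasoning trace.

    Training on the full trace teaches the student *how* the teacher
    decomposes the problem, not only which answer it reaches.
    """
    return {
        "messages": [
            {"role": "user",
             "content": f"{question}\n\nThink step by step, then give the final answer."},
            {"role": "assistant",
             "content": f"{reasoning}\n\nFinal answer: {final_answer}"},
        ]
    }

example = format_cot_for_sft(
    "A pod requests 512Mi and has a 1Gi limit. How much headroom does it have?",
    "The limit is 1Gi = 1024Mi. Headroom is limit minus request: 1024Mi - 512Mi.",
    "512Mi",
)
```

Quality filtering matters even more here: a wrong reasoning step that happens to reach the right answer still becomes training signal.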

Mode collapse in synthetic pipelines

Mode collapse refers to the student model converging on a narrow distribution of outputs — answering all variations of a question the same way, or losing the ability to generate diverse valid outputs that the teacher could produce.

The mechanism: synthetic data pipelines generate answers using sampling at some temperature. If the seed questions are narrow, the teacher’s outputs cluster around similar patterns. The student trains on this cluster, and its distribution collapses toward it. Subsequent generations (if the student is used as a new teacher) collapse further.

Detecting mode collapse:

from collections import Counter
import statistics

def measure_output_diversity(model_outputs: list[str]) -> dict:
    """Measure diversity of model outputs as a proxy for mode collapse."""
    # Measure vocabulary diversity (unique tokens per response)
    vocab_sizes = [len(set(output.split())) for output in model_outputs]

    # Measure response length variance (low variance = mode collapse)
    lengths = [len(output.split()) for output in model_outputs]

    # Measure n-gram repetition across outputs
    bigrams = []
    for output in model_outputs:
        words = output.split()
        bigrams.extend(zip(words, words[1:]))

    bigram_counts = Counter(bigrams)
    top_bigram_freq = bigram_counts.most_common(1)[0][1] / len(bigrams) if bigrams else 0

    return {
        "mean_vocab_size": statistics.mean(vocab_sizes),
        "length_std_dev": statistics.stdev(lengths) if len(lengths) > 1 else 0,
        "top_bigram_frequency": top_bigram_freq,
    }

A collapsing model shows: decreasing vocabulary size across responses, shrinking length variance, and increasing repetition of specific bigrams.
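Tracking those signals across successive fine-tuning rounds turns them into an alert. A minimal sketch, taking two outputs of `measure_output_diversity()` (the drop thresholds are illustrative assumptions):

```python
def collapse_alert(prev: dict, curr: dict,
                   vocab_drop: float = 0.15, std_drop: float = 0.30) -> list[str]:
    """Compare diversity metrics from two rounds and flag collapse signals.

    `prev` and `curr` are dicts as returned by measure_output_diversity()
    for the previous and current fine-tuning round.
    """
    alerts = []
    if curr["mean_vocab_size"] < prev["mean_vocab_size"] * (1 - vocab_drop):
        alerts.append("vocabulary shrinking")
    if curr["length_std_dev"] < prev["length_std_dev"] * (1 - std_drop):
        alerts.append("length variance collapsing")
    if curr["top_bigram_frequency"] > prev["top_bigram_frequency"] * 1.5:
        alerts.append("bigram repetition rising")
    return alerts

prev = {"mean_vocab_size": 120.0, "length_std_dev": 40.0, "top_bigram_frequency": 0.01}
curr = {"mean_vocab_size": 90.0, "length_std_dev": 20.0, "top_bigram_frequency": 0.03}
alerts = collapse_alert(prev, curr)  # all three signals fire on these numbers
```

Any single signal can be noise; all three together are a strong indicator that the next round needs more real data.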

Named failure modes

Circular reinforcement. Training on outputs from the same model family as the student amplifies existing biases. The student learns not just the teacher’s competence but also its failure patterns — including systematic biases in how it frames certain topics, its characteristic errors on specific reasoning tasks, and its gaps on underrepresented domains. Use cross-family teachers (e.g., a Gemini teacher for a Llama student) wherever possible.

Hallucination injection. If quality filtering fails to remove examples where the teacher hallucinated, those hallucinations become training signal. The student learns to generate that incorrect content confidently, because the training label said it was correct. Hallucination in training data is more dangerous than in inference: at inference, you can catch it; in training, it propagates.

Distribution shift. Your seed questions define the training distribution. If seed questions are hand-crafted by engineers, they reflect what engineers think users ask — not what users actually ask. Collect seed questions from real user logs wherever possible.
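A minimal sketch of sourcing seeds from logs, assuming you have raw query strings. Frequency-ranking keeps the seed set aligned with what users actually ask rather than what engineers expect:

```python
from collections import Counter

def seeds_from_logs(user_queries: list[str], n_seeds: int) -> list[str]:
    """Pick seed questions from real user logs, most frequent first.

    Normalisation (strip + lowercase) collapses trivial duplicates; a
    production version would also cluster paraphrases.
    """
    normalised = [q.strip().lower() for q in user_queries if q.strip()]
    counts = Counter(normalised)
    return [q for q, _ in counts.most_common(n_seeds)]

logs = ["How do I restart a pod?", "how do i restart a pod?  ",
        "What is a ClusterIP?", "How do I restart a pod?"]
seeds = seeds_from_logs(logs, n_seeds=2)
```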

Synthetic eval contamination. Teams that evaluate the student model only on synthetically generated eval sets get inflated scores. The student has been trained on synthetic data and the eval is also synthetic: of course it looks good. The only trustworthy evaluation is against real, held-out, human-verified examples.

Teacher degradation over time. Teacher models are updated by providers. An answer generated by a teacher model in January 2026 may differ from an answer generated by the “same” model in July 2026 if the provider deployed updates. Pin the teacher model version and record it in your dataset metadata.
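A sketch of the provenance worth recording alongside a synthetic dataset (the field names are assumptions; the point is that the exact generation setup is auditable and reproducible later):

```python
import hashlib
from datetime import datetime, timezone

def dataset_metadata(teacher_model: str, prompt_template: str,
                     temperature: float) -> dict:
    """Record enough provenance to audit or regenerate a synthetic dataset."""
    return {
        # Pin an exact version string, never a floating alias like "latest"
        "teacher_model": teacher_model,
        # Hash the prompt so silent prompt edits are detectable
        "prompt_sha256": hashlib.sha256(prompt_template.encode()).hexdigest()[:12],
        "temperature": temperature,
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }

meta = dataset_metadata(
    "claude-opus-4-5",
    "Answer the following technical question clearly and accurately.",
    1.0,
)
```

Store this once per generation batch; mixing batches from different teacher snapshots without labels makes regressions impossible to trace.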

Relationship to model collapse research

Shumailov et al. (2024) studied what happens when models are iteratively trained on each other’s outputs with no fresh real data — a scenario they call “model collapse.” They found that repeated self-distillation causes the model to lose the tails of the data distribution (rare but valid outputs), progressively converging to a narrower and less accurate model. The practical implication: synthetic data should augment real data, not replace it. A pipeline that stops collecting real examples and only generates synthetic ones will degrade over successive training cycles.
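One way to enforce augment-not-replace is a hard cap on the synthetic-to-real ratio at training time. A minimal sketch (the 4:1 cap is an illustrative assumption, not a published recommendation):

```python
import random

def mix_training_data(real: list, synthetic: list,
                      max_synthetic_ratio: float = 4.0, seed: int = 0) -> list:
    """Cap synthetic examples at a multiple of the real examples.

    Every training run keeps all real data and at most
    max_synthetic_ratio × len(real) randomly sampled synthetic examples.
    """
    rng = random.Random(seed)
    cap = int(len(real) * max_synthetic_ratio)
    kept = rng.sample(synthetic, min(cap, len(synthetic)))
    mixed = real + kept
    rng.shuffle(mixed)
    return mixed

mixed = mix_training_data(real=["r"] * 200, synthetic=["s"] * 2000)
# 200 real + 800 synthetic = 1,000 training examples
```

The cap also forces the question that matters: if you need more training data, the first lever is collecting more real examples, not generating more synthetic ones.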


Synthetic Data for Training & Distillation — Check your understanding

Q1

You're building a distillation pipeline: a large frontier model generates answers, and a smaller model trains on them. Both models are from the same provider's model family. What specific risk does this introduce?

Q2

After training on synthetic data, your student model scores 0.91 on the synthetic eval set but only 0.67 on your held-out real examples. What does this indicate?

Q3

Your team is considering using chain-of-thought distillation instead of final-answer distillation for training a student model on a reasoning task. What is the primary benefit?

Q4

Six months after deploying a synthetic-data-trained model, you notice it handles common queries well but completely fails on rare but valid edge cases it used to handle correctly. Diversity metrics on its outputs show decreasing vocabulary and length variance over successive fine-tuning rounds. What is happening?

Q5

You're building a synthetic data pipeline for a medical Q&A model. Your teacher model occasionally expresses uncertainty by saying 'I'm not sure about this specific dosage.' Should these examples be included in training data?