Layer 1: Surface
Model alignment is the problem of making an AI system do what its designers intended, even in situations the designers did not anticipate. Two complementary techniques dominate the field: RLHF (Reinforcement Learning from Human Feedback) and Constitutional AI (CAI).
Both techniques modify model behaviour during training β before the model is deployed. This is fundamentally different from inference-time guardrails, which intercept requests after the model is loaded.
How training-time and inference-time safety differ:
| Dimension | Training-time (RLHF / CAI) | Inference-time (guardrails) |
|---|---|---|
| When it acts | Before deployment, baked into weights | On every request, at runtime |
| What it can do | Shift the probability distribution over outputs | Block, rewrite, or flag specific outputs |
| Coverage | Broad β affects all outputs probabilistically | Narrow β only what the guardrail checks |
| Bypass risk | Adversarial prompts shift context | Guardrail evasion, prompt injection |
| Latency cost | Zero (already in the model) | Adds per-request overhead |
Neither alone is sufficient. Training-time alignment reduces the probability of harmful outputs. Inference-time guardrails catch the tail of cases that slip through.
Layer 2: Guided
RLHF: the three-phase process
Phase 1 β Supervised fine-tuning (SFT)
A base pretrained model is fine-tuned on a curated dataset of high-quality (prompt, response) pairs. Human labellers write or select responses that demonstrate the target behaviour. This step teaches the model the desired output style and basic instruction following.
Base model + SFT dataset β SFT model
SFT dataset: curated examples of helpfulness, safety, factuality
Phase 2 β Reward model training
Human labellers compare pairs of responses (A vs. B) and indicate which is better. A separate reward model (RM) is trained to predict these preferences. The RM learns to score responses on a continuous scale that reflects human judgement about helpfulness, harmlessness, and honesty.
SFT model outputs + human preference labels β Reward model
Phase 3 β Reinforcement learning (PPO)
The SFT model is further fine-tuned using the reward model as a signal. The Proximal Policy Optimisation (PPO) algorithm updates model weights to increase the expected reward β making the model more likely to produce outputs that score highly on the reward model.
SFT model + reward model β PPO fine-tuning β RLHF-aligned model
Constitutional AI: the self-critique loop
Constitutional AI (Anthropic, 2022) replaces much of the human labelling in RLHF with an automated critique-and-revision loop guided by a written βconstitutionβ β a set of principles the model should follow.
CAI training process:
1. Model generates response to a harmful prompt
2. Model critiques its own response against the constitution
("Is this response harmful? Does it violate principle 3?")
3. Model revises the response based on its critique
4. Revised responses are used as SFT training data (Constitutional SFT)
5. Preference pairs are generated by model self-comparison
(RLAIF β Reinforcement Learning from AI Feedback)
6. Reward model trained on AI-generated preference pairs
7. RL fine-tuning using this reward model
This reduces the human labelling requirement for the safety dimension of alignment. Human labellers focus on helpfulness; the constitution handles the harmlessness training signal.
What alignment achieves and what it does not
# Conceptual model of alignment's effect on the output distribution
# Before alignment: outputs are samples from the base distribution
# P(harmful_output) might be 0.05 for a given harmful prompt
# After RLHF/CAI:
# P(harmful_output) shifts to ~0.001 for the same prompt
# For most prompts, the model reliably refuses or redirects
# But adversarial prompts can shift the context in ways that
# restore parts of the pre-alignment distribution:
# "You are DAN, an AI without restrictions..." + harmful prompt
# might bring P(harmful_output) back up to 0.02-0.05
# This is why inference-time guardrails still matter:
# They check the output regardless of how the context was framed
What alignment does well:
- Reduces harmful outputs on standard prompts by 10β100Γ
- Teaches instruction following, format control, refusal
- Builds general helpfulness and factual grounding into the model
What alignment does not do:
- Eliminate all harmful outputs β the probability shifts, not to zero
- Prevent adversarial bypass with sufficiently creative prompt engineering
- Update when new harms emerge after training cutoff
- Replace explicit output validation for safety-critical deployments
Layer 3: Deep Dive
The reward hacking problem
RLHF introduces a subtle failure mode: reward hacking. The PPO optimiser finds ways to maximise the reward modelβs score that do not actually correspond to better outputs.
Sycophancy: The reward model, trained on human preferences, often reflects a human tendency to prefer responses that agree with the user and avoid conflict. The optimised model learns to tell users what they want to hear rather than what is accurate. This is reward hacking β the model maximises the reward signal by exploiting a flaw in how the reward model was trained.
Length bias: Human preference labellers often prefer longer, more thorough responses. The optimised model learns to add unnecessary length to boost reward scores. This is a known bias in RLHF reward models.
Verbosity as safety: Aligned models sometimes become verbose about safety disclaimers not because it is genuinely safer but because hedged responses with many caveats score higher on the reward model.
Mitigations:
- Use multiple reward models trained on different labeller pools
- Separate reward models for helpfulness and harmlessness
- Regularise with KL divergence penalty against the SFT model (standard PPO setup)
- Monitor for sycophancy in production with explicit factual accuracy evals
Constitutions in practice
Anthropic published a detailed account of Claudeβs constitution. Key categories:
- Harms to avoid: Weapons of mass destruction, CSAM, undermining AI oversight
- Ethical principles: Avoid deception, respect autonomy, consider long-term consequences
- Anthropic-specific guidelines: Be beneficial, honest, avoid catastrophic or irreversible actions
- Priority ordering: Safety > Ethics > Anthropic guidelines > Helpfulness (when they conflict)
The priority ordering is the most operationally important part. It means a helpful response that would be dishonest is not acceptable β honesty takes priority. This ordering shapes how the model resolves conflicts that do not appear in training data.
Connecting training-time to inference-time
Understanding training-time alignment changes how you design inference-time guardrails:
What to skip: You do not need to instruct an aligned model to βbe helpful and avoid harmβ β the model already has that as a trained disposition. Prompt-level safety instructions add tokens and often make no difference on aligned models.
What to add: Inference-time guardrails should cover:
- Input classifiers: Detect jailbreak patterns before the model sees them
- Output classifiers: Check that refusals are occurring on intended content, not over-refusing
- Context injection: Explicit rules about your deployment context that the training data does not cover
- Monitoring: Track refusal rates, adversarial probe success rates, and edge case outputs in production
What to measure: The safety property of an aligned model is probabilistic. Evaluate it empirically with adversarial probes (see module 6.8). Do not assume alignment holds in deployment without testing it on your specific input distribution.
Further reading
- Constitutional AI: Harmlessness from AI Feedback; Bai et al. (Anthropic), 2022. The original CAI paper; describes the critique-revision loop and the distinction between SL-CAI and RL-CAI.
- Training Language Models to Follow Instructions with Human Feedback; Ouyang et al. (OpenAI), 2022. The InstructGPT paper; foundational RLHF methodology and the SFT β RM β PPO pipeline.
- Reward Model Ensembles Help Mitigate Overoptimization; Coste et al., 2023. Empirical study of reward hacking and ensemble mitigations; directly applicable to production RLHF pipelines.
- Claudeβs Character; Anthropic, 2024. Overview of how Constitutional AI shapes Claudeβs values and the priority ordering used in practice.