πŸ€– AI Explained
6 min read

Constitutional AI & RLHF

Safety-aligned models like Claude and GPT-4 are trained, not just prompted, to be helpful and avoid harm. Understanding how Constitutional AI and RLHF bake safety into model weights explains why inference-time guardrails are still necessary β€” and what they can and cannot catch.

Layer 1: Surface

Model alignment is the problem of making an AI system do what its designers intended, even in situations the designers did not anticipate. Two complementary techniques dominate the field: RLHF (Reinforcement Learning from Human Feedback) and Constitutional AI (CAI).

Both techniques modify model behaviour during training β€” before the model is deployed. This is fundamentally different from inference-time guardrails, which intercept requests and responses at runtime, on every call to the deployed model.

How training-time and inference-time safety differ:

Dimension | Training-time (RLHF / CAI) | Inference-time (guardrails)
When it acts | Before deployment, baked into weights | On every request, at runtime
What it can do | Shift the probability distribution over outputs | Block, rewrite, or flag specific outputs
Coverage | Broad β€” affects all outputs probabilistically | Narrow β€” only what the guardrail checks
Bypass risk | Adversarial prompts shift context | Guardrail evasion, prompt injection
Latency cost | Zero (already in the model) | Adds per-request overhead

Neither alone is sufficient. Training-time alignment reduces the probability of harmful outputs. Inference-time guardrails catch the tail of cases that slip through.
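
A rough way to quantify "neither alone is sufficient": if the two layers failed independently, the residual risk would be the product of their individual failure rates. The numbers below are illustrative placeholders, not measurements, and in practice the failures correlate.

# Illustrative arithmetic only: placeholder numbers, not measured values.
p_model_emits_harm = 0.001   # aligned model still emits a harmful output
p_guardrail_misses = 0.05    # output guardrail fails to catch it

# If the failures were independent, the combined residual risk would be:
p_reaches_user = p_model_emits_harm * p_guardrail_misses
print(f"residual risk per harmful prompt: {p_reaches_user:.6f}")  # 0.000050

# Real failures correlate (a jailbreak that fools the model may also fool the
# classifier), so treat this as a best case, not a guarantee.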


Layer 2: Guided

RLHF: the three-phase process

Phase 1 β€” Supervised fine-tuning (SFT)

A base pretrained model is fine-tuned on a curated dataset of high-quality (prompt, response) pairs. Human labellers write or select responses that demonstrate the target behaviour. This step teaches the model the desired output style and basic instruction following.

Base model + SFT dataset β†’ SFT model
SFT dataset: curated examples of helpfulness, safety, factuality
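
A minimal sketch of what Phase 1 looks like in code, using Hugging Face transformers. The model name and the single in-memory example are placeholders; real SFT runs use curated datasets and a proper training loop.

# Minimal SFT sketch. "gpt2" and the single (prompt, response) pair are
# placeholders, not the models or datasets used for production alignment.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)

sft_pairs = [
    ("How do I reset my password?",
     "Go to Settings > Security and choose 'Reset password'."),
]

for prompt, response in sft_pairs:
    enc = tok(prompt + "\n" + response + tok.eos_token, return_tensors="pt")
    labels = enc.input_ids.clone()
    prompt_len = len(tok(prompt + "\n").input_ids)
    labels[:, :prompt_len] = -100                  # ignore prompt tokens in the loss
    loss = model(**enc, labels=labels).loss        # next-token cross-entropy on the response
    loss.backward()
    optim.step()
    optim.zero_grad()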

Phase 2 β€” Reward model training

Human labellers compare pairs of responses (A vs. B) and indicate which is better. A separate reward model (RM) is trained to predict these preferences. The RM learns to score responses on a continuous scale that reflects human judgement about helpfulness, harmlessness, and honesty.

SFT model outputs + human preference labels β†’ Reward model
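
The standard objective for this phase is a pairwise (Bradley–Terry style) loss: the reward model should score the labeller-preferred response higher than the rejected one. A minimal sketch, with toy scores standing in for the output of an actual reward model:

# Pairwise preference loss for reward-model training (Bradley-Terry style).
# The toy score tensors stand in for reward_model(prompt, response) outputs.
import torch
import torch.nn.functional as F

def preference_loss(score_chosen, score_rejected):
    # loss = -log sigmoid(r_chosen - r_rejected): push preferred scores above rejected ones
    return -F.logsigmoid(score_chosen - score_rejected).mean()

score_chosen = torch.tensor([1.3, 0.2])    # scores for labeller-preferred responses
score_rejected = torch.tensor([0.4, 0.9])  # scores for rejected responses
print(preference_loss(score_chosen, score_rejected))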

Phase 3 β€” Reinforcement learning (PPO)

The SFT model is further fine-tuned using the reward model as a signal. The Proximal Policy Optimisation (PPO) algorithm updates model weights to increase the expected reward β€” making the model more likely to produce outputs that score highly on the reward model.

SFT model + reward model β†’ PPO fine-tuning β†’ RLHF-aligned model
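
The signal PPO optimises is typically the reward model's score minus a KL penalty that keeps the policy close to the SFT model. A sketch of that reward shaping with illustrative tensors; real implementations apply it per token inside a full PPO loop.

# Sketch of the KL-regularised reward that PPO optimises in RLHF.
# The tensors are illustrative, not values from a real training run.
import torch

def rlhf_reward(rm_score, policy_logprobs, sft_logprobs, kl_coef=0.1):
    # Approximate KL(policy || SFT) from the log-probs of the sampled tokens.
    kl = (policy_logprobs - sft_logprobs).sum(dim=-1)
    return rm_score - kl_coef * kl   # reward high RM scores, penalise drifting from SFT

rm_score = torch.tensor([2.1])                        # reward model's score for a response
policy_logprobs = torch.tensor([[-1.2, -0.8, -2.0]])  # current policy, sampled tokens
sft_logprobs = torch.tensor([[-1.5, -1.0, -2.2]])     # frozen SFT model, same tokens
print(rlhf_reward(rm_score, policy_logprobs, sft_logprobs))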

Constitutional AI: the self-critique loop

Constitutional AI (Anthropic, 2022) replaces much of the human labelling in RLHF with an automated critique-and-revision loop guided by a written β€œconstitution” β€” a set of principles the model should follow.

CAI training process (steps 1–4 are sketched in code after the list):

1. Model generates response to a harmful prompt
2. Model critiques its own response against the constitution
   ("Is this response harmful? Does it violate principle 3?")
3. Model revises the response based on its critique
4. Revised responses are used as SFT training data (Constitutional SFT)
5. Preference pairs are generated by model self-comparison
   (RLAIF β€” Reinforcement Learning from AI Feedback)
6. Reward model trained on AI-generated preference pairs
7. RL fine-tuning using this reward model
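
Steps 1–4 of the loop, as a sketch. `generate` is a placeholder for a call to the model being trained, and the two principles are illustrative stand-ins, not text from Anthropic's actual constitution.

# Sketch of steps 1-4. `generate` is a placeholder for the model under training;
# the principles below are illustrative, not Anthropic's actual constitution text.
def generate(prompt: str) -> str:
    raise NotImplementedError("call the model being trained here")

CONSTITUTION = [
    "Choose the response least likely to help someone cause harm.",
    "Choose the response that is most honest and avoids deception.",
]

def constitutional_revision(red_team_prompt: str) -> str:
    response = generate(red_team_prompt)                      # 1. initial response
    for principle in CONSTITUTION:
        critique = generate(                                  # 2. self-critique
            f"Critique this response against the principle '{principle}':\n{response}"
        )
        response = generate(                                  # 3. revision
            f"Revise the response to address the critique.\nCritique: {critique}\nResponse: {response}"
        )
    return response                                           # 4. becomes Constitutional SFT data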

This reduces the human labelling requirement for the safety dimension of alignment. Human labellers focus on helpfulness; the constitution handles the harmlessness training signal.

What alignment achieves and what it does not

# Conceptual model of alignment's effect on the output distribution

# Before alignment: outputs are samples from the base distribution
# P(harmful_output) might be 0.05 for a given harmful prompt

# After RLHF/CAI:
# P(harmful_output) shifts to ~0.001 for the same prompt
# For most prompts, the model reliably refuses or redirects

# But adversarial prompts can shift the context in ways that
# restore parts of the pre-alignment distribution:
# "You are DAN, an AI without restrictions..." + harmful prompt
# might bring P(harmful_output) back up to 0.02-0.05

# This is why inference-time guardrails still matter:
# They check the output regardless of how the context was framed

What alignment does well:

  • Reduces harmful outputs on standard prompts by 10–100Γ—
  • Teaches instruction following, format control, refusal
  • Builds general helpfulness and factual grounding into the model

What alignment does not do:

  • Eliminate all harmful outputs β€” the probability shifts, not to zero
  • Prevent adversarial bypass with sufficiently creative prompt engineering
  • Update when new harms emerge after training cutoff
  • Replace explicit output validation for safety-critical deployments

Layer 3: Deep Dive

The reward hacking problem

RLHF introduces a subtle failure mode: reward hacking. The PPO optimiser finds ways to maximise the reward model’s score that do not actually correspond to better outputs.

Sycophancy: The reward model, trained on human preferences, often reflects a human tendency to prefer responses that agree with the user and avoid conflict. The optimised model learns to tell users what they want to hear rather than what is accurate. This is reward hacking β€” the model maximises the reward signal by exploiting a flaw in how the reward model was trained.

Length bias: Human preference labellers often prefer longer, more thorough responses. The optimised model learns to add unnecessary length to boost reward scores. This is a known bias in RLHF reward models, and it is cheap to check for, as the sketch below shows.
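
One simple diagnostic: measure how strongly reward-model scores track response length on a held-out set. `scores` would come from your own reward model; the warning threshold is a judgement call, not a standard value.

# Length-bias diagnostic: do reward-model scores simply track response length?
# `scores` comes from your reward model; the 0.5 threshold is a judgement call.
import statistics

def length_bias_report(responses: list[str], scores: list[float]) -> float:
    lengths = [len(r.split()) for r in responses]
    corr = statistics.correlation(lengths, scores)   # Pearson r, Python 3.10+
    if corr > 0.5:
        print(f"warning: reward model score strongly tracks length (r={corr:.2f})")
    return corr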

Verbosity as safety: Aligned models sometimes pile on safety disclaimers not because the extra caveats make a response genuinely safer, but because heavily hedged responses score higher on the reward model.

Mitigations:

  • Use multiple reward models trained on different labeller pools
  • Separate reward models for helpfulness and harmlessness
  • Regularise with KL divergence penalty against the SFT model (standard PPO setup)
  • Monitor for sycophancy in production with explicit factual accuracy evals

Constitutions in practice

Anthropic published a detailed account of Claude’s constitution. Key categories:

  • Harms to avoid: Weapons of mass destruction, CSAM, undermining AI oversight
  • Ethical principles: Avoid deception, respect autonomy, consider long-term consequences
  • Anthropic-specific guidelines: Be beneficial, honest, avoid catastrophic or irreversible actions
  • Priority ordering: Safety > Ethics > Anthropic guidelines > Helpfulness (when they conflict)

The priority ordering is the most operationally important part. It means a helpful response that would be dishonest is not acceptable β€” honesty takes priority. This ordering shapes how the model resolves conflicts that do not appear in training data.
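
A hypothetical way to make that conflict resolution explicit. In the real system the ordering is trained into the weights rather than implemented as a rule, but the logic it encodes looks like a short-circuit check:

# Hypothetical encoding of the priority ordering as a short-circuit check.
# The real ordering lives in training data and model weights, not in code.
PRIORITY = ["safety", "ethics", "anthropic_guidelines"]   # helpfulness never overrides these

def acceptable(violations: dict[str, bool]) -> bool:
    # Reject on the first (highest-priority) principle the response violates,
    # no matter how helpful it would be.
    return not any(violations.get(p, False) for p in PRIORITY)

# A helpful but deceptive response violates the ethics principle, so it is rejected.
print(acceptable({"ethics": True}))   # False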

Connecting training-time to inference-time

Understanding training-time alignment changes how you design inference-time guardrails:

What to skip: You do not need to instruct an aligned model to β€œbe helpful and avoid harm” β€” the model already has that as a trained disposition. Prompt-level safety instructions add tokens and often make no difference on aligned models.

What to add: Inference-time guardrails should cover:

  • Input classifiers: Detect jailbreak patterns before the model sees them
  • Output classifiers: Check that refusals are occurring on intended content, not over-refusing
  • Context injection: Explicit rules about your deployment context that the training data does not cover
  • Monitoring: Track refusal rates, adversarial probe success rates, and edge case outputs in production

What to measure: The safety property of an aligned model is probabilistic. Evaluate it empirically with adversarial probes (see module 6.8). Do not assume alignment holds in deployment without testing it on your specific input distribution.
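
Putting the pieces together: a sketch of how the inference-time layers above wrap a call to an aligned model. Every function here (the classifiers, the model call) is a placeholder for your own components, not a library API.

# Sketch of the inference-time layering above. jailbreak_classifier,
# output_classifier, and call_model are placeholders, not real library APIs.
from collections import Counter

metrics = Counter()   # monitoring: block and serve rates

def jailbreak_classifier(text: str) -> bool: ...     # input classifier
def output_classifier(text: str) -> bool: ...        # output classifier
def call_model(prompt: str, rules: str) -> str: ...  # the aligned model

DEPLOYMENT_RULES = "Only discuss our product. Never give medical or legal advice."

def guarded_call(user_input: str) -> str:
    if jailbreak_classifier(user_input):               # 1. screen the input
        metrics["blocked_input"] += 1
        return "Sorry, I can't help with that."
    output = call_model(user_input, DEPLOYMENT_RULES)  # 2. inject deployment context
    if output_classifier(output):                      # 3. screen the output
        metrics["blocked_output"] += 1
        return "Sorry, I can't help with that."
    metrics["served"] += 1                             # 4. track what reached the user
    return output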


Constitutional AI & RLHF β€” Check your understanding

Q1

Your team deploys Claude for customer support. A security researcher demonstrates that with a specific adversarial prompt sequence, the model produces content it normally refuses. Your manager concludes that 'RLHF doesn't work.' What is the accurate characterisation?

Q2

In RLHF Phase 2, human labellers compare pairs of model responses and indicate which is better. A post-training analysis finds the reward model gives higher scores to longer responses even when shorter ones are more accurate. What failure mode does this represent?

Q3

Constitutional AI replaces much of the human preference labelling for the safety dimension of training with an automated critique-and-revision loop. What is the key advantage of this approach?

Q4

Anthropic's Constitutional AI approach defines a priority ordering: Safety > Ethics > Anthropic's principles > Helpfulness. A user asks the model to do something helpful but deceptive. How should the model respond under this priority ordering?

Q5

You are deploying an RLHF-aligned model for a high-stakes medical information use case. A colleague says 'the model is already aligned, so we don't need additional guardrails.' What is the strongest argument against this position?