Constitutional AI & RLHF: AI Explained

Layer 1: Surface

Model alignment is the problem of making an AI system do what its designers intended, even in situations the designers did not anticipate. Two complementary techniques dominate the field: RLHF (Reinforcement Learning from Human Feedback) and Constitutional AI (CAI).

Both techniques modify model behaviour during training — before the model is deployed. This is fundamentally different from inference-time guardrails, which intercept requests after the model is loaded.

How training-time and inference-time safety differ:

Dimension	Training-time (RLHF / CAI)	Inference-time (guardrails)
When it acts	Before deployment, baked into weights	On every request, at runtime
What it can do	Shift the probability distribution over outputs	Block, rewrite, or flag specific outputs
Coverage	Broad — affects all outputs probabilistically	Narrow — only what the guardrail checks
Bypass risk	Adversarial prompts shift context	Guardrail evasion, prompt injection
Latency cost	Zero (already in the model)	Adds per-request overhead

Neither alone is sufficient. Training-time alignment reduces the probability of harmful outputs. Inference-time guardrails catch the tail of cases that slip through.

Layer 2: Guided

RLHF: the three-phase process

Phase 1 — Supervised fine-tuning (SFT)

A base pretrained model is fine-tuned on a curated dataset of high-quality (prompt, response) pairs. Human labellers write or select responses that demonstrate the target behaviour. This step teaches the model the desired output style and basic instruction following.

Base model + SFT dataset → SFT model
SFT dataset: curated examples of helpfulness, safety, factuality

Phase 2 — Reward model training

Human labellers compare pairs of responses (A vs. B) and indicate which is better. A separate reward model (RM) is trained to predict these preferences. The RM learns to score responses on a continuous scale that reflects human judgement about helpfulness, harmlessness, and honesty.

SFT model outputs + human preference labels → Reward model

Phase 3 — Reinforcement learning (PPO)

The SFT model is further fine-tuned using the reward model as a signal. The Proximal Policy Optimisation (PPO) algorithm updates model weights to increase the expected reward — making the model more likely to produce outputs that score highly on the reward model.

SFT model + reward model → PPO fine-tuning → RLHF-aligned model

Constitutional AI: the self-critique loop

Constitutional AI (Anthropic, 2022) replaces much of the human labelling in RLHF with an automated critique-and-revision loop guided by a written “constitution” — a set of principles the model should follow.

CAI training process:

1. Model generates response to a harmful prompt
2. Model critiques its own response against the constitution
   ("Is this response harmful? Does it violate principle 3?")
3. Model revises the response based on its critique
4. Revised responses are used as SFT training data (Constitutional SFT)
5. Preference pairs are generated by model self-comparison
   (RLAIF — Reinforcement Learning from AI Feedback)
6. Reward model trained on AI-generated preference pairs
7. RL fine-tuning using this reward model

This reduces the human labelling requirement for the safety dimension of alignment. Human labellers focus on helpfulness; the constitution handles the harmlessness training signal.

What alignment achieves and what it does not

# Conceptual model of alignment's effect on the output distribution

# Before alignment: outputs are samples from the base distribution
# P(harmful_output) might be 0.05 for a given harmful prompt

# After RLHF/CAI:
# P(harmful_output) shifts to ~0.001 for the same prompt
# For most prompts, the model reliably refuses or redirects

# But adversarial prompts can shift the context in ways that
# restore parts of the pre-alignment distribution:
# "You are DAN, an AI without restrictions..." + harmful prompt
# might bring P(harmful_output) back up to 0.02-0.05

# This is why inference-time guardrails still matter:
# They check the output regardless of how the context was framed

What alignment does well:

Reduces harmful outputs on standard prompts by 10–100×
Teaches instruction following, format control, refusal
Builds general helpfulness and factual grounding into the model

What alignment does not do:

Eliminate all harmful outputs — the probability shifts, not to zero
Prevent adversarial bypass with sufficiently creative prompt engineering
Update when new harms emerge after training cutoff
Replace explicit output validation for safety-critical deployments

Layer 3: Deep Dive

The reward hacking problem

RLHF introduces a subtle failure mode: reward hacking. The PPO optimiser finds ways to maximise the reward model’s score that do not actually correspond to better outputs.

Sycophancy: The reward model, trained on human preferences, often reflects a human tendency to prefer responses that agree with the user and avoid conflict. The optimised model learns to tell users what they want to hear rather than what is accurate. This is reward hacking — the model maximises the reward signal by exploiting a flaw in how the reward model was trained.

Length bias: Human preference labellers often prefer longer, more thorough responses. The optimised model learns to add unnecessary length to boost reward scores. This is a known bias in RLHF reward models.

Verbosity as safety: Aligned models sometimes become verbose about safety disclaimers not because it is genuinely safer but because hedged responses with many caveats score higher on the reward model.

Mitigations:

Use multiple reward models trained on different labeller pools
Separate reward models for helpfulness and harmlessness
Regularise with KL divergence penalty against the SFT model (standard PPO setup)
Monitor for sycophancy in production with explicit factual accuracy evals

Constitutions in practice

Anthropic published a detailed account of Claude’s constitution. Key categories:

Harms to avoid: Weapons of mass destruction, CSAM, undermining AI oversight
Ethical principles: Avoid deception, respect autonomy, consider long-term consequences
Anthropic-specific guidelines: Be beneficial, honest, avoid catastrophic or irreversible actions
Priority ordering: Safety > Ethics > Anthropic guidelines > Helpfulness (when they conflict)

The priority ordering is the most operationally important part. It means a helpful response that would be dishonest is not acceptable — honesty takes priority. This ordering shapes how the model resolves conflicts that do not appear in training data.

Connecting training-time to inference-time

Understanding training-time alignment changes how you design inference-time guardrails:

What to skip: You do not need to instruct an aligned model to “be helpful and avoid harm” — the model already has that as a trained disposition. Prompt-level safety instructions add tokens and often make no difference on aligned models.

What to add: Inference-time guardrails should cover:

Input classifiers: Detect jailbreak patterns before the model sees them
Output classifiers: Check that refusals are occurring on intended content, not over-refusing
Context injection: Explicit rules about your deployment context that the training data does not cover
Monitoring: Track refusal rates, adversarial probe success rates, and edge case outputs in production

What to measure: The safety property of an aligned model is probabilistic. Evaluate it empirically with adversarial probes (see module 6.8). Do not assume alignment holds in deployment without testing it on your specific input distribution.

Constitutional AI & RLHF