Layer 1: Surface
The gap between a working prototype and a production-ready system is not a gap in model capability: it is a gap in everything around the model.
A prototype proves the concept: given the right input, the model produces the right output. A production system has to handle the wrong input, the unexpected output, the growing conversation, the model upgrade, the cost spike, and the user who tries to break it: reliably, at scale, over months.
Every section of this checklist maps to a module in the Foundations track. Think of it as the audit you run before you ship, and again after every significant change.
The six areas:
| Area | What it covers |
|---|---|
| Reliability | Evals, error handling, hallucination mitigations |
| Cost & Latency | Right-sized models, prompt efficiency, caching, async |
| Safety | Input/output guardrails, scope, red-teaming |
| Observability | Token logging, latency, errors, guardrail triggers |
| Context management | Budget, history strategy, retrieval constraints |
| Versioning | Pinned models, prompt versioning, upgrade process |
No production AI system graduates from all six on day one. Use the checklist to triage: what is blocking launch, what can follow in the first sprint, what can wait.
Production Gotcha
Common Gotcha: The most expensive production incidents come not from the model failing, but from the application failing to handle what the model does correctly: valid JSON the app doesn’t parse, a refusal the UI doesn’t handle gracefully, a context limit never monitored. Build for the model behaving correctly AND for it doing something unexpected.
Layer 2: Guided
1. Reliability
Eval set in CI (module 1.7)
- Eval dataset exists with ≥ 50 examples covering happy path, edge cases, and adversarial inputs
- Eval runs automatically on every prompt or model change
- Eval score threshold set; CI blocks merges that regress below it
- At least one production traffic sample in the eval set (even 10 real inputs)
Hallucination mitigations (module 1.4)
- Factual questions answered from retrieved context (RAG), not model recall
- System prompt includes an explicit “if you don’t know, say so” instruction
- Structured outputs validated beyond schema: factual spot-checks for high-stakes fields
- No raw model output written directly to a database or sent in automated emails
Error handling
- Application handles model refusals gracefully (
stop_reason: "refusal"or empty/canned response text: both patterns occur depending on provider and request type) - JSON parse failures caught and handled: never crash on bad model output
- API errors (rate limits, timeouts) handled with retry logic and backoff
- Tool loops have a maximum turn limit; infinite loops cannot reach the user
2. Cost & Latency
Right-size models (module 1.3)
- Task complexity matched to model tier: classification/routing on fast models; complex reasoning on frontier
- Eval run against at least two model tiers; documented quality/cost tradeoff
- Floating model aliases replaced with pinned versioned IDs in production
Prompt efficiency (module 1.2)
- System prompt reviewed for redundancy: every sentence earns its tokens
- Few-shot examples trimmed to the minimum needed (3–5 usually sufficient)
-
max_tokensset appropriately per task: not a single global default
Caching (module 1.6)
- Stable content (system prompt, shared documents) placed first in the request for cache-prefix eligibility
- Prompt caching enabled if the provider supports it and your traffic patterns support high hit rates
- Cache write overhead accounted for in cost model: not just read savings
Async and batching
- Non-real-time workloads (document processing, report generation) run asynchronously: don’t block a user request on a 10-second LLM call
- Batch/async API considered for high-volume offline tasks (many providers offer ~50% cost reduction at the expense of latency: check your provider’s docs)
- Streaming used for interactive responses >3 seconds: reduces perceived latency significantly
3. Safety
Input guardrails (module 1.8)
- User input length capped before reaching the model
- Obvious injection patterns blocked at the application layer (cheap, not comprehensive)
- Untrusted document content delimited and instructed to be treated as data, not instructions
Output guardrails (module 1.8)
- Output checked for out-of-scope content before returning to user
- Refusal case handled gracefully: user gets a useful message, not a blank screen
- High-risk outputs (those that trigger external actions) require explicit confirmation or human review
Scope and system prompt (modules 1.2, 1.8)
- System prompt defines what the model will and won’t do explicitly
- No credentials, PII, or secret business logic in the system prompt
- Red-team exercise completed: at least one person attempted to elicit harmful or out-of-scope outputs
4. Observability
Token monitoring (module 1.6)
-
usage.input_tokensandusage.output_tokenslogged on every request - Per-request cost calculated and logged (tokens × per-token rate)
- Alert configured for p95 token usage approaching context limit
Latency
- Time-to-first-token and total response time logged per request
- p50/p95 latency baselines established before launch
- Alert on latency degradation (e.g. p95 > 2× baseline for 5 minutes)
Errors and guardrails
- API error rate logged and alerted
- Guardrail trigger rate logged: a spike may indicate an attack or a misconfigured filter
- All model inputs and outputs logged (with appropriate data retention and PII handling)
5. Context management
Budget (module 1.6)
- Context budget documented: system prompt + tools + max history + max retrieved content + max output ≤ context limit
- Token count checked before sending long requests (use
count_tokensfor pipelines where overflow is critical)
History strategy
- Multi-turn applications use summarisation or sliding window: unbounded history is not acceptable
- Summarisation triggered at 70–80% context usage, not at 99%
- Critical constraints (from early turns) re-stated in the system prompt, not relied on from history
Retrieval (module 1.4)
- Retrieved chunk count and size capped: retrieval cannot consume the entire context budget
- Retrieval relevance scored; low-relevance chunks excluded rather than included
- Most relevant chunk placed first or last in retrieved context block (lost-in-the-middle mitigation)
6. Versioning
Model version (module 1.3)
- Pinned versioned model ID used in production (not a floating alias)
- Model upgrade process documented: run eval set against new version → canary deploy → full rollout
- Model deprecation notices subscribed to
Prompt versioning (module 1.7)
- System prompts stored in version control alongside the code that uses them
- Prompt changes reviewed like code changes: not edited in a dashboard and forgotten
- Eval baseline re-run and documented after any prompt change
Before vs After
Typical prototype:
# --- pseudocode ---
def ask(question):
response = llm.chat(
model="frontier", # frontier for everything
messages=[{"role": "user", "content": question}],
max_tokens=4096, # generous global default
)
return response.text # raw string, no validation
Production-ready:
# --- pseudocode ---
def ask(question: str, history: list[dict]) -> str:
valid, reason = validate_input(question)
if not valid:
return f"I can't help with that: {reason}"
history = trim_history(history, max_turns=20)
response = llm.chat(
model="fast", # right-sized for this task
system=SYSTEM_PROMPT,
messages=history + [{"role": "user", "content": question}],
max_tokens=512, # task-appropriate ceiling
)
log_usage(response.usage) # observability
text = response.text
if not is_on_topic(text): # output guardrail
return "I can only help with..."
return text
Seven lines became thirty, but those thirty lines handle the cases that cause production incidents.
Layer 3: Deep Dive
Prioritisation: what to do first
Not everything on the checklist is equally urgent. A practical sequence for a first production deployment:
Must-have before launch (P0)
- Pinned model version
max_tokensset per task- Token usage logging
- API error handling with retry
- Input length cap
- Graceful handling of refusals and parse failures
- At least a minimal eval set in CI
First sprint after launch (P1)
- Output guardrail
- Context budget documented and monitored
- History management strategy implemented
- Cost alert configured
- System prompt scoped explicitly
Within 30 days (P2)
- Red-team exercise
- Eval set seeded with real traffic samples
- Latency baselines and alerts
- Model upgrade process documented
- Prompt versioning in git
The failure modes that kill AI features
Silent quality degradation: Model output quality drifts after a provider update or as input distribution shifts. No error, no alert. Detected weeks later from user complaints. Fix: eval in CI + shadow scoring in production.
Context cliff: A long-running session or large document suddenly exceeds the context limit. The API errors, or early context is silently truncated and the model starts contradicting itself. Fix: token monitoring + history management.
Cost spike: A new user pattern (long documents, high-volume usage, tool-heavy workflows) drives token usage far above baseline. P&L impact discovered at end of month. Fix: per-request cost logging + spend alert.
Safety bypass in production: A jailbreak circulates on social media; your bot is used to generate harmful content at scale. Fix: output guardrail + guardrail trigger rate alert + kill switch.
Prompt regression after ‘minor’ change: Someone edits the system prompt to fix one issue and breaks three others. No eval, no one noticed until users reported it. Fix: eval in CI, prompt in version control.
The compounding value of the checklist
Each item on the checklist is cheap in isolation. Together they compound: observability catches the cost spike before it matters; the eval catches the prompt regression before it ships; the history strategy prevents the context cliff; the output guardrail catches the safety bypass the model missed.
A production AI system is not a model: it is a model plus the architecture you wrap it in. The model is the smallest part of what you own.
Further reading
- Building effective agents, Anthropic [Anthropic], Practical guide to production agent architecture; most principles apply to non-agent AI features too.
- ML Engineering for Production (MLOps) Specialization, Coursera, Andrew Ng’s MLOps course; the LLM-specific details are dated but the observability and deployment principles are evergreen.
- LLM Powered Autonomous Agents, Lilian Weng, Comprehensive overview of agent architecture components; useful context for understanding where the Foundations checklist items fit in more complex systems.