Prototype to Production Checklist: AI Explained

Layer 1: Surface

The gap between a working prototype and a production-ready system is not a gap in model capability: it is a gap in everything around the model.

A prototype proves the concept: given the right input, the model produces the right output. A production system has to handle the wrong input, the unexpected output, the growing conversation, the model upgrade, the cost spike, and the user who tries to break it: reliably, at scale, over months.

Every section of this checklist maps to a module in the Foundations track. Think of it as the audit you run before you ship, and again after every significant change.

The six areas:

Area	What it covers
Reliability	Evals, error handling, hallucination mitigations
Cost & Latency	Right-sized models, prompt efficiency, caching, async
Safety	Input/output guardrails, scope, red-teaming
Observability	Token logging, latency, errors, guardrail triggers
Context management	Budget, history strategy, retrieval constraints
Versioning	Pinned models, prompt versioning, upgrade process

No production AI system graduates from all six on day one. Use the checklist to triage: what is blocking launch, what can follow in the first sprint, what can wait.

Production Gotcha

Common Gotcha: The most expensive production incidents come not from the model failing, but from the application failing to handle what the model does correctly: valid JSON the app doesn’t parse, a refusal the UI doesn’t handle gracefully, a context limit never monitored. Build for the model behaving correctly AND for it doing something unexpected.

Layer 2: Guided

1. Reliability

Eval set in CI (module 1.7)

Eval dataset exists with ≥ 50 examples covering happy path, edge cases, and adversarial inputs
Eval runs automatically on every prompt or model change
Eval score threshold set; CI blocks merges that regress below it
At least one production traffic sample in the eval set (even 10 real inputs)

Hallucination mitigations (module 1.4)

Factual questions answered from retrieved context (RAG), not model recall
System prompt includes an explicit “if you don’t know, say so” instruction
Structured outputs validated beyond schema: factual spot-checks for high-stakes fields
No raw model output written directly to a database or sent in automated emails

Error handling

Application handles model refusals gracefully (stop_reason: "refusal" or empty/canned response text: both patterns occur depending on provider and request type)
JSON parse failures caught and handled: never crash on bad model output
API errors (rate limits, timeouts) handled with retry logic and backoff
Tool loops have a maximum turn limit; infinite loops cannot reach the user

2. Cost & Latency

Right-size models (module 1.3)

Task complexity matched to model tier: classification/routing on fast models; complex reasoning on frontier
Eval run against at least two model tiers; documented quality/cost tradeoff
Floating model aliases replaced with pinned versioned IDs in production

Prompt efficiency (module 1.2)

System prompt reviewed for redundancy: every sentence earns its tokens
Few-shot examples trimmed to the minimum needed (3–5 usually sufficient)
max_tokens set appropriately per task: not a single global default

Caching (module 1.6)

Stable content (system prompt, shared documents) placed first in the request for cache-prefix eligibility
Prompt caching enabled if the provider supports it and your traffic patterns support high hit rates
Cache write overhead accounted for in cost model: not just read savings

Async and batching

Non-real-time workloads (document processing, report generation) run asynchronously: don’t block a user request on a 10-second LLM call
Batch/async API considered for high-volume offline tasks (many providers offer ~50% cost reduction at the expense of latency: check your provider’s docs)
Streaming used for interactive responses >3 seconds: reduces perceived latency significantly

3. Safety

Input guardrails (module 1.8)

User input length capped before reaching the model
Obvious injection patterns blocked at the application layer (cheap, not comprehensive)
Untrusted document content delimited and instructed to be treated as data, not instructions

Output guardrails (module 1.8)

Output checked for out-of-scope content before returning to user
Refusal case handled gracefully: user gets a useful message, not a blank screen
High-risk outputs (those that trigger external actions) require explicit confirmation or human review

Scope and system prompt (modules 1.2, 1.8)

System prompt defines what the model will and won’t do explicitly
No credentials, PII, or secret business logic in the system prompt
Red-team exercise completed: at least one person attempted to elicit harmful or out-of-scope outputs

4. Observability

Token monitoring (module 1.6)

usage.input_tokens and usage.output_tokens logged on every request
Per-request cost calculated and logged (tokens × per-token rate)
Alert configured for p95 token usage approaching context limit

Latency

Time-to-first-token and total response time logged per request
p50/p95 latency baselines established before launch
Alert on latency degradation (e.g. p95 > 2× baseline for 5 minutes)

Errors and guardrails

API error rate logged and alerted
Guardrail trigger rate logged: a spike may indicate an attack or a misconfigured filter
All model inputs and outputs logged (with appropriate data retention and PII handling)

5. Context management

Budget (module 1.6)

Context budget documented: system prompt + tools + max history + max retrieved content + max output ≤ context limit
Token count checked before sending long requests (use count_tokens for pipelines where overflow is critical)

History strategy

Multi-turn applications use summarisation or sliding window: unbounded history is not acceptable
Summarisation triggered at 70–80% context usage, not at 99%
Critical constraints (from early turns) re-stated in the system prompt, not relied on from history

Retrieval (module 1.4)

Retrieved chunk count and size capped: retrieval cannot consume the entire context budget
Retrieval relevance scored; low-relevance chunks excluded rather than included
Most relevant chunk placed first or last in retrieved context block (lost-in-the-middle mitigation)

6. Versioning

Model version (module 1.3)

Pinned versioned model ID used in production (not a floating alias)
Model upgrade process documented: run eval set against new version → canary deploy → full rollout
Model deprecation notices subscribed to

Prompt versioning (module 1.7)

System prompts stored in version control alongside the code that uses them
Prompt changes reviewed like code changes: not edited in a dashboard and forgotten
Eval baseline re-run and documented after any prompt change

Before vs After

Typical prototype:

# --- pseudocode ---
def ask(question):
    response = llm.chat(
        model="frontier",    # frontier for everything
        messages=[{"role": "user", "content": question}],
        max_tokens=4096,     # generous global default
    )
    return response.text     # raw string, no validation

Production-ready:

# --- pseudocode ---
def ask(question: str, history: list[dict]) -> str:
    valid, reason = validate_input(question)
    if not valid:
        return f"I can't help with that: {reason}"

    history = trim_history(history, max_turns=20)

    response = llm.chat(
        model="fast",        # right-sized for this task
        system=SYSTEM_PROMPT,
        messages=history + [{"role": "user", "content": question}],
        max_tokens=512,      # task-appropriate ceiling
    )

    log_usage(response.usage)           # observability

    text = response.text
    if not is_on_topic(text):           # output guardrail
        return "I can only help with..."

    return text

Seven lines became thirty, but those thirty lines handle the cases that cause production incidents.

Layer 3: Deep Dive

Prioritisation: what to do first

Not everything on the checklist is equally urgent. A practical sequence for a first production deployment:

Must-have before launch (P0)

Pinned model version
max_tokens set per task
Token usage logging
API error handling with retry
Input length cap
Graceful handling of refusals and parse failures
At least a minimal eval set in CI

First sprint after launch (P1)

Output guardrail
Context budget documented and monitored
History management strategy implemented
Cost alert configured
System prompt scoped explicitly

Within 30 days (P2)

Red-team exercise
Eval set seeded with real traffic samples
Latency baselines and alerts
Model upgrade process documented
Prompt versioning in git

The failure modes that kill AI features

Silent quality degradation: Model output quality drifts after a provider update or as input distribution shifts. No error, no alert. Detected weeks later from user complaints. Fix: eval in CI + shadow scoring in production.

Context cliff: A long-running session or large document suddenly exceeds the context limit. The API errors, or early context is silently truncated and the model starts contradicting itself. Fix: token monitoring + history management.

Cost spike: A new user pattern (long documents, high-volume usage, tool-heavy workflows) drives token usage far above baseline. P&L impact discovered at end of month. Fix: per-request cost logging + spend alert.

Safety bypass in production: A jailbreak circulates on social media; your bot is used to generate harmful content at scale. Fix: output guardrail + guardrail trigger rate alert + kill switch.

Prompt regression after ‘minor’ change: Someone edits the system prompt to fix one issue and breaks three others. No eval, no one noticed until users reported it. Fix: eval in CI, prompt in version control.

The compounding value of the checklist

Each item on the checklist is cheap in isolation. Together they compound: observability catches the cost spike before it matters; the eval catches the prompt regression before it ships; the history strategy prevents the context cliff; the output guardrail catches the safety bypass the model missed.

A production AI system is not a model: it is a model plus the architecture you wrap it in. The model is the smallest part of what you own.

Prototype to Production Checklist