🤖 AI Explained
8 min read

Prototype to Production Checklist

A prototype that works in a demo is not a production system. This capstone synthesises every Foundations concept into a practical checklist: the gaps teams consistently miss when shipping their first AI feature.

Layer 1: Surface

The gap between a working prototype and a production-ready system is not a gap in model capability: it is a gap in everything around the model.

A prototype proves the concept: given the right input, the model produces the right output. A production system has to handle the wrong input, the unexpected output, the growing conversation, the model upgrade, the cost spike, and the user who tries to break it: reliably, at scale, over months.

Every section of this checklist maps to a module in the Foundations track. Think of it as the audit you run before you ship, and again after every significant change.

The six areas:

AreaWhat it covers
ReliabilityEvals, error handling, hallucination mitigations
Cost & LatencyRight-sized models, prompt efficiency, caching, async
SafetyInput/output guardrails, scope, red-teaming
ObservabilityToken logging, latency, errors, guardrail triggers
Context managementBudget, history strategy, retrieval constraints
VersioningPinned models, prompt versioning, upgrade process

No production AI system graduates from all six on day one. Use the checklist to triage: what is blocking launch, what can follow in the first sprint, what can wait.

Production Gotcha

Common Gotcha: The most expensive production incidents come not from the model failing, but from the application failing to handle what the model does correctly: valid JSON the app doesn’t parse, a refusal the UI doesn’t handle gracefully, a context limit never monitored. Build for the model behaving correctly AND for it doing something unexpected.


Layer 2: Guided

1. Reliability

Eval set in CI (module 1.7)

  • Eval dataset exists with ≥ 50 examples covering happy path, edge cases, and adversarial inputs
  • Eval runs automatically on every prompt or model change
  • Eval score threshold set; CI blocks merges that regress below it
  • At least one production traffic sample in the eval set (even 10 real inputs)

Hallucination mitigations (module 1.4)

  • Factual questions answered from retrieved context (RAG), not model recall
  • System prompt includes an explicit “if you don’t know, say so” instruction
  • Structured outputs validated beyond schema: factual spot-checks for high-stakes fields
  • No raw model output written directly to a database or sent in automated emails

Error handling

  • Application handles model refusals gracefully (stop_reason: "refusal" or empty/canned response text: both patterns occur depending on provider and request type)
  • JSON parse failures caught and handled: never crash on bad model output
  • API errors (rate limits, timeouts) handled with retry logic and backoff
  • Tool loops have a maximum turn limit; infinite loops cannot reach the user

2. Cost & Latency

Right-size models (module 1.3)

  • Task complexity matched to model tier: classification/routing on fast models; complex reasoning on frontier
  • Eval run against at least two model tiers; documented quality/cost tradeoff
  • Floating model aliases replaced with pinned versioned IDs in production

Prompt efficiency (module 1.2)

  • System prompt reviewed for redundancy: every sentence earns its tokens
  • Few-shot examples trimmed to the minimum needed (3–5 usually sufficient)
  • max_tokens set appropriately per task: not a single global default

Caching (module 1.6)

  • Stable content (system prompt, shared documents) placed first in the request for cache-prefix eligibility
  • Prompt caching enabled if the provider supports it and your traffic patterns support high hit rates
  • Cache write overhead accounted for in cost model: not just read savings

Async and batching

  • Non-real-time workloads (document processing, report generation) run asynchronously: don’t block a user request on a 10-second LLM call
  • Batch/async API considered for high-volume offline tasks (many providers offer ~50% cost reduction at the expense of latency: check your provider’s docs)
  • Streaming used for interactive responses >3 seconds: reduces perceived latency significantly

3. Safety

Input guardrails (module 1.8)

  • User input length capped before reaching the model
  • Obvious injection patterns blocked at the application layer (cheap, not comprehensive)
  • Untrusted document content delimited and instructed to be treated as data, not instructions

Output guardrails (module 1.8)

  • Output checked for out-of-scope content before returning to user
  • Refusal case handled gracefully: user gets a useful message, not a blank screen
  • High-risk outputs (those that trigger external actions) require explicit confirmation or human review

Scope and system prompt (modules 1.2, 1.8)

  • System prompt defines what the model will and won’t do explicitly
  • No credentials, PII, or secret business logic in the system prompt
  • Red-team exercise completed: at least one person attempted to elicit harmful or out-of-scope outputs

4. Observability

Token monitoring (module 1.6)

  • usage.input_tokens and usage.output_tokens logged on every request
  • Per-request cost calculated and logged (tokens × per-token rate)
  • Alert configured for p95 token usage approaching context limit

Latency

  • Time-to-first-token and total response time logged per request
  • p50/p95 latency baselines established before launch
  • Alert on latency degradation (e.g. p95 > 2× baseline for 5 minutes)

Errors and guardrails

  • API error rate logged and alerted
  • Guardrail trigger rate logged: a spike may indicate an attack or a misconfigured filter
  • All model inputs and outputs logged (with appropriate data retention and PII handling)

5. Context management

Budget (module 1.6)

  • Context budget documented: system prompt + tools + max history + max retrieved content + max output ≤ context limit
  • Token count checked before sending long requests (use count_tokens for pipelines where overflow is critical)

History strategy

  • Multi-turn applications use summarisation or sliding window: unbounded history is not acceptable
  • Summarisation triggered at 70–80% context usage, not at 99%
  • Critical constraints (from early turns) re-stated in the system prompt, not relied on from history

Retrieval (module 1.4)

  • Retrieved chunk count and size capped: retrieval cannot consume the entire context budget
  • Retrieval relevance scored; low-relevance chunks excluded rather than included
  • Most relevant chunk placed first or last in retrieved context block (lost-in-the-middle mitigation)

6. Versioning

Model version (module 1.3)

  • Pinned versioned model ID used in production (not a floating alias)
  • Model upgrade process documented: run eval set against new version → canary deploy → full rollout
  • Model deprecation notices subscribed to

Prompt versioning (module 1.7)

  • System prompts stored in version control alongside the code that uses them
  • Prompt changes reviewed like code changes: not edited in a dashboard and forgotten
  • Eval baseline re-run and documented after any prompt change

Before vs After

Typical prototype:

# --- pseudocode ---
def ask(question):
    response = llm.chat(
        model="frontier",    # frontier for everything
        messages=[{"role": "user", "content": question}],
        max_tokens=4096,     # generous global default
    )
    return response.text     # raw string, no validation

Production-ready:

# --- pseudocode ---
def ask(question: str, history: list[dict]) -> str:
    valid, reason = validate_input(question)
    if not valid:
        return f"I can't help with that: {reason}"

    history = trim_history(history, max_turns=20)

    response = llm.chat(
        model="fast",        # right-sized for this task
        system=SYSTEM_PROMPT,
        messages=history + [{"role": "user", "content": question}],
        max_tokens=512,      # task-appropriate ceiling
    )

    log_usage(response.usage)           # observability

    text = response.text
    if not is_on_topic(text):           # output guardrail
        return "I can only help with..."

    return text

Seven lines became thirty, but those thirty lines handle the cases that cause production incidents.


Layer 3: Deep Dive

Prioritisation: what to do first

Not everything on the checklist is equally urgent. A practical sequence for a first production deployment:

Must-have before launch (P0)

  • Pinned model version
  • max_tokens set per task
  • Token usage logging
  • API error handling with retry
  • Input length cap
  • Graceful handling of refusals and parse failures
  • At least a minimal eval set in CI

First sprint after launch (P1)

  • Output guardrail
  • Context budget documented and monitored
  • History management strategy implemented
  • Cost alert configured
  • System prompt scoped explicitly

Within 30 days (P2)

  • Red-team exercise
  • Eval set seeded with real traffic samples
  • Latency baselines and alerts
  • Model upgrade process documented
  • Prompt versioning in git

The failure modes that kill AI features

Silent quality degradation: Model output quality drifts after a provider update or as input distribution shifts. No error, no alert. Detected weeks later from user complaints. Fix: eval in CI + shadow scoring in production.

Context cliff: A long-running session or large document suddenly exceeds the context limit. The API errors, or early context is silently truncated and the model starts contradicting itself. Fix: token monitoring + history management.

Cost spike: A new user pattern (long documents, high-volume usage, tool-heavy workflows) drives token usage far above baseline. P&L impact discovered at end of month. Fix: per-request cost logging + spend alert.

Safety bypass in production: A jailbreak circulates on social media; your bot is used to generate harmful content at scale. Fix: output guardrail + guardrail trigger rate alert + kill switch.

Prompt regression after ‘minor’ change: Someone edits the system prompt to fix one issue and breaks three others. No eval, no one noticed until users reported it. Fix: eval in CI, prompt in version control.

The compounding value of the checklist

Each item on the checklist is cheap in isolation. Together they compound: observability catches the cost spike before it matters; the eval catches the prompt regression before it ships; the history strategy prevents the context cliff; the output guardrail catches the safety bypass the model missed.

A production AI system is not a model: it is a model plus the architecture you wrap it in. The model is the smallest part of what you own.

Further reading

✏ Suggest an edit on GitHub

Prototype to Production Checklist: Check your understanding

Q1

You are preparing to launch your first AI feature. Which of the following is a P0: must-have before launch?

Q2

Your AI feature processes customer contracts. Three months after launch, users report the model is contradicting itself mid-conversation. What checklist gap most likely caused this?

Q3

A developer updates the system prompt to fix a tone issue, tests it manually on three examples, and deploys. The next day, classification accuracy drops from 91% to 68%. What process failure caused this?

Q4

You are building a document summarisation pipeline that runs overnight on thousands of documents. Which cost and latency optimisation applies most directly?

Q5

Which statement best captures the relationship between the model and the production AI system that wraps it?