🤖 AI Explained

Production RAG Checklist

A RAG prototype that works on your test documents is not a production system. This capstone synthesises the full RAG track into a checklist: the gaps that consistently cause RAG failures after launch, and the order in which to address them.

Layer 1: Surface

A RAG prototype answers the question: does this approach work? A production RAG system answers a harder set of questions: what happens when the knowledge base gets stale? When a user asks something outside the corpus? When retrieval returns irrelevant chunks? When the index grows to millions of documents? When a document is updated?

The gap between prototype and production is not model quality: it is everything around the retrieval and generation pipeline that makes the system reliable, maintainable, and observable at scale.

The six areas:

Area                What it covers
------------------  --------------------------------------------------------
Indexing pipeline   Ingestion, chunking, embedding, refresh
Query pipeline      Retrieval quality, re-ranking, latency
Augmentation        Prompt grounding, citation, no-answer handling
Evaluation          Retrieval recall, generation faithfulness, CI integration
Observability       Latency, retrieval miss rate, cost, quality monitoring
Maintenance         Knowledge base refresh, model upgrades, access control

Production Gotcha

The most expensive RAG incidents come from stale data. A knowledge base months out of date produces confident, grounded-sounding wrong answers: the model faithfully cites outdated documents. Build refresh pipelines before you need them, and track the last-updated timestamp of every indexed document.


Layer 2: Guided

1. Indexing pipeline

Document ingestion

  • Text extraction handles all content types in your corpus (PDF, HTML, Markdown, plain text, database records)
  • Preprocessing removes noise: page headers/footers, navigation elements, encoding artefacts
  • Document deduplication in place: indexing the same content twice degrades retrieval precision

Chunking

  • Chunk strategy matched to content type: not a single size for all document types (module 2.3)
  • Overlap set to prevent information loss at split boundaries (10–20% of chunk size)
  • Metadata attached to every chunk: source, document ID, timestamp, chunk index
  • Empty and sub-minimum-length chunks filtered out
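A minimal sliding-window chunker illustrating the overlap and minimum-length rules above. The sizes and the `MIN_CHUNK_CHARS` constant are illustrative assumptions, not recommendations:

```python
MIN_CHUNK_CHARS = 50  # assumption: tune per corpus

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 75) -> list[str]:
    """Fixed-size chunks with 15% overlap so facts at split boundaries survive."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
    # Drop empty and sub-minimum-length chunks
    return [c for c in chunks if len(c) >= MIN_CHUNK_CHARS]
```

In a real pipeline each chunk would also carry its metadata (source, document ID, timestamp, chunk index) rather than being a bare string.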

Embedding and storage

  • Same embedding model used for indexing and querying: documented and enforced
  • Embedding model version pinned (same production gotcha as LLM model aliases)
  • Batch embedding enabled: not one API call per chunk
  • Content hash stored per chunk to detect changes without full re-index

Refresh

  • Change detection in place: new documents auto-indexed, modified documents re-embedded, deleted documents removed
  • Full re-index process documented and tested: able to run without downtime
  • Last-indexed timestamp per document tracked and alerted on when stale

2. Query pipeline

Retrieval

  • Dense retrieval baseline established and measured (recall@5 on eval set)
  • Hybrid search (dense + sparse) enabled for production: dense-only misses exact matches (module 2.4)
  • Relevance threshold set: chunks below the threshold excluded, not passed to the model
  • Metadata filtering applied before similarity search where access control or date scoping is required

Re-ranking

  • Re-ranking considered if recall@20 is good but recall@5 is poor
  • Re-ranking limited to top-20–50 candidates: not applied to the full corpus
  • Latency impact of re-ranker measured at p95 before enabling in production

Scale

  • ANN index configured for target corpus size and latency requirements
  • Vector database connection pooling and error handling in place
  • Query timeout set: retrieval that takes >5 seconds should fail fast rather than block

3. Augmentation / prompting

Grounding

  • Grounding instruction in system prompt: not the user message (module 2.5)
  • No-answer instruction explicit: exact phrase the model should use when answer is absent
  • Chunks numbered and source-labelled in context block
  • Most relevant chunk placed first in context block
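A minimal sketch of a prompt builder that follows these rules. The helper name, tag choice, and exact wording are illustrative assumptions; `chunks` is assumed to arrive sorted by relevance:

```python
NO_ANSWER = "I don't have information about that in the knowledge base."

def build_rag_prompt(question: str, chunks: list[dict]) -> tuple[str, str]:
    """Return (system_prompt, user_message) with numbered, source-labelled chunks."""
    # Grounding and no-answer instructions live in the system prompt
    system = (
        "Answer only from the provided documents. "
        f'If the answer is not in them, say exactly: "{NO_ANSWER}"'
    )
    # Chunks numbered and source-labelled; most relevant first
    context = "\n\n".join(
        f"[{i}] (source: {c['source']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    user_msg = f"<documents>\n{context}\n</documents>\n\nQuestion: {question}"
    return system, user_msg
```

The numbered labels also make citation possible: the model can reference [2] and the UI can resolve it back to a source.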

No-answer handling

  • Empty retrieval result handled before reaching the model: returns a useful message directly
  • Low-relevance retrieval result handled (relevance threshold check, not just count check)
  • Model’s “I don’t know” response handled gracefully in the UI: not shown as an error

Safety

  • Untrusted document content delimited with explicit tags (e.g. <document>) and model instructed to treat as data, not instructions (module 1.8)
  • Output guardrail in place for out-of-scope or harmful responses (module 1.8)
  • User input length capped before reaching the retrieval or generation pipeline

4. Evaluation

Retrieval eval

  • Ground-truth eval set exists with ≥ 50 (query, relevant_chunk_ids) pairs
  • Eval set includes ~10–15% unanswerable queries (no relevant chunk exists)
  • Recall@5 measured and above threshold before any indexing pipeline change is promoted
  • Retrieval eval runs in CI without model calls: fast and cheap
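Recall@k is cheap enough to run on every CI build because it needs only ranked chunk IDs, no model calls. A sketch (scoring unanswerable queries as 1.0 when nothing wrong is retrieved is one convention, not the only one):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of relevant chunk IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 1.0  # unanswerable query: nothing to miss
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mean_recall_at_k(eval_set, retrieve, k: int = 5) -> float:
    """eval_set: (query, relevant_chunk_ids) pairs; retrieve(query) -> ranked IDs."""
    scores = [recall_at_k(retrieve(q), set(rel), k) for q, rel in eval_set]
    return sum(scores) / len(scores)
```

Gate indexing pipeline changes on this number staying above your threshold.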

Generation eval

  • Faithfulness and answer relevancy measured on a sample of the eval set
  • LLM-as-judge uses a different or stronger model than the one being evaluated
  • RAGAS or equivalent automated scoring in place for regression detection

Eval set maintenance

  • Eval set seeded with real traffic queries within first week of launch
  • New failure modes added to eval set within 24 hours of discovery: don’t let regressions recur

5. Observability

Retrieval metrics

  • Retrieval latency (p50, p95) logged per request
  • Retrieval miss rate logged: fraction of queries where no chunk passes the relevance threshold
  • Chunk count retrieved and chunk count used logged separately (detects over-retrieval)
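Computing the miss rate is straightforward if each request log records the relevance scores of the retrieved chunks (the log shape here is an assumption):

```python
def retrieval_miss_rate(request_logs: list[dict], threshold: float) -> float:
    """Fraction of requests where no retrieved chunk passed the relevance threshold."""
    misses = sum(
        1 for r in request_logs
        if all(score < threshold for score in r["scores"])  # empty counts as a miss
    )
    return misses / len(request_logs)
```

A sustained spike in this number is the earliest signal of a knowledge base gap or a query distribution shift.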

Generation metrics

  • Generation latency logged per request
  • Token usage (input and output) logged per request for cost tracking
  • Per-request cost calculated: embedding cost + retrieval compute + generation tokens

Alerting

  • Alert on retrieval miss rate spike: indicates knowledge base gap or query distribution shift
  • Alert on generation latency p95 > baseline × 2
  • Alert on cost per request exceeding budget threshold
  • Alert when any document in the corpus hasn’t been refreshed in [staleness threshold]

6. Maintenance

Knowledge base

  • Document ownership defined: who is responsible for each source staying current
  • Refresh cadence set per source (daily for news feeds, weekly for policy docs, on-merge for code)
  • Deletion propagation tested: removing a document from source removes it from the index

Model versions

  • Embedding model version pinned in production: model upgrade requires re-indexing the full corpus
  • Generation model version pinned, with a documented upgrade process: run the eval set against the new version, canary deploy, then full rollout
  • Model deprecation notices subscribed to for both embedding and generation providers

Access control

  • Per-user or per-tenant document access enforced at retrieval time (metadata filter by user/tenant)
  • Retrieval cannot surface documents the requesting user is not permitted to read
  • Access control changes propagate to the index: revoking a user’s access removes chunks from their result set

Before vs After

Prototype:

# --- pseudocode ---
def ask(question: str) -> str:
    chunks = vector_db.search(embed(question), top_k=5)
    context = "\n".join(c["text"] for c in chunks)
    return llm.chat(
        model="frontier",   # frontier for everything
        messages=[{"role": "user", "content": f"{context}\n\n{question}"}],
        max_tokens=1024,
    ).text

Production:

# --- pseudocode ---
def ask(question: str, user_id: str) -> str:
    # Input validation
    if len(question) > MAX_INPUT_CHARS:
        return "Question too long."

    # Retrieve: hybrid + access-scoped + relevance-filtered
    candidates = hybrid_search(question, top_k=20, filters={"user_id": user_id})
    chunks = [c for c in candidates if c["score"] > RELEVANCE_THRESHOLD]

    if not chunks:
        return "I don't have information about that in the knowledge base."

    chunks = rerank(question, chunks, top_k=5)

    # Augment with grounding prompt
    system, user_msg = build_rag_prompt(question, chunks)

    # Generate with right-sized model
    response = llm.chat(
        model="balanced",
        system=system,
        messages=[{"role": "user", "content": user_msg}],
        max_tokens=512,
    )

    # Observability
    log_rag_request(question, chunks, response.text, response.usage, user_id)

    return response.text

Layer 3: Deep Dive

Prioritisation: what to do first

Not everything ships on day one. A practical sequence:

P0: must-have before launch

  • Grounding instruction and no-answer path in production
  • Relevance threshold: never send empty or low-quality chunks to the model
  • User input length cap
  • Retrieval latency logging
  • Refresh mechanism for at least the highest-priority documents

P1: first sprint after launch

  • Hybrid search (if dense-only at launch)
  • Ground-truth eval set seeded with first real traffic queries
  • Retrieval miss rate alert
  • Cost per request logging and alert

P2: within 30 days

  • Re-ranker (if eval data shows recall@20 >> recall@5)
  • Faithfulness scoring on a production traffic sample
  • Access control at retrieval time (if multi-tenant)
  • Full document refresh cadence per source documented and automated

The four failure modes that kill RAG features

Retrieval miss: The right chunk isn’t retrieved. Users get confident wrong answers (if the model fills the gap) or “I don’t know” responses (if the grounding instruction holds). Fix: improve hybrid search, tune chunk size, add multi-query.

Context confusion: The model receives the right chunks but produces a wrong answer. Usually a prompting issue (no grounding instruction, instruction placement, conflicting chunks). Fix: module 2.5.

Stale knowledge: Retrieved chunks are correct per the index but the underlying source has been updated. The model faithfully cites outdated information. Fix: document refresh pipeline with staleness monitoring.

Knowledge gap: The user’s question isn’t covered by any document in the corpus. Without an explicit no-answer path, the model hallucinates. Fix: track retrieval miss rate; identify high-miss query categories and fill the corpus gap.

Cost model for a RAG request

Every RAG request has three cost components:

Embedding cost  = (query_tokens / 1_000_000) × embedding_price_per_MTok
                ≈ $0.0000004 per query with text-embedding-3-small ($0.02/MTok) at 20 tokens/query

Retrieval cost  = vector_db_query_cost (varies by provider and index size)
                ≈ $0 (self-hosted Chroma/pgvector) to ~$0.001/query (managed services)

Generation cost = ((system_tokens + context_tokens + query_tokens + output_tokens) / 1M)
                  × model_price_per_MTok
                ≈ with 3 chunks × 300 tokens + 500-token system prompt + 200-token output:
                  ~1,400 input + 200 output tokens per request

At 1M requests/month with a balanced model (~$3/MTok input, $15/MTok output):
  Embedding:  ~$0.40/month
  Generation: ~$4,200/month (input) + $3,000/month (output)
  Total:      ~$7,200/month

The generation cost dominates. Reducing retrieved chunk count from 5 to 3, or right-sizing to a fast model for simple queries, has a much larger impact than optimising embedding costs.
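The arithmetic above can be checked directly, with the prices and token counts copied from the worked example:

```python
REQUESTS = 1_000_000                                  # requests per month
IN_TOK, OUT_TOK, EMBED_TOK = 1_400, 200, 20           # tokens per request
IN_PRICE, OUT_PRICE, EMBED_PRICE = 3.0, 15.0, 0.02    # $/MTok

gen_input  = REQUESTS * IN_TOK / 1e6 * IN_PRICE       # input tokens
gen_output = REQUESTS * OUT_TOK / 1e6 * OUT_PRICE     # output tokens
embedding  = REQUESTS * EMBED_TOK / 1e6 * EMBED_PRICE # query embeddings
total = gen_input + gen_output + embedding            # ~$7,200/month
```

Swapping the same arithmetic to 3 chunks instead of 5, or to a cheaper model tier for simple queries, shows immediately where the leverage is: almost entirely in `gen_input` and `gen_output`.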

Multitenancy considerations

In a multi-tenant RAG system (different users or organisations with isolated data), access control at retrieval time is non-negotiable:

  • Per-tenant indexes: each tenant has a separate collection. Strong isolation, higher operational cost.
  • Shared index with metadata filtering: all tenants in one index, filtered by tenant_id at query time. Lower cost, but requires trust in the filter implementation.
  • Row-level security in pgvector: Postgres RLS enforces access control at the database level. Strong guarantees without application-layer filtering.

Never rely on the model to enforce access control (“tell users only about their own data”). A jailbreak or confused retrieval bypasses this entirely. Enforce isolation at the retrieval layer.
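A toy in-memory illustration of the shared-index pattern. The point is the ordering: the tenant filter runs before scoring, so out-of-tenant chunks never enter the candidate set. Term-overlap scoring here is a stand-in for vector similarity:

```python
def tenant_search(chunks: list[dict], query_terms: set[str],
                  tenant_id: str, top_k: int = 5) -> list[dict]:
    """Search restricted to one tenant's documents."""
    # Filter BEFORE scoring: other tenants' chunks never become candidates
    candidates = [c for c in chunks if c["tenant_id"] == tenant_id]
    scored = sorted(
        candidates,
        key=lambda c: len(query_terms & set(c["text"].lower().split())),
        reverse=True,
    )
    return scored[:top_k]
```

In a real vector database the same rule applies: pass the tenant filter into the query itself, never post-filter results in application code.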


Production RAG Checklist: Check your understanding

Q1

Six months after launch, users report your RAG system gives confident but wrong answers about your product's pricing. The pricing page was updated three months ago. Which production failure mode does this represent?

Q2

You are launching a RAG system for a multi-tenant SaaS product where each company's documents should only be accessible to their own users. How should you enforce this isolation?

Q3

What is the correct approach when you want to upgrade the embedding model in a production RAG system?

Q4

Your RAG system's generation cost dominates the monthly bill. Which optimisation has the largest expected impact?

Q5

Which of the following is a P0: must-have before launching a production RAG system?