Production RAG Checklist: AI Explained

Layer 1: Surface

A RAG prototype answers the question: does this approach work? A production RAG system answers a harder set of questions: what happens when the knowledge base gets stale? When a user asks something outside the corpus? When retrieval returns irrelevant chunks? When the index grows to millions of documents? When a document is updated?

The gap between prototype and production is not model quality: it is everything around the retrieval and generation pipeline that makes the system reliable, maintainable, and observable at scale.

The six areas:

Area	What it covers
Indexing pipeline	Ingestion, chunking, embedding, refresh
Query pipeline	Retrieval quality, re-ranking, latency
Augmentation	Prompt grounding, citation, no-answer handling
Evaluation	Retrieval recall, generation faithfulness, CI integration
Observability	Latency, retrieval miss rate, cost, quality monitoring
Maintenance	Knowledge base refresh, model upgrades, access control

Production Gotcha

Common Gotcha: The most expensive RAG incidents come from stale data. A knowledge base months out of date produces confident, grounded-sounding wrong answers: the model faithfully cites outdated documents. Build refresh pipelines before you need them. Track the last-updated timestamp of every indexed document.

Layer 2: Guided

1. Indexing pipeline

Document ingestion

Text extraction handles all content types in your corpus (PDF, HTML, Markdown, plain text, database records)
Preprocessing removes noise: page headers/footers, navigation elements, encoding artefacts
Document deduplication in place: indexing the same content twice degrades retrieval precision

Chunking

Chunk strategy matched to content type: not a single size for all document types (module 2.3)
Overlap set to prevent information loss at split boundaries (10–20% of chunk size)
Metadata attached to every chunk: source, document ID, timestamp, chunk index
Empty and sub-minimum-length chunks filtered out

Embedding and storage

Same embedding model used for indexing and querying: documented and enforced
Embedding model version pinned (same production gotcha as LLM model aliases)
Batch embedding enabled: not one API call per chunk
Content hash stored per chunk to detect changes without full re-index

Refresh

Change detection in place: new documents auto-indexed, modified documents re-embedded, deleted documents removed
Full re-index process documented and tested: able to run without downtime
Last-indexed timestamp per document tracked and alerted on when stale

2. Query pipeline

Retrieval

Dense retrieval baseline established and measured (recall@5 on eval set)
Hybrid search (dense + sparse) enabled for production: dense-only misses exact matches (module 2.4)
Relevance threshold set: chunks below the threshold excluded, not passed to the model
Metadata filtering applied before similarity search where access control or date scoping is required

Re-ranking

Re-ranking considered if recall@20 is good but recall@5 is poor
Re-ranking limited to top-20–50 candidates: not applied to the full corpus
Latency impact of re-ranker measured at p95 before enabling in production

Scale

ANN index configured for target corpus size and latency requirements
Vector database connection pooling and error handling in place
Query timeout set: retrieval that takes >5 seconds should fail fast rather than block

3. Augmentation / prompting

Grounding

Grounding instruction in system prompt: not the user message (module 2.5)
No-answer instruction explicit: exact phrase the model should use when answer is absent
Chunks numbered and source-labelled in context block
Most relevant chunk placed first in context block

No-answer handling

Empty retrieval result handled before reaching the model: returns a useful message directly
Low-relevance retrieval result handled (relevance threshold check, not just count check)
Model’s “I don’t know” response handled gracefully in the UI: not shown as an error

Safety

Untrusted document content delimited with explicit tags (e.g. <document>) and model instructed to treat as data, not instructions (module 1.8)
Output guardrail in place for out-of-scope or harmful responses (module 1.8)
User input length capped before reaching the retrieval or generation pipeline

4. Evaluation

Retrieval eval

Ground-truth eval set exists with ≥ 50 (query, relevant_chunk_ids) pairs
Eval set includes ~10–15% unanswerable queries (no relevant chunk exists)
Recall@5 measured and above threshold before any indexing pipeline change is promoted
Retrieval eval runs in CI without model calls: fast and cheap

Generation eval

Faithfulness and answer relevancy measured on a sample of the eval set
LLM-as-judge uses a different or stronger model than the one being evaluated
RAGAS or equivalent automated scoring in place for regression detection

Eval set maintenance

Eval set seeded with real traffic queries within first week of launch
New failure modes added to eval set within 24 hours of discovery: don’t let regressions recur

5. Observability

Retrieval metrics

Retrieval latency (p50, p95) logged per request
Retrieval miss rate logged: fraction of queries where no chunk passes the relevance threshold
Chunk count retrieved and chunk count used logged separately (detects over-retrieval)

Generation metrics

Generation latency logged per request
Token usage (input and output) logged per request for cost tracking
Per-request cost calculated: embedding cost + retrieval compute + generation tokens

Alerting

Alert on retrieval miss rate spike: indicates knowledge base gap or query distribution shift
Alert on generation latency p95 > baseline × 2
Alert on cost per request exceeding budget threshold
Alert when any document in the corpus hasn’t been refreshed in [staleness threshold]

6. Maintenance

Knowledge base

Document ownership defined: who is responsible for each source staying current
Refresh cadence set per source (daily for news feeds, weekly for policy docs, on-merge for code)
Deletion propagation tested: removing a document from source removes it from the index

Model versions

Embedding model version pinned in production: model upgrade requires re-indexing the full corpus
Generation model version pinned: upgrade process: run eval set against new version, canary deploy, full rollout
Model deprecation notices subscribed to for both embedding and generation providers

Access control

Per-user or per-tenant document access enforced at retrieval time (metadata filter by user/tenant)
Retrieval cannot surface documents the requesting user is not permitted to read
Access control changes propagate to the index: revoking a user’s access removes chunks from their result set

Before vs After

Prototype:

# --- pseudocode ---
def ask(question: str) -> str:
    chunks = vector_db.search(embed(question), top_k=5)
    context = "\n".join(c["text"] for c in chunks)
    return llm.chat(
        model="frontier",   # frontier for everything
        messages=[{"role": "user", "content": f"{context}\n\n{question}"}],
        max_tokens=1024,
    ).text

Production:

# --- pseudocode ---
def ask(question: str, user_id: str) -> str:
    # Input validation
    if len(question) > MAX_INPUT_CHARS:
        return "Question too long."

    # Retrieve: hybrid + access-scoped + relevance-filtered
    candidates = hybrid_search(question, top_k=20, filters={"user_id": user_id})
    chunks = [c for c in candidates if c["score"] > RELEVANCE_THRESHOLD]

    if not chunks:
        return "I don't have information about that in the knowledge base."

    chunks = rerank(question, chunks, top_k=5)

    # Augment with grounding prompt
    system, user_msg = build_rag_prompt(question, chunks)

    # Generate with right-sized model
    response = llm.chat(
        model="balanced",
        system=system,
        messages=[{"role": "user", "content": user_msg}],
        max_tokens=512,
    )

    # Observability
    log_rag_request(question, chunks, response.text, response.usage, user_id)

    return response.text

Layer 3: Deep Dive

Prioritisation: what to do first

Not everything ships on day one. A practical sequence:

P0: must-have before launch

Grounding instruction and no-answer path in production
Relevance threshold: never send empty or low-quality chunks to the model
User input length cap
Retrieval latency logging
Refresh mechanism for at least the highest-priority documents

P1: first sprint after launch

Hybrid search (if dense-only at launch)
Ground-truth eval set seeded with first real traffic queries
Retrieval miss rate alert
Cost per request logging and alert

P2: within 30 days

Re-ranker (if eval data shows recall@20 >> recall@5)
Faithfulness scoring on a production traffic sample
Access control at retrieval time (if multi-tenant)
Full document refresh cadence per source documented and automated

The four failure modes that kill RAG features

Retrieval miss: The right chunk isn’t retrieved. Users get confident wrong answers (if the model fills the gap) or “I don’t know” responses (if the grounding instruction holds). Fix: improve hybrid search, tune chunk size, add multi-query.

Context confusion: The model receives the right chunks but produces a wrong answer. Usually a prompting issue (no grounding instruction, instruction placement, conflicting chunks). Fix: module 2.5.

Stale knowledge: Retrieved chunks are correct per the index but the underlying source has been updated. The model faithfully cites outdated information. Fix: document refresh pipeline with staleness monitoring.

Knowledge gap: The user’s question isn’t covered by any document in the corpus. Without an explicit no-answer path, the model hallucinates. Fix: track retrieval miss rate; identify high-miss query categories and fill the corpus gap.

Cost model for a RAG request

Every RAG request has three cost components:

Embedding cost  = (query_tokens / 1_000_000) × embedding_price_per_MTok
                ≈ $0.0000004 per query with text-embedding-3-small ($0.02/MTok) at 20 tokens/query

Retrieval cost  = vector_db_query_cost (varies by provider and index size)
                ≈ $0 (self-hosted Chroma/pgvector) to ~$0.001/query (managed services)

Generation cost = ((system_tokens + context_tokens + query_tokens + output_tokens) / 1M)
                  × model_price_per_MTok
                ≈ with 3 chunks × 300 tokens + 500-token system prompt + 200-token output:
                  ~1,400 input + 200 output tokens per request

At 1M requests/month with a balanced model (~$3/MTok input, $15/MTok output):
  Embedding:  ~$0.40/month
  Generation: ~$4,200/month (input) + $3,000/month (output)
  Total:      ~$7,200/month

The generation cost dominates. Reducing retrieved chunk count from 5 to 3, or right-sizing to a fast model for simple queries, has a much larger impact than optimising embedding costs.

Multitenancy considerations

In a multi-tenant RAG system (different users or organisations with isolated data), access control at retrieval time is non-negotiable:

Per-tenant indexes: each tenant has a separate collection: strong isolation, higher operational cost
Shared index with metadata filtering: all tenants in one index, filtered by tenant_id at query time: lower cost, requires trust in filter implementation
Row-level security in pgvector: leverage Postgres RLS to enforce access control at the database level: strong guarantees without application-layer filtering

Never rely on the model to enforce access control (“tell users only about their own data”). A jailbreak or confused retrieval bypasses this entirely. Enforce isolation at the retrieval layer.

Production RAG Checklist