Layer 1: Surface
Basic RAG (embed the query, find similar chunks, pass them to the model) fails in predictable ways:
| Failure mode | Symptom | Pattern that fixes it |
|---|---|---|
| Query is vague or phrased differently from the documents | Low recall: right document isn't in top-k | Multi-query retrieval |
| Query is abstract; documents are concrete | Low recall: semantic gap between question and answer | HyDE (Hypothetical Document Embeddings) |
| Right chunk is retrieved but lacks surrounding context | Model says "I don't have enough information" despite correct retrieval | Small-to-big retrieval |
| Chunk embeddings look similar regardless of topic | Low precision: chunks from unrelated topics are retrieved | Contextual retrieval |
| User references earlier conversation turns | Wrong chunks: retrieval ignores conversation context | Conversational RAG |
The principle: diagnose the failure mode first, then apply the pattern. Adding complexity without a measured problem to solve costs latency without a quality payoff.
Production Gotcha
Advanced retrieval patterns multiply latency. Multi-query retrieval makes N embedding calls and N database searches per user query. HyDE adds a full generation round-trip before retrieval. Always measure p50/p95 latency impact before enabling a pattern: start with basic RAG, measure where it fails, then reach for the appropriate fix.
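One way to get those p50/p95 numbers is a tiny timing harness. The sketch below is a hypothetical helper (stdlib only, not part of any retrieval library) that runs a retrieval callable over sample queries and reports percentile latencies in milliseconds:

```python
import statistics
import time


def measure_latency(retrieve_fn, sample_queries, percentiles=(50, 95)):
    """Time retrieve_fn over sample queries; return latency percentiles in ms."""
    timings_ms = []
    for query in sample_queries:
        start = time.perf_counter()
        retrieve_fn(query)
        timings_ms.append((time.perf_counter() - start) * 1000)
    # quantiles with n=100 yields 99 cut points; cut point p-1 is the p-th percentile
    cuts = statistics.quantiles(timings_ms, n=100)
    return {f"p{p}": cuts[p - 1] for p in percentiles}
```

Run it once with basic RAG as the baseline, then again with each candidate pattern enabled, and compare the p95 delta against your latency budget.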
Layer 2: Guided
Multi-query retrieval
When a single query doesn't capture all relevant phrasings, generate multiple variations and take the union of results:
# --- pseudocode ---
def multi_query_retrieve(query: str, top_k: int = 5) -> list[dict]:
    # Step 1: Generate query variations with a fast, cheap model
    variations_response = llm.chat(
        model="fast",
        system=(
            "Generate 3 alternative search queries for the same information need. "
            "Use different words and phrasings. Return one query per line, no numbering or bullets."
        ),
        messages=[{"role": "user", "content": query}],
        max_tokens=120,
    )
    queries = [query] + [q.strip() for q in variations_response.text.strip().split("\n") if q.strip()]

    # Step 2: Retrieve for each query, deduplicate by chunk ID
    seen_ids: set[str] = set()
    all_chunks: list[dict] = []
    for q in queries:
        results = hybrid_search(q, top_k=top_k)
        for chunk in results:
            if chunk["id"] not in seen_ids:
                seen_ids.add(chunk["id"])
                all_chunks.append(chunk)

    # Step 3: Return a wider pool for re-ranking (module 2.4)
    return all_chunks
When to use: queries that are ambiguous, highly specific (users know what they want but not the exact words the document uses), or that could be answered by content with different surface phrasing.
Latency cost: 1 extra LLM call to generate N query variations, then N embedding calls and N vector searches (all of which can run in parallel). Use a fast model for variation generation; cache common query expansions if traffic patterns repeat.
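The per-variation searches are independent, so they can run concurrently. A minimal sketch of that parallelisation, assuming a `search_fn(query, top_k)` callable in the shape of the `hybrid_search` used above (both names are placeholders, not a real API):

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_multi_query(queries, search_fn, top_k=5):
    """Run one search per query variation concurrently; deduplicate by chunk ID."""
    with ThreadPoolExecutor(max_workers=len(queries) or 1) as pool:
        result_lists = list(pool.map(lambda q: search_fn(q, top_k), queries))
    seen, merged = set(), []
    for results in result_lists:
        for chunk in results:
            if chunk["id"] not in seen:
                seen.add(chunk["id"])
                merged.append(chunk)
    return merged
```

With N=3 variations this brings the search phase back to roughly the latency of a single search (plus the one LLM call for variation generation, which cannot be parallelised away).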
HyDE: Hypothetical Document Embeddings
Standard RAG embeds the question. But questions and their answers often look different in embedding space: "How do I configure the firewall?" sits far from "Set FIREWALL_POLICY=strict in /etc/config."
HyDE generates a plausible answer first, then embeds that and uses it to retrieve similar documents:
# --- pseudocode ---
def hyde_retrieve(query: str, top_k: int = 5) -> list[dict]:
    # Step 1: Generate a hypothetical answer; it doesn't need to be correct
    hypothesis = llm.chat(
        model="fast",
        system=(
            "Write a plausible passage that would answer the question, "
            "as if it were from a technical document. "
            "It doesn't need to be accurate; it will be used only to find similar real documents."
        ),
        messages=[{"role": "user", "content": query}],
        max_tokens=200,
    ).text

    # Step 2: Embed the hypothesis, not the query
    hypothesis_vector = embedding_model.embed(hypothesis)

    # Step 3: Retrieve documents similar to the hypothesis
    return vector_db.search(vector=hypothesis_vector, top_k=top_k)
Why it works: the hypothesis is in "answer space": the same embedding neighbourhood as real documents that contain the answer. Embedding the question puts you in "question space," which may be distant from "answer space" for factual lookups.
When to use: abstract or high-level queries against technical or procedural documentation. Less useful when queries and documents share similar vocabulary.
Latency cost: one extra LLM call (hypothesis generation) before any retrieval. Keep max_tokens low: a 100-token hypothesis is usually sufficient.
Small-to-big retrieval (parent-child chunks)
Index at a fine granularity (sentence or short paragraph) for precise retrieval, but expand to a larger unit (paragraph or section) before passing to the model:
# Index structure: two levels
#   "child" chunks: small, precise retrieval units (150-200 chars)
#   "parent" chunks: larger context units (800-1000 chars), each child points to a parent

def index_with_parents(sections: list[dict]) -> None:
    for section in sections:
        parent_id = section["id"]
        # Store the full parent
        parent_store[parent_id] = section["text"]
        # Split into child chunks and point each to the parent
        children = fixed_chunks(section["text"], chunk_size=200, overlap=20)
        for i, child_text in enumerate(children):
            vector_db.store(
                id=f"{parent_id}-child-{i}",
                vector=embedding_model.embed(child_text),
                text=child_text,
                metadata={"parent_id": parent_id, "source": section["source"]},
            )

def small_to_big_retrieve(query: str, top_k: int = 5) -> list[str]:
    # Retrieve at child granularity (precise match)
    child_results = vector_db.search(vector=embedding_model.embed(query), top_k=top_k * 2)
    # Expand to parent context (sufficient context for the model)
    seen_parents: set[str] = set()
    parent_texts: list[str] = []
    for child in child_results:
        parent_id = child["metadata"]["parent_id"]
        if parent_id not in seen_parents:
            seen_parents.add(parent_id)
            parent_texts.append(parent_store[parent_id])
    return parent_texts[:top_k]
When to use: when retrieval recall is good (right child chunk is found) but answer quality is poor (model lacks surrounding context). The child chunk is precise enough to find; the parent chunk has enough context for a complete answer.
Contextual retrieval
A technique for improving chunk embeddings by prepending a document-aware summary to each chunk before embedding it. Without context, a chunk like "The maximum value is 100." is ambiguous: 100 of what, in what system, compared to what baseline?
# --- pseudocode ---
def contextualise_chunk(full_document: str, chunk: str) -> str:
    """
    Generate a one-sentence context for the chunk based on the full document.
    Prepend it to the chunk text before embedding.
    """
    context = llm.chat(
        model="fast",
        system=(
            "Given the full document and a specific excerpt, write one concise sentence "
            "that situates the excerpt within the document: what section it's from, "
            "what concept it relates to. Be specific. Do not repeat the excerpt."
        ),
        messages=[{"role": "user", "content":
            f"Full document:\n{full_document[:3000]}\n\nExcerpt:\n{chunk}"
        }],
        max_tokens=60,
    ).text.strip()
    return f"{context}\n\n{chunk}"

def index_with_context(documents: list[dict]) -> None:
    for doc in documents:
        for chunk in chunk_document(doc["text"], doc["id"], doc["source"]):
            contextualised = contextualise_chunk(doc["text"], chunk["text"])
            chunk["embedding"] = embedding_model.embed(contextualised)
            # Store original text (not the contextualised version) for the model to read
            vector_db.store(**chunk)
Key insight: embed the contextualised version (better retrieval), but pass the original chunk text to the model (cleaner, no repetition). The context improves the embedding without cluttering the answer.
When to use: large document collections where many chunks share generic language. Particularly effective for technical manuals, legal documents, and long-form reports where section context is critical.
Cost: one LLM call per chunk at index time. For a 10,000-chunk corpus, this is 10,000 calls. Run with a fast model; this is an indexing cost, not a per-query cost.
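Because it is a pure index-time cost, a content-hash cache makes re-indexing cheap: chunks whose text and surrounding document haven't changed skip the LLM call entirely. A sketch, assuming a `contextualise_fn` in the shape of `contextualise_chunk` above; the on-disk cache layout is illustrative, not prescribed:

```python
import hashlib
import json
from pathlib import Path


def cached_contextualise(full_document, chunk, contextualise_fn, cache_dir="ctx_cache"):
    """Cache context sentences by content hash so unchanged chunks skip the LLM call."""
    # Hash the same inputs the LLM sees: the document prefix and the chunk text
    key = hashlib.sha256((full_document[:3000] + "\x00" + chunk).encode()).hexdigest()
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())["text"]
    result = contextualise_fn(full_document, chunk)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"text": result}))
    return result
```

On a 10,000-chunk corpus where only a handful of documents changed since the last indexing run, this reduces the re-index cost from 10,000 calls to just the calls for the changed chunks.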
Conversational RAG
In a multi-turn conversation, a follow-up like "What about for enterprise customers?" only makes sense in the context of the previous exchange. Retrieval needs to incorporate the conversation, not just the latest message:
# --- pseudocode ---
def conversational_retrieve(
    history: list[dict],
    top_k: int = 5,
) -> list[dict]:
    # Condense history into a standalone retrieval query
    condensed = llm.chat(
        model="fast",
        system=(
            "Given the conversation history, rewrite the user's last question "
            "as a complete, standalone search query that captures the full context. "
            "Return only the rewritten query."
        ),
        messages=history,
        max_tokens=80,
    ).text.strip()
    return hybrid_search(condensed, top_k=top_k)
This avoids the failure mode where a bare "What about enterprise?" retrieves nothing useful because it lacks the topic established two turns earlier.
Common mistakes
- Adding patterns without measuring the failure: Multi-query retrieval adds latency even when your recall is already 0.92. Measure recall@k first; only add patterns when it's below your threshold.
- HyDE hypothesis too long: A 500-token hypothesis drifts from the original intent and retrieves tangentially related documents. Keep hypotheses to 100-200 tokens.
- Small-to-big without deduplication: Multiple children pointing to the same parent will cause the parent to be retrieved multiple times. Deduplicate by parent ID before expanding.
- Contextual retrieval at query time: Contextualising chunks is an indexing cost, not a query cost. Running it at query time for every retrieved chunk multiplies latency by the number of chunks.
- Conversational RAG with full history: Condensing a 20-turn history into a retrieval query produces poor results. Summarise history periodically (module 1.6) or use only the last 3-5 turns for condensation.
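The last point, trimming before condensation, can be handled with a small helper that keeps any system message plus only the most recent turns. A sketch; the `max_turns` default and the `{"role": ..., "content": ...}` message shape are assumptions:

```python
def trim_history(history, max_turns=4):
    """Keep system messages plus the last max_turns user/assistant exchanges."""
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    # Each turn is one user message plus one assistant message
    return system + rest[-max_turns * 2:]
```

Pass `trim_history(history)` rather than the full history into the condensation call; the condensed query stays anchored to the current topic instead of averaging over the whole conversation.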
Layer 3: Deep Dive
Self-RAG and corrective RAG
Self-RAG (Asai et al., 2023) trains the model to decide when to retrieve (not every query needs retrieval), critique its own outputs for faithfulness, and reflect on whether retrieved passages are relevant. Rather than always retrieving, it retrieves selectively and validates the output.
Corrective RAG (Yan et al., 2024) adds a retrieval evaluator: after initial retrieval, a lightweight model scores each chunk for relevance. Irrelevant chunks are discarded; if no relevant chunks remain, the system falls back to web search or broader retrieval. The generation model only sees validated chunks.
Both require additional components and increase latency significantly. They're most relevant when:
- Retrieval recall is consistently poor and canât be improved by better chunking or hybrid search
- High-stakes outputs where the cost of hallucination outweighs additional latency
For most production systems, strong hybrid search + a re-ranker + a faithfulness guardrail on outputs gets you 90% of the benefit at a fraction of the complexity.
Choosing a pattern based on your eval data
Rather than adding patterns speculatively, use your evaluation data from module 2.6:
| Observation | Diagnosis | Pattern |
|---|---|---|
| Recall@5 is low (< 0.75) | Query isn't matching the right documents | Multi-query + HyDE |
| Recall@10 is high but Recall@5 is low | Right document retrieved but ranked poorly | Re-ranker (module 2.4) |
| Faithfulness is low despite good recall | Model not using retrieved context | Prompting fix (module 2.5) |
| Context precision is low | Too many irrelevant chunks passed | Reduce top-k; add relevance threshold |
| Good eval scores but poor answers on specific queries | Chunking splits the answer across chunks | Small-to-big or larger chunk size |
The eval data from module 2.6 is the diagnostic that tells you which pattern to reach for. Without it, you're adding complexity to a system you haven't measured.
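Recall@k itself is straightforward to compute from an eval set of retrieved-ID lists and gold relevant IDs. A minimal sketch (your eval harness from module 2.6 may already provide this; the input shapes here are assumptions):

```python
def recall_at_k(retrieved_lists, relevant_sets, k=5):
    """Fraction of eval queries whose relevant chunk appears in the top-k results.

    retrieved_lists: one ranked list of chunk IDs per eval query.
    relevant_sets: one set of gold relevant chunk IDs per eval query.
    """
    hits = 0
    for retrieved, relevant in zip(retrieved_lists, relevant_sets):
        if any(chunk_id in relevant for chunk_id in retrieved[:k]):
            hits += 1
    return hits / len(retrieved_lists)
```

Computing it at both k=5 and k=10 gives you the second row of the table for free: a large gap between the two points at a ranking problem, not a retrieval problem.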
Latency budgets
Each pattern has a latency cost. Here's an approximate ordering for an interactive application (target: under 2 seconds total):
| Pattern | Added latency | Mitigation |
|---|---|---|
| Basic RAG | Baseline: embed + search (~50-200ms) | Cache common embeddings |
| + Hybrid search | +BM25 search time (~10-50ms) | Usually acceptable |
| + Re-ranking | +50-200ms (cross-encoder inference) | Only re-rank top-20, not top-100 |
| + Multi-query (N=3) | +3× embedding + search time | Parallelise; cache variations |
| + HyDE | +300-1000ms (generation round-trip) | Use fast model; cache by query |
| + Contextual retrieval | Indexing cost only: no query latency | Run offline |
Contextual retrieval is unique: all the cost is at index time. It's the best latency/quality tradeoff among the advanced patterns.
Further reading
- Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE); Gao et al., 2022. The paper introducing Hypothetical Document Embeddings; explains why answer-space embeddings outperform question-space embeddings for dense retrieval.
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection; Asai et al., 2023. Selective retrieval with self-critique; foundational for understanding when to retrieve vs when to answer from training.
- Corrective Retrieval Augmented Generation; Yan et al., 2024. Adding a retrieval evaluator to filter irrelevant chunks before generation.
- Contextual Retrieval [Anthropic]; Anthropic's write-up and evaluation of contextual chunk enrichment; the benchmark data showing recall improvement is useful regardless of which provider you use.