πŸ€– AI Explained

Latency Optimization

LLM latency has three distinct components (TTFT, TBT, and E2E), and different use cases call for optimizing different ones. Knowing which techniques reduce which component, and when prompt caching defeats itself, prevents wasted effort and avoids the most common serving regressions.

Layer 1: Surface

LLM latency is not a single number. It has three components that matter in different ways:

  • TTFT (Time to First Token): how long before the user sees any response. Determined by the prefill phase: how long it takes to process the input. This is what makes a chat interface feel snappy or sluggish.
  • TBT (Time Between Tokens): how quickly tokens arrive after the first one. Determined by the decode phase throughput. This is what makes streaming text feel smooth or choppy.
  • E2E (End-to-End time): total time from request to complete response. Relevant for batch processing and non-streaming applications.

For a streaming chat interface, TTFT dominates the user experience: users will wait 2 seconds before the first word but tolerate slower subsequent tokens. For a batch pipeline that processes thousands of documents, E2E matters and TTFT is irrelevant. Knowing which component you are optimizing changes which techniques you reach for.
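A quick back-of-the-envelope calculation (with illustrative numbers, not benchmarks) shows how the components combine:

```python
def e2e_ms(ttft_ms: float, tbt_ms: float, output_tokens: int) -> float:
    """E2E = prefill time (TTFT) + one TBT gap for each token after the first."""
    return ttft_ms + tbt_ms * (output_tokens - 1)

# Illustrative numbers: 400 ms prefill, 20 ms/token decode, 500-token answer
total = e2e_ms(400, 20, 500)
print(f"{total / 1000:.1f} s")  # 10.4 s total, but the first word appears at 0.4 s
```

With streaming, the perceived latency is the 0.4s TTFT; without it, the user stares at a spinner for the full 10.4s.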

The main levers:

| Technique | Reduces | How |
|---|---|---|
| Prompt caching | TTFT | Skip reprocessing a shared prompt prefix |
| Speculative decoding | TBT, E2E | Generate multiple tokens per target model pass |
| KV cache reuse | TTFT | Share computed KV state across requests with same prefix |
| Flash Attention | TTFT, TBT | Reduce attention computation memory overhead |
| Smaller models | All | Less compute per token |

Why it matters

Latency is user experience. An application that responds in under 500ms with streaming feels qualitatively different from one that takes 3 seconds before anything appears. Beyond UX, latency determines the cost of real-time applications: lower latency means fewer idle connections, fewer timeouts, and smaller infrastructure needed to meet SLAs.

Production Gotcha

Prompt caching is invalidated by any change to the cached prefix: a timestamp or session ID at the start of the system prompt defeats the cache entirely. Put all static content first in the prompt (system instructions, tool definitions, documents) and all variable content last (user message, session context). Any dynamic value in the cached prefix means every request is a cache miss.

This mistake is extremely common. Teams enable prompt caching for performance, then discover a session ID, request timestamp, or personalization token in the cached portion, and every request becomes a cache miss. The rule is structural: sort your prompt by stability. System instructions first. Tool definitions second. Retrieved documents third. User message last.
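The stability rule can be expressed directly in code. A minimal sketch (the section names and stability ranks here are illustrative, not an API):

```python
# Lower rank = more stable = earlier in the prompt (better for caching)
SECTION_STABILITY = {"system": 0, "tools": 1, "documents": 2, "history": 3, "user": 4}

def order_by_stability(sections: dict[str, str]) -> list[tuple[str, str]]:
    """Sort prompt sections so the static content forms the cacheable prefix."""
    return sorted(sections.items(), key=lambda kv: SECTION_STABILITY[kv[0]])

ordered = order_by_stability({"user": "hi", "system": "Be terse.", "tools": "[]"})
print([name for name, _ in ordered])  # ['system', 'tools', 'user']
```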


Layer 2: Guided

Measuring the three latency components

import time
from dataclasses import dataclass, field

@dataclass
class LatencyMeasurement:
    request_id: str
    request_start: float
    first_token_time: float | None = None
    last_token_time: float | None = None
    # default_factory gives each instance its own list (a bare [] default
    # would be shared, and None breaks the type annotation)
    token_timestamps: list[float] = field(default_factory=list)

    def record_token(self):
        now = time.monotonic()
        if self.first_token_time is None:
            self.first_token_time = now
        self.last_token_time = now
        self.token_timestamps.append(now)

    def ttft_ms(self) -> float | None:
        if self.first_token_time is None:
            return None
        return (self.first_token_time - self.request_start) * 1000

    def tbt_p50_ms(self) -> float | None:
        """Median time between tokens (decode throughput indicator)."""
        if len(self.token_timestamps) < 2:
            return None
        gaps = [
            (self.token_timestamps[i+1] - self.token_timestamps[i]) * 1000
            for i in range(len(self.token_timestamps) - 1)
        ]
        gaps.sort()
        return gaps[len(gaps) // 2]

    def e2e_ms(self) -> float | None:
        if self.last_token_time is None:
            return None
        return (self.last_token_time - self.request_start) * 1000

    def tokens_per_second(self) -> float | None:
        """Decode throughput: tokens generated after the first, per second."""
        if len(self.token_timestamps) < 2 or self.last_token_time is None:
            return None
        duration = self.last_token_time - self.first_token_time
        # The first token ends prefill; only subsequent tokens measure decode
        return (len(self.token_timestamps) - 1) / duration if duration > 0 else None


def instrument_streaming_request(stream_fn, prompt: str) -> LatencyMeasurement:
    """Wrap a streaming inference call to capture all three latency components."""
    measurement = LatencyMeasurement(
        request_id=str(time.monotonic()),
        request_start=time.monotonic(),
    )
    for token in stream_fn(prompt):
        measurement.record_token()
    return measurement

Prompt caching: structure your prompts correctly

Prompt caching (supported by Anthropic, OpenAI, and most production serving frameworks) works by hashing the prefix of a prompt and checking if the KV state for that prefix is already computed. On a cache hit, the server skips the prefill computation for the cached portion.

The requirement: the cached portion must be a prefix: it must start at the beginning of the prompt and extend to some fixed point. Any change to content before that point is a cache miss.
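A simplified model of that lookup (real servers hash fixed-size token blocks, not character offsets, but the failure mode is the same):

```python
import hashlib

def cache_key(prompt: str, boundary: int) -> str:
    """Key the cache on the static prefix up to the cache boundary."""
    return hashlib.sha256(prompt[:boundary].encode()).hexdigest()

static = "You are a helpful assistant. Respond in JSON.\n"
hit_a = cache_key(static + "User: hi", len(static))
hit_b = cache_key(static + "User: bye", len(static))      # same prefix -> same key -> hit
miss = cache_key("Session: 42\n" + static, len(static))   # changed prefix -> new key -> miss
print(hit_a == hit_b, hit_a == miss)  # True False
```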

from dataclasses import dataclass

@dataclass
class PromptStructure:
    """
    Correct prompt structure for maximum cache hit rate.
    Static content first. Variable content last.
    Cache boundary can be placed after any static section.
    """
    system_instructions: str        # STATIC: cached β€” same for all requests
    tool_definitions: list[dict]    # STATIC: cached β€” same for all requests
    retrieved_documents: str        # SEMI-STATIC: cached per document set
    conversation_history: list[dict] # VARIABLE: not cached (grows each turn)
    user_message: str               # VARIABLE: never cached (unique per request)

# Anti-patterns that defeat caching:
BAD_SYSTEM_PROMPT = """
Current time: {timestamp}         ← cache miss on every request
Session ID: {session_id}          ← cache miss on every request
User name: {user_name}            ← cache miss on every request
You are a helpful assistant...    ← this useful static content never gets cached
"""

GOOD_SYSTEM_PROMPT = """
You are a helpful assistant.      ← static; cached
You must respond in JSON format.  ← static; cached
Current time: {timestamp}         ← variable; OK here because it comes AFTER static content
                                    But better to move timestamp to user message.
"""

def build_cache_friendly_prompt(
    system: str,
    tools: list[dict],
    documents: list[str],
    history: list[dict],
    user_message: str,
) -> list[dict]:
    """
    Build a message list with cache-friendly ordering.
    The system prompt and tools are static; they form the cacheable prefix.
    """
    # Build the static context block β€” this is what gets cached
    static_context = system
    if documents:
        static_context += "\n\n## Reference Documents\n" + "\n\n".join(documents)

    messages = [
        {"role": "system", "content": static_context},
    ]
    # Conversation history and user message are dynamic β€” they come after the cached prefix
    messages.extend(history)
    messages.append({"role": "user", "content": user_message})
    return messages

Speculative decoding: when to use it

Speculative decoding uses a small, fast draft model to generate candidate tokens, then verifies them in a single forward pass of the large target model. Tokens that match what the target would have generated are accepted; the first mismatch causes a rollback.

@dataclass
class SpeculativeDecodingConfig:
    draft_model: str            # small, fast model
    target_model: str           # large, accurate model
    num_draft_tokens: int = 5   # tokens to speculate per step
    acceptance_rate: float = 0.0  # measured at runtime

    def expected_speedup(self) -> float:
        """
        Rough theoretical speedup from speculative decoding.
        Each verification pass emits the accepted draft tokens plus one
        token from the target model (the correction or bonus token), so
        at acceptance_rate=0.8 and num_draft_tokens=5 a pass yields
        ~5 tokens instead of 1. Real speedup is lower once draft-model
        cost is counted; it saturates around 3-4x for typical workloads.
        """
        expected_accepted = self.num_draft_tokens * self.acceptance_rate
        # The target pass always contributes one token, even on full rollback
        return expected_accepted + 1.0

SPECULATIVE_DECODING_WORKS_WELL = [
    "Code generation β€” predictable syntax patterns β†’ high acceptance rate",
    "Structured output (JSON/XML) β€” constrained token space β†’ high acceptance rate",
    "Repetitive or formulaic text",
    "When a good smaller model exists for the same domain",
]

SPECULATIVE_DECODING_WORKS_POORLY = [
    "Creative writing β€” high entropy output β†’ low acceptance rate",
    "Short responses β€” overhead of draft model not worth it",
    "When no smaller model is available for the domain",
    "When VRAM is already tight β€” draft model consumes additional memory",
]
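The accept/reject mechanics can be sketched with greedy verification (a simplification: production systems use rejection sampling so the output matches the target model's distribution exactly):

```python
def verify_step(draft: list[str], target: list[str]) -> list[str]:
    """One verification pass. `target` holds the target model's tokens for the
    same positions (len(draft) + 1 of them, from a single forward pass).
    Accept drafts while they match; on the first mismatch, emit the target's
    correction token; if every draft matches, emit the target's bonus token."""
    out: list[str] = []
    for d, t in zip(draft, target):
        if d != t:
            out.append(t)        # correction token: pass ends here
            return out
        out.append(d)
    out.append(target[len(draft)])  # bonus token: all drafts accepted
    return out

print(verify_step(["a", "b", "x"], ["a", "b", "c", "d"]))  # ['a', 'b', 'c']
print(verify_step(["a", "b", "c"], ["a", "b", "c", "d"]))  # ['a', 'b', 'c', 'd']
```

Every pass emits at least one token, which is why the worst case (acceptance near zero) degrades to normal decoding speed plus the draft model's overhead, never below it.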

KV cache reuse and prefix sharing

For applications that send many requests with the same prefix (e.g., same system prompt, same document), prefix caching at the serving layer avoids recomputing KV state for that prefix on every request. This is distinct from prompt caching at the API layer: at the serving layer, the KV tensors themselves are stored and reused.

vLLM implements automatic prefix caching: if a new request begins with a prefix that matches a recently processed request, the existing KV blocks are reused rather than recomputed. This can eliminate TTFT almost entirely for requests that share a long common prefix (e.g., a 2K-token system prompt that all requests share).
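A toy model of block-level prefix reuse (the block size and the set-as-cache are simplifications of vLLM's paged KV blocks):

```python
def prefill_skipped(tokens: list[str], cache: set, block: int = 4) -> int:
    """Count tokens whose prefill is skipped because a full block of KV state
    for the same prefix is already cached. A block's identity depends on the
    entire prefix before it, so any earlier change invalidates all later blocks."""
    skipped = 0
    for i in range(0, len(tokens) - len(tokens) % block, block):
        key = tuple(tokens[: i + block])
        if key in cache:
            skipped += block
        else:
            cache.add(key)
    return skipped

cache: set = set()
shared = list("ABCDEFGH")                      # shared 8-token system prompt
prefill_skipped(shared + list("WXYZ"), cache)  # cold start: nothing skipped
print(prefill_skipped(shared + list("QRST"), cache))  # 8 -- shared prefix reused
```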


Layer 3: Deep Dive

TTFT vs TBT SLA table by use case

| Use case | TTFT SLA | TBT SLA | E2E SLA | Primary optimization |
|---|---|---|---|---|
| Interactive chat (streaming) | under 500ms | under 50ms | not applicable | TTFT: prompt caching, fast prefill |
| Copilot / autocomplete | under 200ms | under 30ms | not applicable | TTFT: small model + speculative decoding |
| Document summarization | 1–3s acceptable | not critical | under 30s | E2E: batch + high throughput |
| Background batch pipeline | not applicable | not applicable | hours acceptable | E2E: maximum throughput, spot instances |
| Real-time classification | under 100ms | not applicable | under 200ms | Short output: minimize decode phase |

Flash Attention latency impact

Flash Attention’s primary benefit is memory bandwidth reduction, which translates to latency in two ways:

  1. TTFT reduction for long contexts: standard attention on a 32K token context materializes a 32K Γ— 32K attention matrix in HBM; Flash Attention tiles this computation to avoid the full matrix read/write, reducing prefill time significantly for long contexts.
  2. Enabling longer contexts: without Flash Attention, very long contexts either OOM or require chunked computation with high overhead. Flash Attention makes 100K+ token contexts practical.

For typical short-to-medium context requests (under 4K tokens), Flash Attention’s latency improvement is modest. For long-context applications, it can reduce TTFT by 50% or more.
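The arithmetic behind the long-context claim (per attention head, fp16):

```python
def attn_matrix_bytes(seq_len: int, dtype_bytes: int = 2) -> int:
    """Bytes to materialize one full seq_len x seq_len attention score matrix."""
    return seq_len * seq_len * dtype_bytes

gib = attn_matrix_bytes(32 * 1024) / 2**30
print(f"{gib:.0f} GiB per head")  # 2 GiB per head that Flash Attention never writes to HBM
```

Multiply by head count and batch size, and it is this HBM read/write traffic, not FLOPs, that dominates long-context prefill time.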

Request coalescing and prefix grouping

For batch workloads where you control request submission, coalescing requests with the same prefix allows the serving layer to compute the prefix KV once and share it:

def group_by_common_prefix(requests: list[dict], prefix: str) -> dict:
    """
    Group requests by whether they share a common prefix.
    Serving frameworks with prefix caching will benefit automatically,
    but explicit grouping maximizes cache hit rate by submitting prefix-sharing
    requests close together in time (before cached KV is evicted).
    """
    matching = [r for r in requests if r["prompt"].startswith(prefix)]
    non_matching = [r for r in requests if not r["prompt"].startswith(prefix)]
    return {"cached_prefix_group": matching, "other": non_matching}

Submit the grouped requests together to maximize the window during which the cached KV blocks are still resident. Spreading prefix-sharing requests across a long time window risks KV eviction between requests.


Latency Optimization: Check your understanding

Q1

A team implements prompt caching on their system prompt, which contains: 'Session ID: {session_id}\nCurrent time: {timestamp}\nYou are a helpful assistant. Always respond in JSON format.' After deploying, they observe a 0% cache hit rate. What is causing this?

Q2

A streaming chat application wants to minimize the latency until users see the first word of the response. Which latency component should they optimize, and what technique most directly reduces it?

Q3

A team implements speculative decoding with a 1B draft model and a 70B target model. They set num_draft_tokens=5. After deployment, they observe the speedup is close to 1x (no improvement) for their creative writing use case. Why?

Q4

An application sends thousands of requests per minute, all with the same 3,000-token system prompt followed by different user messages. A new engineer suggests enabling prefix caching in vLLM and also spreading requests across multiple geographically distributed vLLM instances for lower latency. What conflict exists in this proposal?

Q5

A team builds a document Q&A system. Each request attaches a 10,000-token document to the prompt. They enable prompt caching. Most documents are unique per request. After a week, they observe almost no reduction in TTFT. Why, and what should they try instead?