Layer 1: Surface
When a language model generates a response, it does not recompute attention from scratch for each token. It keeps a running record of the keys and values it has already computed, called the KV cache (key-value cache), and reuses them for each new token. Without this cache, each new token would require re-running attention over the entire sequence so far, so a 500-token response would redo a rapidly growing amount of work at every step. With the cache, each decode step processes only the new token.
The KV cache is what makes autoregressive generation practical. It is also what consumes most of your VRAM under production load.
The inference serving stack from bottom to top:
- Attention kernels (the math): Flash Attention computes attention in tiled blocks that stay in fast SRAM rather than repeatedly reading from slow HBM, reducing memory bandwidth pressure.
- KV cache management (where those intermediate computations live): PagedAttention (introduced by vLLM) manages the KV cache in pages, like OS virtual memory, so VRAM is used efficiently across requests of different lengths.
- Batching (how to schedule multiple requests together): continuous batching processes requests as they arrive rather than waiting for a full batch to form.
- Serving layer: the HTTP interface that receives requests, routes them to the right worker, and streams responses back.
Why it matters
The difference between a naive inference server and a production one (vLLM, TGI, llama.cpp) is primarily these algorithms. A naive implementation serving a 13B model might handle 2–5 concurrent requests before OOMing. A well-configured production server serving the same model can handle dozens of concurrent requests with higher throughput and lower latency.
Production Gotcha
KV cache memory is the dominant VRAM consumer at inference time: a 70B model may fit in VRAM for single requests but OOM under concurrent load because the KV cache for 50 simultaneous requests doesn't fit. VRAM usage scales with batch size times sequence length times model depth.
The mistake is testing with single requests and assuming you have linear headroom. KV cache grows with batch size: serving 50 concurrent requests with 2K-token contexts requires roughly 50x the KV cache VRAM of serving one, and for a 70B model without GQA that can reach hundreds of gigabytes, more than the model weights themselves. Always load-test your serving configuration with realistic concurrency and context lengths before going to production.
Layer 2: Guided
KV cache: what it is and why it matters
```python
from dataclasses import dataclass

@dataclass
class KVCacheEstimate:
    """
    Estimate KV cache memory for a given model and workload.

    Formula: 2 * num_layers * num_kv_heads * head_dim * seq_len
             * batch_size * bytes_per_element
    The factor 2 accounts for both keys and values.
    """
    num_layers: int       # transformer layers (depth)
    num_kv_heads: int     # key/value heads (may differ from query heads with GQA)
    head_dim: int         # dimension per head
    seq_len: int          # max sequence length (input + output)
    batch_size: int       # concurrent requests
    bytes_per_element: float = 2.0  # FP16 = 2 bytes

    def total_gb(self) -> float:
        total_bytes = (
            2 * self.num_layers * self.num_kv_heads * self.head_dim
            * self.seq_len * self.batch_size * self.bytes_per_element
        )
        return total_bytes / (1024 ** 3)

# Llama-3 70B approximate parameters
kv_single = KVCacheEstimate(
    num_layers=80,
    num_kv_heads=8,   # Grouped Query Attention (GQA) reduces KV heads
    head_dim=128,
    seq_len=4096,
    batch_size=1,
)
kv_concurrent = KVCacheEstimate(
    num_layers=80,
    num_kv_heads=8,
    head_dim=128,
    seq_len=4096,
    batch_size=50,
)
print(f"Single request KV cache: ~{kv_single.total_gb():.1f} GB")
print(f"50 concurrent requests: ~{kv_concurrent.total_gb():.1f} GB")
# Single request KV cache: ~1.2 GB
# 50 concurrent requests: ~62.5 GB
```
Grouped Query Attention (GQA), used in Llama-3 and other recent models, reduces the number of KV heads (here 8 heads vs 64 query heads), which directly reduces KV cache size. This is why newer architectures are more efficient to serve at scale.
PagedAttention: paged memory for KV cache
Before PagedAttention (introduced in the vLLM paper, 2023), inference servers pre-allocated a contiguous VRAM block for each request’s maximum possible KV cache. Most requests use far less than the maximum, but the memory was reserved regardless: wasted.
PagedAttention manages KV cache in fixed-size pages, similar to how an OS manages virtual memory. Pages are allocated only as needed and freed when the request completes. Non-contiguous pages are fine because the system tracks the mapping.
The result: VRAM fragmentation is eliminated, and the server can serve significantly more concurrent requests with the same VRAM budget.
```python
from dataclasses import dataclass, field

@dataclass
class PagedKVCache:
    """
    Conceptual model of paged KV cache allocation.
    In vLLM, pages are called 'blocks' and the block size is configurable.
    """
    page_size_tokens: int = 16   # tokens per page
    total_pages: int = 1000      # total pages available
    allocated_pages: dict = field(default_factory=dict)  # request_id -> list[page_id]
    free_pages: set = field(default_factory=set)

    def __post_init__(self):
        self.free_pages = set(range(self.total_pages))

    def allocate(self, request_id: str, num_tokens: int) -> bool:
        """Allocate pages for a new request. Returns True if successful."""
        pages_needed = (num_tokens + self.page_size_tokens - 1) // self.page_size_tokens
        if len(self.free_pages) < pages_needed:
            return False  # OOM: cannot serve this request now
        allocated = set(list(self.free_pages)[:pages_needed])
        self.free_pages -= allocated
        self.allocated_pages[request_id] = list(allocated)
        return True

    def extend(self, request_id: str) -> bool:
        """Extend allocation by one page as generation continues."""
        if not self.free_pages:
            return False
        page = next(iter(self.free_pages))
        self.free_pages.discard(page)
        self.allocated_pages[request_id].append(page)
        return True

    def free(self, request_id: str):
        """Release all pages when a request completes."""
        pages = self.allocated_pages.pop(request_id, [])
        self.free_pages.update(pages)
        return len(pages)  # pages returned to pool
```
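The win from paging is easiest to see with back-of-envelope arithmetic. The sketch below (illustrative numbers, not from any real trace) compares how many pages a contiguous pre-allocation scheme reserves against what paged allocation actually uses, using the same round-up arithmetic as `allocate()`:

```python
def pages_needed(tokens: int, page_size: int = 16) -> int:
    """Pages required to hold `tokens`, rounding up."""
    return (tokens + page_size - 1) // page_size

max_len = 2048                               # contiguous scheme reserves this per request
actual_lengths = [150, 600, 90, 1200, 300]   # tokens each request really used (made up)

reserved = len(actual_lengths) * pages_needed(max_len)   # contiguous: always the max
used = sum(pages_needed(n) for n in actual_lengths)      # paged: only what's needed

print(f"contiguous reservation: {reserved} pages")  # 640 pages
print(f"paged allocation:       {used} pages")      # 148 pages
print(f"wasted fraction:        {1 - used / reserved:.0%}")
```

With these (hypothetical) request lengths, contiguous pre-allocation wastes over three quarters of the reserved VRAM, which is exactly the headroom PagedAttention reclaims for additional concurrent requests.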
Continuous batching
Static batching: collect N requests, run them all together, return results, collect next N. This means early-finishing requests wait for the slowest request in the batch.
Continuous batching: new requests join the batch as soon as a slot opens. A request that finishes after 50 tokens frees its slot immediately; a new request fills it before the rest of the batch finishes.
```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class ContinuousBatchScheduler:
    """
    Simplified continuous batching scheduler.
    Real implementations (vLLM, TGI) are significantly more complex.
    """
    max_batch_size: int = 32
    active_requests: dict = field(default_factory=dict)  # id -> state
    waiting_queue: deque = field(default_factory=deque)

    def add_request(self, request_id: str, prompt_tokens: int):
        if len(self.active_requests) < self.max_batch_size:
            self._start_request(request_id, prompt_tokens)
        else:
            self.waiting_queue.append((request_id, prompt_tokens))

    def _start_request(self, request_id: str, prompt_tokens: int):
        self.active_requests[request_id] = {
            "tokens_generated": 0,
            "prompt_tokens": prompt_tokens,
        }

    def step(self) -> dict:
        """
        Run one decode step for all active requests.
        After each step, completed requests are removed and waiting
        requests are admitted.
        """
        completed = []
        for req_id, state in list(self.active_requests.items()):
            state["tokens_generated"] += 1
            if self._is_done(req_id, state):
                completed.append(req_id)
        for req_id in completed:
            del self.active_requests[req_id]
            # Immediately admit the next request from the queue
            if self.waiting_queue:
                next_id, next_prompt_tokens = self.waiting_queue.popleft()
                self._start_request(next_id, next_prompt_tokens)
        return {"active": len(self.active_requests), "completed_this_step": len(completed)}

    def _is_done(self, req_id: str, state: dict) -> bool:
        # Placeholder: real logic checks for an EOS token or max_tokens
        return state["tokens_generated"] >= 200
```
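To see why slot refilling matters, here is a toy simulation (invented request lengths, one token per step per active request, prefill cost ignored) comparing total decode steps under static and continuous batching for the same workload:

```python
from collections import deque

lengths = [50, 200, 80, 200, 60, 200, 70, 200]  # tokens each request needs (made up)

def static_batching(lengths, slots=4):
    """Each batch of `slots` requests runs until its longest member finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps

def continuous_batching(lengths, slots=4):
    """A finished request's slot is refilled from the queue after each step."""
    queue = deque(lengths)
    active = [queue.popleft() for _ in range(min(slots, len(queue)))]
    steps = 0
    while active:
        steps += 1
        active = [n - 1 for n in active]                 # one decode step for everyone
        still_running = [n for n in active if n > 0]     # drop finished requests
        while len(still_running) < slots and queue:      # refill freed slots
            still_running.append(queue.popleft())
        active = still_running
    return steps

print(static_batching(lengths))      # 400 steps: short requests wait on long ones
print(continuous_batching(lengths))  # fewer steps: slots never sit idle
```

The gap widens as request lengths become more skewed; with uniform lengths the two strategies converge, which is why continuous batching matters most for mixed interactive workloads.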
Inference server comparison
| Server | Primary use case | Strengths | Weaknesses |
|---|---|---|---|
| vLLM | High-throughput GPU serving | PagedAttention, continuous batching, broad model support, active development | Heavyweight; higher memory overhead than llama.cpp |
| TGI (Text Generation Inference) | Production GPU serving (Hugging Face) | Strong HF model integration, quantization support, good tooling | Slightly less throughput than vLLM in some benchmarks |
| llama.cpp | CPU inference, edge, development | Runs on CPU and Apple Silicon, GGUF format, low memory footprint | Lower GPU throughput than vLLM/TGI |
| Ollama | Local development | Easy setup, model management, API compatible with OpenAI client | Not designed for production multi-user serving |
For production GPU serving, vLLM and TGI are the primary choices. llama.cpp and Ollama are excellent for development and edge deployments.
Speculative decoding (brief)
Speculative decoding uses a fast small “draft” model to generate candidate tokens, which the large “target” model then verifies in parallel. If the target model accepts the draft tokens (they match what it would have generated), you get multiple tokens per forward pass of the target model.
The speedup depends on the draft model’s acceptance rate for your specific workload: typically 1.5–3x throughput improvement when the acceptance rate is high. It is most effective for tasks with predictable output (code generation, structured responses) and less effective for creative or unpredictable generation.
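The relationship between acceptance rate and speedup can be made concrete with the standard expected-value argument (following Leviathan et al.), under the simplifying assumption that each draft token is accepted independently with probability `alpha`:

```python
# Expected tokens produced per target-model forward pass when drafting k
# tokens, assuming each is accepted independently with probability alpha.
# Real acceptance rates are workload-dependent and not i.i.d.; this is a
# rough model, not a benchmark.
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Accepted draft prefix plus the one token the target model always
    contributes (either a correction or the next token)."""
    # sum_{i=0..k} alpha^i = (1 - alpha^(k+1)) / (1 - alpha)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_pass(alpha, k=5):.2f} tokens/pass")
```

At a 90% acceptance rate the target model emits over 4.5 tokens per forward pass, while at 50% it barely reaches 2, which is why speculative decoding shines on predictable output and fades on creative generation.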
Layer 3: Deep Dive
Flash Attention and memory bandwidth
Standard attention requires materializing the full attention matrix (sequence length × sequence length) in HBM (high-bandwidth memory). For long contexts this matrix is large, and reading and writing it repeatedly is the bottleneck, not the FLOPs.
Flash Attention (Dao et al., 2022) computes attention in tiles that fit in SRAM (the fast on-chip cache), fusing the operations to avoid repeated HBM reads. The arithmetic intensity increases dramatically, and memory bandwidth pressure drops. Flash Attention 2 and 3 continue this direction with better parallelism.
In practical terms: Flash Attention enables longer context lengths at the same memory budget and reduces the latency of attention computation, particularly for long sequences.
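A rough calculation shows why the attention matrix, not the weights, becomes the problem at long context. The sketch below sizes the score matrix across all heads for a single batch element, assuming FP16 scores (real kernels often accumulate in FP32, so treat this as a lower bound):

```python
# Size of the full attention score matrix that standard attention would
# materialize in HBM: seq_len x seq_len per head, across num_heads heads.
def attn_matrix_gb(seq_len: int, num_heads: int, bytes_per_el: int = 2) -> float:
    return seq_len * seq_len * num_heads * bytes_per_el / 1024**3

for n in (2048, 8192, 32768):
    print(f"seq_len={n}: {attn_matrix_gb(n, num_heads=64):.1f} GB")
# seq_len=2048: 0.5 GB
# seq_len=8192: 8.0 GB
# seq_len=32768: 128.0 GB
```

The quadratic growth is the point: Flash Attention never materializes this matrix at all, streaming tiles through SRAM instead, so its memory footprint stays linear in sequence length.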
Serving configuration knobs
Real serving decisions involve tuning several parameters:
| Parameter | Effect | Tradeoff |
|---|---|---|
| max_batch_size | Maximum concurrent requests | Higher throughput vs higher TTFT (time to first token) |
| max_model_len (vLLM) | Maximum context length | Longer context vs more KV cache VRAM |
| gpu_memory_utilization (vLLM) | Fraction of VRAM reserved for KV cache | More concurrent requests vs less buffer for other uses |
| tensor_parallel_size | Number of GPUs to split the model across | Needed for models that don't fit on one GPU; adds inter-GPU communication overhead |
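As a hypothetical example of how these knobs appear in practice, a vLLM launch for a 70B model on four GPUs might look like the following. Here `--max-num-seqs` is vLLM's closest analogue of max_batch_size; flag names can shift between releases, so check your installed version's `--help`:

```shell
# Hypothetical vLLM launch; model name and values are illustrative.
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90 \
    --max-num-seqs 64
```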
Grouped Query Attention and KV cache size
Modern models increasingly use Grouped Query Attention (GQA) or Multi-Query Attention (MQA) to reduce KV cache size. In standard Multi-Head Attention (MHA), each query head has its own key and value heads: the KV cache scales with num_heads. In GQA, multiple query heads share key/value heads. In MQA, all query heads share a single key/value head.
The impact on serving is significant: a model with 64 query heads but only 8 KV heads (like Llama-3 70B) has 8x smaller KV cache than the equivalent MHA model, enabling proportionally higher batch sizes at the same VRAM budget.
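The 8x figure falls straight out of the per-token KV cache formula used earlier. A quick sketch, comparing a hypothetical MHA variant of the same architecture against the GQA configuration:

```python
# KV cache bytes per token: 2 (K and V) * layers * kv_heads * head_dim * bytes.
# The MHA variant (64 KV heads) is hypothetical, for comparison only.
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_el: int = 2) -> int:
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_el

mha = kv_bytes_per_token(num_layers=80, num_kv_heads=64, head_dim=128)
gqa = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

print(f"MHA: {mha / 1024:.0f} KiB per token")   # 2560 KiB
print(f"GQA: {gqa / 1024:.0f} KiB per token")   # 320 KiB
print(f"ratio: {mha // gqa}x")                  # 8x
```

Since KV cache per token sets the ceiling on concurrent tokens per GPU, that 8x reduction translates directly into 8x more batched context at the same VRAM budget.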
Further reading
- Efficient Memory Management for Large Language Model Serving with PagedAttention; Kwon et al., 2023. The vLLM paper; the core reference for paged KV cache management and continuous batching implementation.
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness; Dao et al., 2022. The Flash Attention paper; explains the tiled computation approach and its memory bandwidth analysis.
- Fast Inference from Transformers via Speculative Decoding; Leviathan et al., 2023. The speculative decoding paper; explains the draft-verify mechanism and conditions for speedup.