Infrastructure & Serving

Running AI workloads reliably and cost-efficiently — from local models to production serving.

Senior DevSRE

5.1

Hosting Options

Where you run your model determines your cost structure, latency floor, and operational burden. Understanding the tradeoffs between API inference, self-hosting, and cloud-managed endpoints lets you pick the right option for each workload rather than defaulting to whatever is easiest to start with.

5 min →

5.2

Quantization & Compression

Quantization reduces the memory and compute cost of running a model by storing its weights in lower precision. Understanding the tradeoffs between FP16, INT8, and INT4, and the methods used to get there, lets you serve larger models on smaller hardware without silently degrading quality.

5 min →

5.3

Inference Serving

Inference servers are not just web servers that happen to call a model. They implement specific memory management and scheduling algorithms that determine whether your GPU serves 5 requests per second or 50. Understanding KV cache, PagedAttention, and continuous batching separates the teams that can scale from the teams that can't.

5 min →

5.4

Batching & Throughput

Throughput and latency are in direct tension in LLM serving. Understanding how batching works, why continuous batching is the production default, and how to separate throughput benchmarks from latency benchmarks prevents the common mistake of optimizing one while silently destroying the other.

5 min →

5.5

Latency Optimization

LLM latency has three distinct components: time to first token (TTFT), time between tokens (TBT), and end-to-end latency (E2E). Different use cases require optimizing different ones. Knowing which techniques reduce which component, and when prompt caching defeats itself, prevents wasted effort and avoids the most common serving regressions.

5 min →

5.6

Hardware Selection

Choosing the wrong GPU tier, or sizing VRAM based on model weights alone, is the most common hardware mistake in LLM deployment. Knowing the VRAM math, the GPU tiers, and when to use multi-GPU parallelism lets you right-size hardware before you need it rather than after an OOM in production.

5 min →

5.7

Containerization & Deployment

Containerizing an LLM inference server is fundamentally different from containerizing a web service. GPU passthrough, multi-stage weight management, and slow pod startup call for different patterns for health checks, rolling deployments, and Kubernetes configuration, patterns that most teams learn by breaking production first.

5 min →

5.8

Scaling & Cost Management

LLM serving costs accumulate differently from those of typical web services: GPU-hours are expensive, autoscaling on CPU metrics is wrong, and scale-to-zero creates cold-start latency that makes it unsuitable for interactive workloads. Knowing the right signals to scale on and how to build the cost math keeps infrastructure expenses from becoming a surprise.

5 min →

5.9

Fine-Tuning: When & Why

Fine-tuning is one of several ways to adapt a model to a task — and often the most expensive, slowest, and most fragile. This module is a decision framework: when to fine-tune, when not to, and what you give up either way.

5 min →

5.10

LoRA, QLoRA & PEFT

LoRA lets you adapt a large model by training only a tiny fraction of its parameters — keeping the base weights frozen and adding small trainable matrices on top. This module covers the mechanics, the quantised variant QLoRA, and what production adapter serving actually looks like.

5 min →

5.11

Synthetic Data for Training & Distillation

You can use a large model to generate training data for a smaller one — but the pipeline has failure modes that are hard to detect and expensive to fix once they're baked into weights. This module covers how to build a synthetic data pipeline that doesn't train failure modes into your model.

5 min →

5.12

Sovereign & Air-Gapped AI Architecture

Some data cannot leave your environment. Air-gapped AI deployments run the full stack — embeddings, vector database, and inference — entirely on-premise with no internet access. The architecture is straightforward; the hard parts are model provenance, patch strategy, and keeping the system from going stale.

5 min →

5.13

Caching & Latency Engineering

LLM inference is slow and expensive. Four independent caching layers can cut both — but each operates at a different point in the stack with different invalidation needs. Applying the wrong cache to the wrong layer is worse than no cache at all.

6 min →
