πŸ€– AI Explained

Tracing & Structured Logging

Learn to instrument LLM systems with structured traces that make debugging and performance analysis practical: what to log, how to structure it, and how to avoid PII liability.

Layer 1: Surface

When an LLM system fails in production, the first question is always: what did it receive, and what did it return? Without structured logging, answering this question means manual reconstruction from fragments. With structured logging, it means querying a trace store.

An LLM trace is a structured record of a single request’s lifecycle: the prompt that went in, the model that handled it, the tokens consumed, the latency at each step, the tool calls made, and the final output. A trace is composed of spans: one for the overall request, one for each LLM call, one for each tool call, one for each retrieval operation.

Good tracing makes the difference between debugging in minutes and debugging in days. It also enables continuous quality measurement: if every response is logged with a correlation ID, you can later route a sample of responses to an LLM judge and get continuous quality metrics without blocking any real requests.

Why it matters

LLM systems have more moving parts than traditional APIs: model calls, retrieval operations, tool executions, prompt assembly, response parsing. Each step can fail in ways that produce a valid-looking output with wrong content. Without traces, these failures are invisible until a user complains.

Production Gotcha

Logging full prompts and responses is invaluable for debugging, but it creates a PII liability whenever user data appears in prompts. Establish a scrubbing policy before logging goes live: retrofitting PII scrubbing onto existing logs is expensive and error-prone, and a prompt that contains a user's name, email, or health data is a data breach waiting to be discovered.

Define your scrubbing policy in the design phase. Build a scrubbing step into the logging pipeline before the first line of trace data is written.


Layer 2: Guided

The span model

A trace is a tree of spans. The root span represents the entire request; child spans represent sub-operations.

import time
import uuid
import json
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Span:
    span_id: str
    trace_id: str             # Links all spans in one request
    parent_span_id: str | None
    operation: str            # "llm_call", "tool_call", "retrieval", "request"
    start_time: float         # Unix timestamp (seconds since epoch, float precision)
    end_time: float | None = None
    attributes: dict[str, Any] = field(default_factory=dict)
    error: str | None = None

    @property
    def duration_ms(self) -> float | None:
        if self.end_time is None:
            return None
        return (self.end_time - self.start_time) * 1000

    def finish(self, error: str | None = None) -> None:
        self.end_time = time.time()
        if error:
            self.error = error

    def to_dict(self) -> dict:
        return {
            "span_id": self.span_id,
            "trace_id": self.trace_id,
            "parent_span_id": self.parent_span_id,
            "operation": self.operation,
            "start_time": self.start_time,
            "end_time": self.end_time,
            "duration_ms": self.duration_ms,
            "attributes": self.attributes,
            "error": self.error,
        }

class Tracer:
    def __init__(self, trace_id: str | None = None):
        self.trace_id = trace_id or uuid.uuid4().hex  # 32 hex chars, no dashes (W3C-compatible)
        self.spans: list[Span] = []

    def start_span(
        self,
        operation: str,
        parent: Span | None = None,
        attributes: dict | None = None,
    ) -> Span:
        span = Span(
            span_id=uuid.uuid4().hex[:16],  # 16 hex chars (W3C span ID length)
            trace_id=self.trace_id,
            parent_span_id=parent.span_id if parent else None,
            operation=operation,
            start_time=time.time(),
            attributes=attributes or {},
        )
        self.spans.append(span)
        return span

    def emit(self, sink) -> None:
        """Write all completed spans to the sink (file, OTLP exporter, etc.)."""
        for span in self.spans:
            sink.write(json.dumps(span.to_dict()) + "\n")
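Emitted via `emit`, each span becomes one JSON line. A request with a single LLM call might produce output like the following (IDs and timestamps are illustrative):

```
{"span_id": "b2d1f7e09c3a41aa", "trace_id": "7f3c9e2a", "parent_span_id": null, "operation": "request", "start_time": 1700000000.0, "end_time": 1700000001.2, "duration_ms": 1200.0, "attributes": {"feature_id": "support_chat"}, "error": null}
{"span_id": "c4e8a1b52d6f7703", "trace_id": "7f3c9e2a", "parent_span_id": "b2d1f7e09c3a41aa", "operation": "llm_call", "start_time": 1700000000.1, "end_time": 1700000001.0, "duration_ms": 900.0, "attributes": {"model_id": "balanced", "input_tokens": 512, "output_tokens": 128}, "error": null}
```

Matching `trace_id` values tie the lines together; `parent_span_id` reconstructs the tree.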

What to log at minimum

Every production request must log at a minimum:

@dataclass
class RequestLog:
    # Identity
    request_id: str           # Unique per request
    trace_id: str             # May span multiple services
    session_id: str | None    # User session, if applicable

    # Model
    model_id: str             # Exact model version string
    feature_id: str           # Which product feature made this call

    # Tokens
    input_tokens: int
    output_tokens: int
    total_tokens: int

    # Performance
    latency_ms: float
    time_to_first_token_ms: float | None  # For streaming responses

    # Outcome
    finish_reason: str        # "end_turn", "max_tokens", "error", etc.
    error: str | None

    # Derived cost (compute at log time)
    estimated_cost_usd: float

def build_request_log(
    request_id: str,
    model_id: str,
    feature_id: str,
    response,           # Model response object
    latency_ms: float,
) -> RequestLog:
    usage = response.usage
    cost = estimate_cost(model_id, usage.input_tokens, usage.output_tokens)
    return RequestLog(
        request_id=request_id,
        trace_id=request_id,  # Simple case; use distributed trace ID in multi-service systems
        session_id=None,
        model_id=model_id,
        feature_id=feature_id,
        input_tokens=usage.input_tokens,
        output_tokens=usage.output_tokens,
        total_tokens=usage.input_tokens + usage.output_tokens,
        latency_ms=latency_ms,
        time_to_first_token_ms=None,
        finish_reason=response.stop_reason,
        error=None,
        estimated_cost_usd=cost,
    )

COST_PER_MILLION_TOKENS = {
    # Example pricing β€” always verify with provider documentation
    "fast":     {"input": 0.25,  "output": 1.25},
    "balanced": {"input": 3.00,  "output": 15.00},
    "frontier": {"input": 15.00, "output": 75.00},
}

def estimate_cost(model_id: str, input_tokens: int, output_tokens: int) -> float:
    # Unknown models fall back to a rough default rather than raising
    prices = COST_PER_MILLION_TOKENS.get(model_id, {"input": 1.0, "output": 5.0})
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000
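As a sanity check on the arithmetic, using the illustrative "balanced" prices above ($3.00 per million input tokens, $15.00 per million output tokens):

```python
# Worked example: 512 input tokens and 128 output tokens on the "balanced" tier
input_tokens, output_tokens = 512, 128
cost = (input_tokens * 3.00 + output_tokens * 15.00) / 1_000_000
print(cost)  # 0.003456
```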

What to log for debugging

When debugging is needed, richer logs are invaluable, but they carry a PII obligation:

import re
import hashlib

class PIIScrubber:
    PATTERNS = [
        (re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'), '[EMAIL]'),
        (re.compile(r'\b(\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b'), '[PHONE]'),
        (re.compile(r'\b\d{3}-\d{2}-\d{4}\b'), '[SSN]'),
        (re.compile(r'\b(?:\d[ -]?){13,16}\b'), '[CC_NUMBER]'),
    ]

    def scrub(self, text: str) -> str:
        for pattern, replacement in self.PATTERNS:
            text = pattern.sub(replacement, text)
        return text

_scrubber = PIIScrubber()
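For example, applying two of the patterns above in order (restated here so the snippet runs on its own):

```python
import re

# Two of the PIIScrubber patterns, applied in sequence
patterns = [
    (re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'), '[EMAIL]'),
    (re.compile(r'\b\d{3}-\d{2}-\d{4}\b'), '[SSN]'),
]

text = "Contact jane.doe@example.com, SSN 123-45-6789."
for pattern, replacement in patterns:
    text = pattern.sub(replacement, text)
print(text)  # Contact [EMAIL], SSN [SSN].
```

Pattern order matters: broader patterns (like the credit-card digit run) should come after more specific ones so they don't clobber structured matches first.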

@dataclass
class DebugLog:
    request_id: str
    prompt_hash: str           # Hash of the full prompt β€” for dedup without storing
    prompt_scrubbed: str       # Prompt after PII scrubbing
    response_scrubbed: str     # Response after PII scrubbing
    tool_calls: list[dict]     # Tool call arguments (scrubbed)
    retrieval_chunks: list[str] # Retrieved context (scrubbed)

def build_debug_log(
    request_id: str,
    full_prompt: str,
    response_text: str,
    tool_calls: list[dict],
    retrieval_chunks: list[str],
) -> DebugLog:
    scrubbed_prompt = _scrubber.scrub(full_prompt)
    scrubbed_response = _scrubber.scrub(response_text)
    prompt_hash = hashlib.sha256(full_prompt.encode()).hexdigest()[:16]

    return DebugLog(
        request_id=request_id,
        prompt_hash=prompt_hash,
        prompt_scrubbed=scrubbed_prompt,
        response_scrubbed=scrubbed_response,
        tool_calls=[
            {k: _scrubber.scrub(str(v)) for k, v in call.items()}
            for call in tool_calls
        ],
        retrieval_chunks=[_scrubber.scrub(c) for c in retrieval_chunks],
    )

Structured logging format

All logs should be JSON lines: one JSON object per line, with consistent field names:

import json
import sys
from datetime import datetime, timezone

def emit_log(level: str, event: str, data: dict) -> None:
    """Write a single structured log line to stdout."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "event": event,
        **data,
    }
    # Write to stdout; your log aggregation system (Datadog, CloudWatch, etc.)
    # picks this up and indexes the JSON fields
    sys.stdout.write(json.dumps(record) + "\n")
    sys.stdout.flush()

# Usage
emit_log("info", "llm_request_complete", {
    "request_id": "req-abc123",
    "model_id": "balanced",
    "feature_id": "support_chat",
    "input_tokens": 512,
    "output_tokens": 128,
    "latency_ms": 847.3,
    "finish_reason": "end_turn",
    "estimated_cost_usd": 0.003456,
})

Correlation IDs across services

In a multi-service architecture, a single user request may touch multiple services. Correlation IDs let you reconstruct the full trace:

def propagate_trace_context(
    outgoing_headers: dict,
    trace_id: str,
    parent_span_id: str,
) -> dict:
    """
    Add W3C/OpenTelemetry trace context headers to an outgoing HTTP request.
    The traceparent value is "00-<trace-id>-<parent-span-id>-<flags>": the
    trace ID must be 32 lowercase hex characters and the parent span ID
    16 lowercase hex characters, with no dashes. The receiving service
    reads these headers and creates child spans under the same trace.
    """
    outgoing_headers["traceparent"] = f"00-{trace_id}-{parent_span_id}-01"
    return outgoing_headers

def extract_trace_context(incoming_headers: dict) -> tuple[str, str | None]:
    """
    Extract trace context from an incoming HTTP request.
    Returns (trace_id, parent_span_id).
    """
    traceparent = incoming_headers.get("traceparent", "")
    parts = traceparent.split("-")
    if len(parts) == 4:
        return parts[1], parts[2]
    return uuid.uuid4().hex, None  # Start a new trace if no valid context arrived
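A quick round-trip check of the header format. Note that the split on "-" only works if the IDs themselves contain no dashes, which is why the W3C format uses bare hex strings:

```python
import uuid

# One service builds the header...
trace_id = uuid.uuid4().hex        # 32 hex chars, no dashes (W3C trace ID)
span_id = uuid.uuid4().hex[:16]    # 16 hex chars (W3C span ID)
headers = {"traceparent": f"00-{trace_id}-{span_id}-01"}

# ...and the receiving service recovers the same IDs
parts = headers["traceparent"].split("-")
assert parts[1] == trace_id and parts[2] == span_id
```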

Layer 3: Deep Dive

OpenTelemetry as the vendor-neutral standard

OpenTelemetry (OTel) is the CNCF-standard framework for distributed tracing, metrics, and logging. Its key concepts map directly to LLM tracing:

Trace: the full request lifecycle (user query → final response)
Span: an individual operation (one LLM call, one tool call)
Attribute: span metadata (model_id, token counts, feature_id)
Exporter: a sink that forwards spans to a backend (Jaeger, Grafana Tempo, Langfuse)
Propagator: the mechanism for passing trace context across service boundaries

The OTel semantic conventions for LLM systems (under the gen_ai namespace) define standard attribute names so that any OTel-compatible tool can index and query LLM traces without custom configuration.

The cost of over-logging and under-logging

Both extremes create problems:

Storage cost: high when over-logging (full prompts can be kilobytes each; terabytes per month at scale); low when under-logging.
PII liability: high when over-logging (prompts often contain user data); low when under-logging.
Debugging capability: high when over-logging; low when under-logging ("what did it receive?" can't be answered).
Latency impact: low to medium when over-logging (async writes); none when under-logging.

The practical balance: log request metadata (tokens, cost, latency, model, feature) always; log full prompt/response only for a configurable sample (typically 1–10%) with PII scrubbing applied. Keep sampled debug logs for a shorter retention window (7–30 days) than metric logs (90+ days).
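That balance can be captured in a small policy config; the values here are illustrative picks from the ranges above:

```python
# Illustrative logging policy: metadata is always logged and kept longest;
# scrubbed debug payloads are sampled and expire sooner.
LOGGING_POLICY = {
    "metrics": {"sample_rate": 1.0,  "retention_days": 90},
    "debug":   {"sample_rate": 0.05, "retention_days": 30, "pii_scrubbing": True},
}
```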

import hashlib

def should_log_debug(request_id: str, sample_rate: float = 0.05) -> bool:
    """
    Deterministic sampling β€” same request always gets the same decision.
    Uses the request_id to avoid needing to store sampling state.
    """
    # Hash the request_id to a value in [0, 1)
    digest = int(hashlib.sha256(request_id.encode()).hexdigest(), 16)
    normalized = (digest % 1_000_000) / 1_000_000.0
    return normalized < sample_rate
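Because the decision is a pure function of the request ID, the realized rate converges on the configured rate across many requests. A quick empirical check (restating the hash logic so the snippet runs standalone):

```python
import hashlib

def sampled(request_id: str, rate: float) -> bool:
    # Same deterministic hash-to-[0, 1) mapping as should_log_debug above
    digest = int(hashlib.sha256(request_id.encode()).hexdigest(), 16)
    return (digest % 1_000_000) / 1_000_000.0 < rate

# With a 5% rate over 10,000 distinct IDs, roughly 500 should be selected
hits = sum(sampled(f"req-{i}", 0.05) for i in range(10_000))
print(hits)
```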

Trace storage and query patterns

Traces are typically stored in two places with different access patterns:

Time-series / metrics store (Prometheus, InfluxDB, CloudWatch Metrics): aggregated metrics derived from traces: p50/p95/p99 latency, error rates, cost per feature per hour. Fast to query; no raw data.

Trace store (Jaeger, Grafana Tempo, Langfuse): raw span trees, queryable by trace_id, model_id, feature_id, time range. Slower to query; full context for debugging.

The most useful query patterns for LLM systems:

# Find all traces where a specific model exceeded 5000ms latency
WHERE model_id = "frontier" AND latency_ms > 5000 ORDER BY latency_ms DESC LIMIT 20

# Find all traces where output_tokens exceeded the budget
WHERE output_tokens > 1000 AND feature_id = "report_generator"

# Find error traces for a specific feature in the last hour
WHERE error IS NOT NULL AND feature_id = "support_chat" AND ts > NOW() - 1h


Tracing & Structured Logging: Check your understanding

Q1

A team enables full prompt and response logging for debugging purposes. Three months later, they discover their logs contain users' medical conditions, names, and insurance numbers from prompts that included user profile data. The team now needs to retroactively scrub PII from terabytes of log data. What should they have done differently?

Q2

A support request spans three microservices: an API gateway, an LLM orchestrator, and a retrieval service. A user reports that their query returned a wrong answer. An engineer needs to reconstruct exactly which prompts were sent, which chunks were retrieved, and what the model returned. What is the minimum requirement for this to be possible?

Q3

A team logs all LLM requests but only stores token counts and latency: not the actual prompts or responses. A production incident occurs where the model starts returning responses that include inappropriate content for certain user inputs. What can the team NOT determine from their logs?

Q4

A team uses the span model for LLM tracing: one root span per request, with child spans for each LLM call, tool call, and retrieval operation. A request that involves one LLM call, two tool calls, and one retrieval operation should produce how many spans?

Q5

A team wants to continuously measure the quality of their production LLM responses without adding latency for users. What architecture enables this?