
Cost Management

LLM costs are non-linear and easy to underestimate β€” especially in multi-agent systems where one orchestration call spawns dozens of sub-calls. This module covers token economics, prompt caching, cost ceilings with graceful degradation, and the attribution infrastructure needed to run LLM workloads sustainably.

Layer 1: Surface

LLM pricing is per-token, and tokens accumulate faster than you expect.

Every character you send and receive costs money. A system prompt that’s 2,000 tokens long, sent on every single request, costs 2,000 input tokens per call before the user’s question even arrives. At scale — hundreds of thousands of calls per day — that system prompt alone can become your largest line item: at $3 per million input tokens, 2,000 tokens Γ— 200,000 calls is $1,200 a day for the prefix alone.

How billing actually works:

| Component | What you pay for | Where costs hide |
|---|---|---|
| Input tokens | Everything in the prompt: system prompt, conversation history, retrieved context | Growing context windows, verbose system prompts |
| Output tokens | Every token the model generates | Uncapped generation, verbose output formats |
| Cached tokens | Re-used prompt prefixes (where supported) | Usually 50–90% cheaper than input tokens |
| Tool calls | Each call counts as input + output | Agent loops with many tool invocations |

Multi-agent cost amplification. A single user request that triggers an orchestrator, which spawns three sub-agents, each making four tool calls, generates well over a dozen LLM calls. Budget per user request, not per LLM call.

The two levers you actually control:

  1. Reduce tokens sent β€” caching, prompt compression, retrieval truncation
  2. Reduce tokens generated — output format constraints, streaming with early stop (see the sketch below)
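
For the second lever, a minimal sketch of constraining generation with the Anthropic SDK — max_tokens caps the output spend, and the stop sequence here is purely illustrative:

import anthropic

client = anthropic.Anthropic()

# Generation halts at 300 tokens or at the stop sequence, whichever
# comes first -- the output side of the bill is now bounded
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=300,
    stop_sequences=["###END###"],
    messages=[{"role": "user", "content": "Summarise this ticket in 3 bullet points."}],
)
print(response.content[0].text)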

Production Gotcha: Teams that implement cost ceilings as hard limits without circuit-breaker logic get silent failures β€” the agent stops mid-task with no explanation when the budget is hit. Cost ceilings need graceful degradation paths, not just hard stops.


Layer 2: Guided

Token counting before you send

Never estimate token counts when you can measure them. OpenAI ships a local tokenizer (tiktoken) you can run without any network call; Anthropic exposes a dedicated count-tokens endpoint you can hit before the expensive generation request.

import anthropic
import tiktoken

client = anthropic.Anthropic()

def count_tokens_anthropic(messages: list[dict], system: str) -> int:
    # Anthropic provides a count-tokens endpoint for exact pre-flight counts
    response = client.messages.count_tokens(
        model="claude-sonnet-4-5",
        system=system,
        messages=messages,
    )
    return response.input_tokens

def count_tokens_openai(messages: list[dict], model: str = "gpt-4o") -> int:
    # tiktoken runs locally -- no API call required
    enc = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        total += 4  # rough per-message framing overhead
        total += len(enc.encode(str(msg.get("content", ""))))
    return total

# Use this before any expensive call
def maybe_truncate_context(messages: list[dict], system: str, ceiling: int = 50_000) -> list[dict]:
    token_count = count_tokens_anthropic(messages, system)
    if token_count > ceiling:
        # Drop the oldest messages; the system prompt lives outside this
        # list and the most recent turns are preserved
        messages = messages[-6:]  # keep the last 3 user/assistant turns
    return messages

Prompt caching

Prompt caching lets you mark static portions of your prompt as cacheable. On a cache hit you pay a fraction of the normal input-token price — roughly 10% on Anthropic cache reads, around 50% on OpenAI — while writing a new prefix into Anthropic’s cache carries a one-time 25% premium over the base input price. Cache lifetime is typically a few minutes to an hour, depending on the provider.

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a senior code reviewer. You enforce the following standards:
[... 3,000 tokens of style guide content ...]
"""

def review_code(code_snippet: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # mark for caching
            }
        ],
        messages=[
            {"role": "user", "content": f"Review this code:\n\n{code_snippet}"}
        ],
    )
    return response.content[0].text

# First call: full input token cost plus a cache-write premium on the prefix
# Subsequent calls within cache window: ~10% of input token cost for the system prompt

Caching pays off when the same prompt prefix is reused across many requests, and break-even comes fast: with a 25% write premium and a 90% read discount, the cached path is already cheaper by the second call within the cache window, as the sketch below shows. For a 3,000-token system prompt called 1,000 times per day, caching saves around 90% of the input cost for that prefix.
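
A back-of-envelope check, using Anthropic’s published multipliers (1.25Γ— the base input price for a cache write, 0.1Γ— for a read):

# Cumulative prefix cost after n calls, in units of the uncached prefix price
def cached_prefix_cost(n_calls: int, write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    return write_mult + (n_calls - 1) * read_mult

# Uncached cost is n_calls * 1.0
# cached_prefix_cost(2) = 1.35 < 2.0 -> caching already wins on the second call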

What to cache:

  • System prompts (always)
  • Large reference documents injected into context
  • Conversation history when doing multi-turn sessions with a long preamble

What not to cache:

  • The user’s current message (it changes every time)
  • Timestamps or other volatile content in the prompt prefix

Cost ceilings with graceful degradation

A hard limit that kills a request silently is worse than no limit. The user gets a broken experience with no explanation. Build a circuit breaker instead.

from dataclasses import dataclass
from enum import Enum

class DegradationLevel(Enum):
    FULL = "full"
    REDUCED_CONTEXT = "reduced_context"
    FAST_MODEL = "fast_model"
    CACHED_ONLY = "cached_only"  # serve from a response cache; not wired below
    DECLINED = "declined"

@dataclass
class CostBudget:
    per_request_usd: float
    daily_usd: float
    spent_today_usd: float = 0.0

def get_degradation_level(budget: CostBudget, estimated_cost_usd: float) -> DegradationLevel:
    remaining_daily = budget.daily_usd - budget.spent_today_usd

    # Check the hard stop first: placed after the 10% check below, it would
    # be shadowed and never fire
    if remaining_daily <= 0:
        return DegradationLevel.DECLINED

    if estimated_cost_usd > budget.per_request_usd:
        if estimated_cost_usd * 0.5 <= budget.per_request_usd:
            return DegradationLevel.REDUCED_CONTEXT
        return DegradationLevel.FAST_MODEL

    if remaining_daily < budget.daily_usd * 0.1:  # under 10% of daily budget left
        return DegradationLevel.FAST_MODEL

    return DegradationLevel.FULL

def handle_request_with_budget(query: str, budget: CostBudget, context: list[str]) -> str:
    estimated_cost = estimate_cost(query, context)
    level = get_degradation_level(budget, estimated_cost)

    if level == DegradationLevel.FULL:
        return call_full_model(query, context)

    elif level == DegradationLevel.REDUCED_CONTEXT:
        # Trim context to fit budget
        trimmed_context = context[:len(context)//2]
        return call_full_model(query, trimmed_context)

    elif level == DegradationLevel.FAST_MODEL:
        # Downgrade to a cheaper model
        return call_fast_model(query, context)

    # DECLINED: explicit, actionable failure message instead of a silent stop
    return "Request declined: daily budget exhausted. Resets at midnight UTC."

def estimate_cost(query: str, context: list[str], cost_per_1k_tokens: float = 0.003) -> float:
    # Input-side estimate only; ~4 characters per token is a rough heuristic
    total_chars = len(query) + sum(len(c) for c in context)
    estimated_tokens = total_chars / 4
    return (estimated_tokens / 1000) * cost_per_1k_tokens

Per-team cost attribution

In a shared platform, you need to know which team or product is spending what. Implement attribution at the gateway layer, not in individual services.

import time
import sqlite3

class CostAttributionStore:
    def __init__(self, db_path: str = "costs.db"):
        self.conn = sqlite3.connect(db_path, check_same_thread=False)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS usage (
                ts INTEGER,
                team_id TEXT,
                project_id TEXT,
                model TEXT,
                input_tokens INTEGER,
                output_tokens INTEGER,
                cached_tokens INTEGER,
                cost_usd REAL
            )
        """)

    def record(self, team_id: str, project_id: str, model: str,
               input_tokens: int, output_tokens: int,
               cached_tokens: int = 0) -> float:
        # Example pricing β€” always fetch actual rates from provider
        rates = {
            "claude-sonnet-4-5": {"input": 0.003, "output": 0.015, "cache": 0.0003},
        }
        rate = rates.get(model, {"input": 0.005, "output": 0.015, "cache": 0.0005})
        cost = (
            (input_tokens / 1000) * rate["input"]
            + (output_tokens / 1000) * rate["output"]
            + (cached_tokens / 1000) * rate["cache"]
        )
        self.conn.execute(
            "INSERT INTO usage VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            (int(time.time()), team_id, project_id, model,
             input_tokens, output_tokens, cached_tokens, cost)
        )
        self.conn.commit()
        return cost

    def daily_spend(self, team_id: str) -> float:
        # Rolling 24-hour window; swap for a calendar-day cutoff if budgets
        # reset at midnight
        day_start = int(time.time()) - 86400
        row = self.conn.execute(
            "SELECT SUM(cost_usd) FROM usage WHERE team_id = ? AND ts > ?",
            (team_id, day_start)
        ).fetchone()
        return row[0] or 0.0

This pattern integrates cleanly into an LLM gateway (see Module 5.8). Every request passes through the gateway, which records attribution and enforces per-team budgets before the upstream API call is made.
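
A minimal sketch of that wiring, reusing CostAttributionStore from above — the team names and budget numbers are placeholder assumptions:

import anthropic

store = CostAttributionStore()
client = anthropic.Anthropic()

TEAM_DAILY_BUDGET_USD = {"search": 50.0, "support": 120.0}  # placeholders

def gateway_call(team_id: str, project_id: str, system: str, messages: list[dict]) -> str:
    # Enforce the per-team budget before spending anything upstream
    if store.daily_spend(team_id) >= TEAM_DAILY_BUDGET_USD.get(team_id, 25.0):
        raise RuntimeError(f"Daily budget exhausted for team {team_id}")

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=system,
        messages=messages,
    )
    # Attribute actual usage reported by the provider, not an estimate
    store.record(
        team_id, project_id, "claude-sonnet-4-5",
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
    )
    return response.content[0].text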


Layer 3: Deep Dive

Why multi-agent billing is hard to model

Single-request cost modelling breaks down when agents spawn sub-agents. Consider a document analysis agent:

User request
└── Orchestrator (1 call, ~2,000 input tokens)
    β”œβ”€β”€ Extraction agent (5 calls Γ— 1,500 tokens = 7,500 tokens)
    β”œβ”€β”€ Validation agent (3 calls Γ— 800 tokens = 2,400 tokens)
    └── Summary agent (1 call Γ— 4,000 tokens = 4,000 tokens)

Total: ~15,900 input tokens for one user request

The orchestrator alone does not tell you the cost. Cost attribution must be tracked at the sub-agent level and rolled up to the originating user request. Implement a trace ID that propagates through all spawned calls β€” this doubles as your observability story (Module 6.4).
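
A minimal sketch of that propagation using contextvars — run_orchestrator and log_usage are placeholders for your agent runtime and logging sink:

import uuid
import contextvars

# One trace ID per user request, inherited by every nested agent call
trace_id_var: contextvars.ContextVar[str] = contextvars.ContextVar("trace_id", default="untraced")

def handle_user_request(query: str) -> str:
    trace_id_var.set(str(uuid.uuid4()))
    return run_orchestrator(query)  # placeholder: spawns sub-agents and tool calls

def record_llm_call(model: str, input_tokens: int, output_tokens: int) -> None:
    # Called after every LLM invocation, however deeply nested; costs
    # roll up to the originating request via the shared trace ID
    log_usage(trace_id_var.get(), model, input_tokens, output_tokens)  # placeholder sink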

Failure taxonomy

Type 1: Silent ceiling hit. The agent stops when it hits a budget ceiling, but the user receives no message explaining what happened or what to do. Common in naive rate-limiting implementations.

Type 2: Context growth runaway. Multi-turn conversations where history is never trimmed. Each turn sends the full prior conversation. A 50-turn session with 1,000 tokens per turn accumulates 50,000 tokens of history by the final turn, most of it irrelevant.

Type 3: Tool call explosion. An agent in a retry loop calls the same tool dozens of times when it fails. Without per-tool-call cost tracking, this looks like a single expensive request rather than a bug.

Type 4: Cache invalidation surprise. A minor formatting change to a system prompt (a trailing space, a version number in a comment) breaks cache affinity and doubles input costs overnight without any other change. A cheap guard is sketched after this list.

Type 5: Model routing misconfiguration. A staging environment routed to a premium model gets left behind when a feature ships. Production calls start hitting the expensive model unintentionally.
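
For Type 4, that guard is to fingerprint the rendered prompt prefix at deploy time, so a cache-hit-rate drop can be traced to a prefix change before it shows up on the invoice. A minimal sketch:

import hashlib

def prefix_fingerprint(system_prompt: str) -> str:
    # Any byte-level change -- a trailing space, a bumped version string --
    # yields a new fingerprint, which is exactly what breaks cache affinity
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:12]

# Log this with every deploy; alert when it changes unexpectedly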

Cost forecasting

Model cost growth as a function of usage, not just current spend. The key inputs:

  • p95 token count per request type β€” measure this per feature, not globally
  • Request volume growth rate β€” weekly average, smoothed
  • Cache hit rate — a drop from 70% to 40% is a meaningful cost event

The sketch below compounds weekly volume growth over a 30-day window:
def forecast_monthly_cost(
    daily_requests: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    cache_hit_rate: float,
    input_price_per_1k: float,
    output_price_per_1k: float,
    cache_price_per_1k: float,
    growth_rate_weekly: float = 0.05
) -> dict:
    days = 30
    total_cost = 0.0
    for day in range(days):
        week = day // 7
        volume = daily_requests * ((1 + growth_rate_weekly) ** week)
        cached = volume * cache_hit_rate
        uncached = volume - cached

        # Simplification: a cache hit prices the whole input at the cache
        # rate, and the one-time cache-write premium is ignored
        day_cost = (
            (uncached * avg_input_tokens / 1000) * input_price_per_1k
            + (cached * avg_input_tokens / 1000) * cache_price_per_1k
            + (volume * avg_output_tokens / 1000) * output_price_per_1k
        )
        total_cost += day_cost

    return {
        "monthly_usd": round(total_cost, 2),
        "daily_avg_usd": round(total_cost / days, 2),
    }
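
For instance, with Sonnet-class list prices and illustrative volume figures:

print(forecast_monthly_cost(
    daily_requests=20_000,
    avg_input_tokens=3_000,
    avg_output_tokens=400,
    cache_hit_rate=0.65,
    input_price_per_1k=0.003,
    output_price_per_1k=0.015,
    cache_price_per_1k=0.0003,
))
# -> monthly total and daily average, compounding 5% weekly volume growth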

Further reading

  • Anthropic. Prompt caching documentation. 2024. Canonical reference for cache_control semantics and cache lifetime behaviour.
  • OpenAI. Prompt caching guide. 2024. OpenAI’s implementation; comparison useful for understanding provider-specific differences.
  • Karpathy, Andrej. Tokenization. 2024. Comprehensive breakdown of how tokenizers work β€” essential for understanding why character-to-token ratios vary and why certain prompt patterns are expensive.

Cost Management β€” Check your understanding

Q1

Your team deploys a customer support agent. After one week in production, the bill is five times higher than your staging estimate. Staging and production use the same model and system prompt. What is the most likely cause?

Q2

You enable prompt caching on a 2,500-token system prompt. After a week, your costs have not decreased at all. What is the most likely explanation?

Q3

You implement a hard daily spending limit: when the team's budget is exhausted, all API calls return an error. Users start reporting that tasks complete 80% of the way through and then silently stop with no message. What should you change?

Q4

Your platform serves five product teams sharing one LLM API key. At the end of the month, the invoice is $40,000 but no one knows which team spent what. How do you fix this going forward?

Q5

You are forecasting LLM costs for next quarter. Your current cache hit rate is 68%. You plan to add a feature that injects a per-user personalisation block into the system prompt. What cost impact should you anticipate?