
Cost Management

LLM costs are non-linear and easy to underestimate β€” especially in multi-agent systems where one orchestration call spawns dozens of sub-calls. This module covers token economics, prompt caching, cost ceilings with graceful degradation, and the attribution infrastructure needed to run LLM workloads sustainably.

Layer 1: Surface

LLM pricing is per-token, and tokens accumulate faster than you expect.

Every character you send and receive costs money. A system prompt that’s 2,000 tokens long, sent on every single request, costs 2,000 input tokens per call before the user’s question even arrives. At scale — hundreds of thousands of calls per day — that system prompt alone can become your largest line item: at $3 per million input tokens, 2,000 tokens Γ— 200,000 calls is $1,200 a day for the prefix alone.

How billing actually works:

| Component | What you pay for | Where costs hide |
|---|---|---|
| Input tokens | Everything in the prompt: system prompt, conversation history, retrieved context | Growing context windows, verbose system prompts |
| Output tokens | Every token the model generates | Uncapped generation, verbose output formats |
| Cached tokens | Re-used prompt prefixes (where supported) | Usually 50–90% cheaper than input tokens |
| Tool calls | Each call counts as input + output | Agent loops with many tool invocations |

Multi-agent cost amplification. A single user request that triggers an orchestrator, which spawns three sub-agents, each making four tool calls, generates well over a dozen LLM calls. Budget per user request, not per LLM call.

The two levers you actually control:

  1. Reduce tokens sent β€” caching, prompt compression, retrieval truncation
  2. Reduce tokens generated — output format constraints, streaming with early stop (see the sketch below)
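
For the second lever, a minimal sketch of constraining generation with the Anthropic SDK — max_tokens caps the output spend, and the stop sequence here is purely illustrative:

import anthropic

client = anthropic.Anthropic()

# Generation halts at 300 tokens or at the stop sequence, whichever
# comes first -- the output side of the bill is now bounded
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=300,
    stop_sequences=["###END###"],
    messages=[{"role": "user", "content": "Summarise this ticket in 3 bullet points."}],
)
print(response.content[0].text)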

Production Gotcha: Teams that implement cost ceilings as hard limits without circuit-breaker logic get silent failures β€” the agent stops mid-task with no explanation when the budget is hit. Cost ceilings need graceful degradation paths, not just hard stops.


Layer 2: Guided

Token counting before you send

Never estimate token counts when you can measure them. OpenAI ships a local tokenizer (tiktoken) you can run without any network call; Anthropic exposes a dedicated count-tokens endpoint you can hit before the expensive generation request.

import anthropic
import tiktoken

client = anthropic.Anthropic()

def count_tokens_anthropic(messages: list[dict], system: str) -> int:
    # Anthropic provides a count-tokens endpoint for exact pre-flight counts
    response = client.messages.count_tokens(
        model="claude-sonnet-4-5",
        system=system,
        messages=messages,
    )
    return response.input_tokens

def count_tokens_openai(messages: list[dict], model: str = "gpt-4o") -> int:
    # tiktoken runs locally -- no API call required
    enc = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        total += 4  # rough per-message framing overhead
        total += len(enc.encode(str(msg.get("content", ""))))
    return total

# Use this before any expensive call
def maybe_truncate_context(messages: list[dict], system: str, ceiling: int = 50_000) -> list[dict]:
    token_count = count_tokens_anthropic(messages, system)
    if token_count > ceiling:
        # Drop the oldest messages; the system prompt lives outside this
        # list and the most recent turns are preserved
        messages = messages[-6:]  # keep the last 3 user/assistant turns
    return messages

Prompt caching

Prompt caching lets you mark static portions of your prompt as cacheable. On a cache hit you pay a fraction of the normal input-token price — roughly 10% on Anthropic cache reads, around 50% on OpenAI — while writing a new prefix into Anthropic’s cache carries a one-time 25% premium over the base input price. Cache lifetime is typically a few minutes to an hour, depending on the provider.

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a senior code reviewer. You enforce the following standards:
[... 3,000 tokens of style guide content ...]
"""

def review_code(code_snippet: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # mark for caching
            }
        ],
        messages=[
            {"role": "user", "content": f"Review this code:\n\n{code_snippet}"}
        ],
    )
    return response.content[0].text

# First call: full input token cost plus a cache-write premium on the prefix
# Subsequent calls within cache window: ~10% of input token cost for the system prompt

Caching pays off when the same prompt prefix is reused across many requests, and break-even comes fast: with a 25% write premium and a 90% read discount, the cached path is already cheaper by the second call within the cache window, as the sketch below shows. For a 3,000-token system prompt called 1,000 times per day, caching saves around 90% of the input cost for that prefix.
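
A back-of-envelope check, using Anthropic’s published multipliers (1.25Γ— the base input price for a cache write, 0.1Γ— for a read):

# Cumulative prefix cost after n calls, in units of the uncached prefix price
def cached_prefix_cost(n_calls: int, write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    return write_mult + (n_calls - 1) * read_mult

# Uncached cost is n_calls * 1.0
# cached_prefix_cost(2) = 1.35 < 2.0 -> caching already wins on the second call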

What to cache:

  • System prompts (always)
  • Large reference documents injected into context
  • Conversation history when doing multi-turn sessions with a long preamble

What not to cache:

  • The user’s current message (it changes every time)
  • Timestamps or other volatile content in the prompt prefix

Cost ceilings with graceful degradation

A hard limit that kills a request silently is worse than no limit. The user gets a broken experience with no explanation. Build a circuit breaker instead.

from dataclasses import dataclass
from enum import Enum

class DegradationLevel(Enum):
    FULL = "full"
    REDUCED_CONTEXT = "reduced_context"
    FAST_MODEL = "fast_model"
    CACHED_ONLY = "cached_only"  # serve from a response cache; not wired below
    DECLINED = "declined"

@dataclass
class CostBudget:
    per_request_usd: float
    daily_usd: float
    spent_today_usd: float = 0.0

def get_degradation_level(budget: CostBudget, estimated_cost_usd: float) -> DegradationLevel:
    remaining_daily = budget.daily_usd - budget.spent_today_usd

    # Check the hard stop first: placed after the 10% check below, it would
    # be shadowed and never fire
    if remaining_daily <= 0:
        return DegradationLevel.DECLINED

    if estimated_cost_usd > budget.per_request_usd:
        if estimated_cost_usd * 0.5 <= budget.per_request_usd:
            return DegradationLevel.REDUCED_CONTEXT
        return DegradationLevel.FAST_MODEL

    if remaining_daily < budget.daily_usd * 0.1:  # under 10% of daily budget left
        return DegradationLevel.FAST_MODEL

    return DegradationLevel.FULL

def handle_request_with_budget(query: str, budget: CostBudget, context: list[str]) -> str:
    estimated_cost = estimate_cost(query, context)
    level = get_degradation_level(budget, estimated_cost)

    if level == DegradationLevel.FULL:
        return call_full_model(query, context)

    elif level == DegradationLevel.REDUCED_CONTEXT:
        # Trim context to fit budget
        trimmed_context = context[:len(context)//2]
        return call_full_model(query, trimmed_context)

    elif level == DegradationLevel.FAST_MODEL:
        # Downgrade to a cheaper model
        return call_fast_model(query, context)

    # DECLINED: explicit, actionable failure message instead of a silent stop
    return "Request declined: daily budget exhausted. Resets at midnight UTC."

def estimate_cost(query: str, context: list[str], cost_per_1k_tokens: float = 0.003) -> float:
    # Input-side estimate only; ~4 characters per token is a rough heuristic
    total_chars = len(query) + sum(len(c) for c in context)
    estimated_tokens = total_chars / 4
    return (estimated_tokens / 1000) * cost_per_1k_tokens

Per-team cost attribution

In a shared platform, you need to know which team or product is spending what. Implement attribution at the gateway layer, not in individual services.

import time
import sqlite3

class CostAttributionStore:
    def __init__(self, db_path: str = "costs.db"):
        self.conn = sqlite3.connect(db_path, check_same_thread=False)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS usage (
                ts INTEGER,
                team_id TEXT,
                project_id TEXT,
                model TEXT,
                input_tokens INTEGER,
                output_tokens INTEGER,
                cached_tokens INTEGER,
                cost_usd REAL
            )
        """)

    def record(self, team_id: str, project_id: str, model: str,
               input_tokens: int, output_tokens: int,
               cached_tokens: int = 0) -> float:
        # Example pricing β€” always fetch actual rates from provider
        rates = {
            "claude-sonnet-4-5": {"input": 0.003, "output": 0.015, "cache": 0.0003},
        }
        rate = rates.get(model, {"input": 0.005, "output": 0.015, "cache": 0.0005})
        cost = (
            (input_tokens / 1000) * rate["input"]
            + (output_tokens / 1000) * rate["output"]
            + (cached_tokens / 1000) * rate["cache"]
        )
        self.conn.execute(
            "INSERT INTO usage VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            (int(time.time()), team_id, project_id, model,
             input_tokens, output_tokens, cached_tokens, cost)
        )
        self.conn.commit()
        return cost

    def daily_spend(self, team_id: str) -> float:
        # Rolling 24-hour window; swap for a calendar-day cutoff if budgets
        # reset at midnight
        day_start = int(time.time()) - 86400
        row = self.conn.execute(
            "SELECT SUM(cost_usd) FROM usage WHERE team_id = ? AND ts > ?",
            (team_id, day_start)
        ).fetchone()
        return row[0] or 0.0

This pattern integrates cleanly into an LLM gateway (see Module 5.8). Every request passes through the gateway, which records attribution and enforces per-team budgets before the upstream API call is made.
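
A minimal sketch of that wiring, reusing CostAttributionStore from above — the team names and budget numbers are placeholder assumptions:

import anthropic

store = CostAttributionStore()
client = anthropic.Anthropic()

TEAM_DAILY_BUDGET_USD = {"search": 50.0, "support": 120.0}  # placeholders

def gateway_call(team_id: str, project_id: str, system: str, messages: list[dict]) -> str:
    # Enforce the per-team budget before spending anything upstream
    if store.daily_spend(team_id) >= TEAM_DAILY_BUDGET_USD.get(team_id, 25.0):
        raise RuntimeError(f"Daily budget exhausted for team {team_id}")

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=system,
        messages=messages,
    )
    # Attribute actual usage reported by the provider, not an estimate
    store.record(
        team_id, project_id, "claude-sonnet-4-5",
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
    )
    return response.content[0].text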


Layer 3: Deep Dive

Why multi-agent billing is hard to model

Single-request cost modelling breaks down when agents spawn sub-agents. Consider a document analysis agent:

User request
└── Orchestrator (1 call, ~2,000 input tokens)
    β”œβ”€β”€ Extraction agent (5 calls Γ— 1,500 tokens = 7,500 tokens)
    β”œβ”€β”€ Validation agent (3 calls Γ— 800 tokens = 2,400 tokens)
    └── Summary agent (1 call Γ— 4,000 tokens = 4,000 tokens)

Total: ~15,900 input tokens for one user request

The orchestrator alone does not tell you the cost. Cost attribution must be tracked at the sub-agent level and rolled up to the originating user request. Implement a trace ID that propagates through all spawned calls β€” this doubles as your observability story (Module 6.4).
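
A minimal sketch of that propagation using contextvars — run_orchestrator and log_usage are placeholders for your agent runtime and logging sink:

import uuid
import contextvars

# One trace ID per user request, inherited by every nested agent call
trace_id_var: contextvars.ContextVar[str] = contextvars.ContextVar("trace_id", default="untraced")

def handle_user_request(query: str) -> str:
    trace_id_var.set(str(uuid.uuid4()))
    return run_orchestrator(query)  # placeholder: spawns sub-agents and tool calls

def record_llm_call(model: str, input_tokens: int, output_tokens: int) -> None:
    # Called after every LLM invocation, however deeply nested; costs
    # roll up to the originating request via the shared trace ID
    log_usage(trace_id_var.get(), model, input_tokens, output_tokens)  # placeholder sink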

Failure taxonomy

Type 1: Silent ceiling hit. The agent stops when it hits a budget ceiling, but the user receives no message explaining what happened or what to do. Common in naive rate-limiting implementations.

Type 2: Context growth runaway. Multi-turn conversations where history is never trimmed. Each turn sends the full prior conversation. A 50-turn session with 1,000 tokens per turn accumulates 50,000 tokens of history by the final turn, most of it irrelevant.

Type 3: Tool call explosion. An agent in a retry loop calls the same tool dozens of times when it fails. Without per-tool-call cost tracking, this looks like a single expensive request rather than a bug.

Type 4: Cache invalidation surprise. A minor formatting change to a system prompt (a trailing space, a version number in a comment) breaks cache affinity and doubles input costs overnight without any other change. A cheap guard is sketched after this list.

Type 5: Model routing misconfiguration. A staging environment routed to a premium model gets left behind when a feature ships. Production calls start hitting the expensive model unintentionally.
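
For Type 4, that guard is to fingerprint the rendered prompt prefix at deploy time, so a cache-hit-rate drop can be traced to a prefix change before it shows up on the invoice. A minimal sketch:

import hashlib

def prefix_fingerprint(system_prompt: str) -> str:
    # Any byte-level change -- a trailing space, a bumped version string --
    # yields a new fingerprint, which is exactly what breaks cache affinity
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:12]

# Log this with every deploy; alert when it changes unexpectedly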

Cost forecasting

Model cost growth as a function of usage, not just current spend. The key inputs:

  • p95 token count per request type β€” measure this per feature, not globally
  • Request volume growth rate β€” weekly average, smoothed
  • Cache hit rate — a drop from 70% to 40% is a meaningful cost event

The sketch below compounds weekly volume growth over a 30-day window:
def forecast_monthly_cost(
    daily_requests: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    cache_hit_rate: float,
    input_price_per_1k: float,
    output_price_per_1k: float,
    cache_price_per_1k: float,
    growth_rate_weekly: float = 0.05
) -> dict:
    days = 30
    total_cost = 0.0
    for day in range(days):
        week = day // 7
        volume = daily_requests * ((1 + growth_rate_weekly) ** week)
        cached = volume * cache_hit_rate
        uncached = volume - cached

        # Simplification: a cache hit prices the whole input at the cache
        # rate, and the one-time cache-write premium is ignored
        day_cost = (
            (uncached * avg_input_tokens / 1000) * input_price_per_1k
            + (cached * avg_input_tokens / 1000) * cache_price_per_1k
            + (volume * avg_output_tokens / 1000) * output_price_per_1k
        )
        total_cost += day_cost

    return {
        "monthly_usd": round(total_cost, 2),
        "daily_avg_usd": round(total_cost / days, 2),
    }
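
For instance, with Sonnet-class list prices and illustrative volume figures:

print(forecast_monthly_cost(
    daily_requests=20_000,
    avg_input_tokens=3_000,
    avg_output_tokens=400,
    cache_hit_rate=0.65,
    input_price_per_1k=0.003,
    output_price_per_1k=0.015,
    cache_price_per_1k=0.0003,
))
# -> monthly total and daily average, compounding 5% weekly volume growth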

Further reading

  • Anthropic. Prompt caching documentation. 2024. Canonical reference for cache_control semantics and cache lifetime behaviour.
  • OpenAI. Prompt caching guide. 2024. OpenAI’s implementation; comparison useful for understanding provider-specific differences.
  • Karpathy, Andrej. Tokenization. 2024. Comprehensive breakdown of how tokenizers work β€” essential for understanding why character-to-token ratios vary and why certain prompt patterns are expensive.

Cost Management β€” Check your understanding

Q1

Your team deploys a customer support agent. After one week in production, the bill is five times higher than your staging estimate. Staging and production use the same model and system prompt. What is the most likely cause?

Q2

You enable prompt caching on a 2,500-token system prompt. After a week, your costs have not decreased at all. What is the most likely explanation?

Q3

You implement a hard daily spending limit: when the team's budget is exhausted, all API calls return an error. Users start reporting that tasks complete 80% of the way through and then silently stop with no message. What should you change?

Q4

Your platform serves five product teams sharing one LLM API key. At the end of the month, the invoice is $40,000 but no one knows which team spent what. How do you fix this going forward?

Q5

You are forecasting LLM costs for next quarter. Your current cache hit rate is 68%. You plan to add a feature that injects a per-user personalisation block into the system prompt. What cost impact should you anticipate?