Layer 1: Surface
LLM pricing is per-token, and tokens accumulate faster than you expect.
Every token you send and receive costs money. A system prompt that's 2,000 tokens long, sent on every single request, costs 2,000 input tokens per call before the user's question even arrives. At scale (hundreds of thousands of calls per day) that system prompt alone can become your largest line item.
How billing actually works:
| Component | What you pay for | Where costs hide |
|---|---|---|
| Input tokens | Everything in the prompt: system prompt, conversation history, retrieved context | Growing context windows, verbose system prompts |
| Output tokens | Every token the model generates | Uncapped generation, verbose output formats |
| Cached tokens | Re-used prompt prefixes (where supported) | Cache misses when the prefix changes; hits are usually 50-90% cheaper than input tokens |
| Tool calls | Each call counts as input + output | Agent loops with many tool invocations |
Multi-agent cost amplification. A single user request that triggers an orchestrator, which spawns three sub-agents, each making four tool calls: that's one user request that generates a dozen or more LLM calls. Budget per user request, not per LLM call.
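One way to make "budget per user request" concrete is a single budget object threaded through every spawned call. A minimal sketch (the `RequestBudget` class and the token numbers are illustrative, not from any SDK):

```python
class BudgetExceeded(Exception):
    """Raised when a request's cumulative token budget is exhausted."""

class RequestBudget:
    """Tracks cumulative token spend across every LLM call a request triggers."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.spent = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        self.spent += input_tokens + output_tokens
        if self.spent > self.max_tokens:
            raise BudgetExceeded(
                f"request budget exhausted: {self.spent}/{self.max_tokens} tokens"
            )

# The orchestrator creates one budget and hands it to every sub-agent call.
budget = RequestBudget(max_tokens=20_000)
budget.charge(2_000, 500)    # orchestrator call
budget.charge(7_500, 1_200)  # extraction sub-agent
print(budget.spent)          # 11200
```

Because every sub-agent charges the same object, the ceiling applies to the whole request tree rather than to any single call.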
The two levers you actually control:
- Reduce tokens sent: caching, prompt compression, retrieval truncation
- Reduce tokens generated: output format constraints, streaming with early stop
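Both levers can be applied before the request ever leaves your service. A rough sketch, assuming a simple chars-to-tokens approximation and an arbitrary context budget (`build_request` and its defaults are illustrative):

```python
def build_request(system: str, history: list[dict], retrieved: list[str],
                  max_context_tokens: int = 8_000) -> dict:
    """Assemble a request body with both cost levers applied up front."""
    # Lever 1: reduce tokens sent -- keep only as much retrieved context as fits.
    context, used = [], 0
    for chunk in retrieved:
        chunk_tokens = len(chunk) // 4  # rough char-to-token ratio
        if used + chunk_tokens > max_context_tokens:
            break
        context.append(chunk)
        used += chunk_tokens

    return {
        "system": system,
        "messages": history + [{"role": "user", "content": "\n\n".join(context)}],
        # Lever 2: reduce tokens generated -- hard cap on output length.
        "max_tokens": 512,
    }
```

The exact knobs differ per provider, but every major API accepts some form of output cap, and context truncation is entirely under your control.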
Production Gotcha: Teams that implement cost ceilings as hard limits without circuit-breaker logic get silent failures: the agent stops mid-task with no explanation when the budget is hit. Cost ceilings need graceful degradation paths, not just hard stops.
Layer 2: Guided
Token counting before you send
Never estimate token counts. Measure them. Every major provider SDK exposes a tokenizer you can call locally, before the API request.
```python
import anthropic
import tiktoken

client = anthropic.Anthropic()

def count_tokens_anthropic(messages: list[dict], system: str) -> int:
    response = client.messages.count_tokens(
        model="claude-sonnet-4-5",
        system=system,
        messages=messages,
    )
    return response.input_tokens

def count_tokens_openai(messages: list[dict], model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        total += 4  # approximate per-message overhead
        total += len(enc.encode(str(msg.get("content", ""))))
    return total

# Use this before any expensive call
def maybe_truncate_context(messages: list[dict], system: str, ceiling: int = 50_000) -> list[dict]:
    token_count = count_tokens_anthropic(messages, system)
    if token_count > ceiling:
        # Drop oldest messages (the system prompt is passed separately and survives)
        messages = messages[-6:]  # keep the last 3 user/assistant turns
    return messages
```
Prompt caching
Prompt caching lets you mark static portions of your prompt as cacheable. On a cache hit, you pay 10-25% of the normal input token price (varies by provider). The cache lifetime is typically a few minutes to an hour, depending on the provider.
```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a senior code reviewer. You enforce the following standards:
[... 3,000 tokens of style guide content ...]
"""

def review_code(code_snippet: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # mark for caching
            }
        ],
        messages=[
            {"role": "user", "content": f"Review this code:\n\n{code_snippet}"}
        ],
    )
    return response.content[0].text

# First call: full input token cost
# Subsequent calls within cache window: ~10% of input token cost for the system prompt
```
Caching pays off when the same prompt prefix is reused across many requests. The break-even is roughly 4β5 calls. For a 3,000-token system prompt called 1,000 times per day, caching saves around 90% of the input cost for that prefix.
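You can sanity-check the break-even for your own numbers. A sketch assuming a one-time cache-write surcharge and a flat hit discount (both multipliers are illustrative defaults, not official rates; check your provider's current pricing):

```python
def caching_savings(prefix_tokens: int, calls: int,
                    input_price_per_1k: float,
                    cache_write_multiplier: float = 1.25,
                    cache_hit_multiplier: float = 0.10) -> dict:
    """Compare the cost of a cached prompt prefix against sending it uncached."""
    base = (prefix_tokens / 1000) * input_price_per_1k
    without = base * calls
    # First call writes the cache (often at a surcharge); the rest hit it.
    with_cache = base * cache_write_multiplier + base * cache_hit_multiplier * (calls - 1)
    return {
        "without_usd": round(without, 4),
        "with_usd": round(with_cache, 4),
        "savings_pct": round(100 * (1 - with_cache / without), 1),
    }

print(caching_savings(prefix_tokens=3_000, calls=1_000, input_price_per_1k=0.003))
```

With the example numbers above (a 3,000-token prefix called 1,000 times per day at $0.003 per 1k input tokens), this works out to roughly the 90% prefix saving quoted.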
What to cache:
- System prompts (always)
- Large reference documents injected into context
- Conversation history when doing multi-turn sessions with a long preamble
What not to cache:
- The user's current message (it changes every time)
- Timestamps or other volatile content in the prompt prefix
Cost ceilings with graceful degradation
A hard limit that kills a request silently is worse than no limit. The user gets a broken experience with no explanation. Build a circuit breaker instead.
```python
from dataclasses import dataclass
from enum import Enum

class DegradationLevel(Enum):
    FULL = "full"
    REDUCED_CONTEXT = "reduced_context"
    FAST_MODEL = "fast_model"
    CACHED_ONLY = "cached_only"  # serve from cached answers only (not exercised below)
    DECLINED = "declined"

@dataclass
class CostBudget:
    per_request_usd: float
    daily_usd: float
    spent_today_usd: float = 0.0

def get_degradation_level(budget: CostBudget, estimated_cost_usd: float) -> DegradationLevel:
    remaining_daily = budget.daily_usd - budget.spent_today_usd
    # Check exhaustion before the softer thresholds, or DECLINED is unreachable
    if remaining_daily <= 0:
        return DegradationLevel.DECLINED
    if estimated_cost_usd > budget.per_request_usd:
        if estimated_cost_usd * 0.5 <= budget.per_request_usd:
            return DegradationLevel.REDUCED_CONTEXT
        return DegradationLevel.FAST_MODEL
    if remaining_daily < budget.daily_usd * 0.1:  # under 10% of daily budget left
        return DegradationLevel.FAST_MODEL
    return DegradationLevel.FULL

def handle_request_with_budget(query: str, budget: CostBudget, context: list[str]) -> str:
    estimated_cost = estimate_cost(query, context)
    level = get_degradation_level(budget, estimated_cost)
    if level == DegradationLevel.FULL:
        return call_full_model(query, context)
    if level == DegradationLevel.REDUCED_CONTEXT:
        # Trim context to fit budget
        trimmed_context = context[:len(context) // 2]
        return call_full_model(query, trimmed_context)
    if level == DegradationLevel.FAST_MODEL:
        # Downgrade to a cheaper model
        return call_fast_model(query, context)
    # DECLINED: explicit, actionable failure message
    return "Request declined: daily budget exhausted. Resets at midnight UTC."

def estimate_cost(query: str, context: list[str], cost_per_1k_tokens: float = 0.003) -> float:
    total_chars = len(query) + sum(len(c) for c in context)
    estimated_tokens = total_chars / 4  # rough char-to-token ratio
    return (estimated_tokens / 1000) * cost_per_1k_tokens
```
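Plugging representative numbers into that estimator, a 2,000-character query with 20,000 characters of retrieved context works out as follows (using the same rough 4-chars-per-token ratio):

```python
# 2,000-char query plus 20,000 chars of retrieved context, at ~4 chars/token:
estimated_tokens = (2_000 + 20_000) / 4              # 5,500 tokens
estimated_cost = (estimated_tokens / 1000) * 0.003   # $0.0165 at $0.003 per 1k tokens
print(f"${estimated_cost:.4f}")
```

Cheap per request, but at 100,000 requests per day that is $1,650/day before any output tokens, which is why the degradation thresholds above are worth tuning.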
Per-team cost attribution
In a shared platform, you need to know which team or product is spending what. Implement attribution at the gateway layer, not in individual services.
```python
import time
import sqlite3

class CostAttributionStore:
    def __init__(self, db_path: str = "costs.db"):
        self.conn = sqlite3.connect(db_path, check_same_thread=False)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS usage (
                ts INTEGER,
                team_id TEXT,
                project_id TEXT,
                model TEXT,
                input_tokens INTEGER,
                output_tokens INTEGER,
                cached_tokens INTEGER,
                cost_usd REAL
            )
        """)

    def record(self, team_id: str, project_id: str, model: str,
               input_tokens: int, output_tokens: int,
               cached_tokens: int = 0) -> float:
        # Example pricing -- always fetch actual rates from your provider
        rates = {
            "claude-sonnet-4-5": {"input": 0.003, "output": 0.015, "cache": 0.0003},
        }
        rate = rates.get(model, {"input": 0.005, "output": 0.015, "cache": 0.0005})
        cost = (
            (input_tokens / 1000) * rate["input"]
            + (output_tokens / 1000) * rate["output"]
            + (cached_tokens / 1000) * rate["cache"]
        )
        self.conn.execute(
            "INSERT INTO usage VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            (int(time.time()), team_id, project_id, model,
             input_tokens, output_tokens, cached_tokens, cost)
        )
        self.conn.commit()
        return cost

    def daily_spend(self, team_id: str) -> float:
        # Rolling 24-hour window, not calendar day
        day_start = int(time.time()) - 86400
        row = self.conn.execute(
            "SELECT SUM(cost_usd) FROM usage WHERE team_id = ? AND ts > ?",
            (team_id, day_start)
        ).fetchone()
        return row[0] or 0.0
```
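With all usage in one table, per-team rollups are a single query. A self-contained sketch against the same schema, using an in-memory database and made-up rows:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE usage (ts INTEGER, team_id TEXT, project_id TEXT,
                model TEXT, input_tokens INTEGER, output_tokens INTEGER,
                cached_tokens INTEGER, cost_usd REAL)""")
now = int(time.time())
rows = [  # illustrative team/project names and costs
    (now, "search", "ranker", "claude-sonnet-4-5", 1200, 300, 0, 0.0081),
    (now, "search", "ranker", "claude-sonnet-4-5", 900, 200, 0, 0.0057),
    (now, "support", "triage", "claude-sonnet-4-5", 2000, 500, 0, 0.0135),
]
conn.executemany("INSERT INTO usage VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Spend per team over the last 24 hours, highest first.
for team, spend in conn.execute(
        "SELECT team_id, SUM(cost_usd) FROM usage "
        "WHERE ts > ? GROUP BY team_id ORDER BY 2 DESC", (now - 86400,)):
    print(team, round(spend, 4))
```

The same query, grouped by `project_id` or `model` instead, gives the other dashboard cuts for free.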
This pattern integrates cleanly into an LLM gateway (see Module 5.8). Every request passes through the gateway, which records attribution and enforces per-team budgets before the upstream API call is made.
Layer 3: Deep Dive
Why multi-agent billing is hard to model
Single-request cost modelling breaks down when agents spawn sub-agents. Consider a document analysis agent:
```
User request
├── Orchestrator (1 call, ~2,000 input tokens)
├── Extraction agent (5 calls × 1,500 tokens = 7,500 tokens)
├── Validation agent (3 calls × 800 tokens = 2,400 tokens)
└── Summary agent (1 call × 4,000 tokens = 4,000 tokens)

Total: ~15,900 input tokens for one user request
```
The orchestrator alone does not tell you the cost. Cost attribution must be tracked at the sub-agent level and rolled up to the originating user request. Implement a trace ID that propagates through all spawned calls; this doubles as your observability story (Module 6.4).
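Within a single process, Python's `contextvars` can carry that trace ID so every nested agent call logs against the same request; in a distributed system the ID would travel in request headers instead. A minimal in-process sketch with illustrative token counts:

```python
import uuid
from contextvars import ContextVar

# One trace ID per user request, visible to every nested agent call.
trace_id: ContextVar[str] = ContextVar("trace_id")

usage_log: list[dict] = []

def record_call(agent: str, input_tokens: int, output_tokens: int) -> None:
    usage_log.append({
        "trace_id": trace_id.get(),  # inherited from the enclosing request
        "agent": agent,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    })

def handle_user_request() -> str:
    tid = uuid.uuid4().hex
    trace_id.set(tid)
    record_call("orchestrator", 2_000, 400)  # sub-agents log under the same ID
    record_call("extraction", 7_500, 1_100)
    record_call("validation", 2_400, 300)
    return tid

tid = handle_user_request()
total = sum(r["input_tokens"] for r in usage_log if r["trace_id"] == tid)
print(total)  # 11900
```

Rolling up by `trace_id` then gives per-user-request cost directly, regardless of how many sub-agents the request fanned out to.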
Failure taxonomy
Type 1: Silent ceiling hit. The agent stops when it hits a budget ceiling, but the user receives no message explaining what happened or what to do. Common in naive rate-limiting implementations.
Type 2: Context growth runaway. Multi-turn conversations where history is never trimmed. Each turn sends the full prior conversation. A 50-turn session with 1,000 tokens per turn accumulates 50,000 tokens of history by the final turn, most of it irrelevant.
Type 3: Tool call explosion. An agent in a retry loop calls the same tool dozens of times when it fails. Without per-tool-call cost tracking, this looks like a single expensive request rather than a bug.
Type 4: Cache invalidation surprise. A minor formatting change to a system prompt (a trailing space, a version number in a comment) breaks cache affinity and doubles input costs overnight without any other change.
Type 5: Model routing misconfiguration. A staging environment routed to a premium model gets left behind when a feature ships. Production calls start hitting the expensive model unintentionally.
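A cheap defence against Type 4 is to fingerprint the cacheable prefix and fail CI when it changes unexpectedly. A sketch (the prompt text and the pinned `EXPECTED` value are illustrative):

```python
import hashlib

def prefix_fingerprint(system_prompt: str) -> str:
    """Fingerprint the cacheable prefix so accidental edits are caught before deploy."""
    return hashlib.sha256(system_prompt.encode()).hexdigest()[:16]

# Pin the expected fingerprint; a mismatch means the cache will go cold on deploy.
EXPECTED = prefix_fingerprint("You are a senior code reviewer. ...")

def check_prefix(system_prompt: str) -> bool:
    return prefix_fingerprint(system_prompt) == EXPECTED

print(check_prefix("You are a senior code reviewer. ..."))   # True
print(check_prefix("You are a senior code reviewer. ...\n")) # False: one trailing newline
```

An intentional prompt change then requires updating the pinned fingerprint, which turns silent cache invalidation into an explicit review step.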
Cost forecasting
Model cost growth as a function of usage, not just current spend. The key inputs:
- p95 token count per request type: measure this per feature, not globally
- Request volume growth rate: weekly average, smoothed
- Cache hit rate: a drop from 70% to 40% is a meaningful cost event
```python
def forecast_monthly_cost(
    daily_requests: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    cache_hit_rate: float,
    input_price_per_1k: float,
    output_price_per_1k: float,
    cache_price_per_1k: float,
    growth_rate_weekly: float = 0.05
) -> dict:
    days = 30
    total_cost = 0.0
    for day in range(days):
        week = day // 7
        volume = daily_requests * ((1 + growth_rate_weekly) ** week)
        cached = volume * cache_hit_rate
        uncached = volume - cached
        day_cost = (
            (uncached * avg_input_tokens / 1000) * input_price_per_1k
            + (cached * avg_input_tokens / 1000) * cache_price_per_1k
            + (volume * avg_output_tokens / 1000) * output_price_per_1k
        )
        total_cost += day_cost
    return {
        "monthly_usd": round(total_cost, 2),
        "daily_avg_usd": round(total_cost / days, 2),
    }
```
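For intuition, here is the day-0 arithmetic for 10,000 requests/day at 3,000 input and 500 output tokens per request with a 70% cache hit rate (the prices are the same illustrative Sonnet-class rates used earlier, not official pricing):

```python
volume, hit_rate = 10_000, 0.7
avg_in, avg_out = 3_000, 500
in_price, out_price, cache_price = 0.003, 0.015, 0.0003  # per 1k tokens, illustrative

cached, uncached = volume * hit_rate, volume * (1 - hit_rate)
day_cost = (
    (uncached * avg_in / 1000) * in_price      # uncached input: $27.00
    + (cached * avg_in / 1000) * cache_price   # cached input:    $6.30
    + (volume * avg_out / 1000) * out_price    # all output:     $75.00
)
print(round(day_cost, 2))
```

Note that output tokens dominate the bill despite being a sixth of the input volume: they are priced 5× higher and never benefit from caching, which is why output caps are the second lever from Layer 1.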
Further reading
- Anthropic. Prompt caching documentation. 2024. Canonical reference for cache_control semantics and cache lifetime behaviour.
- OpenAI. Prompt caching guide. 2024. OpenAI's implementation; the comparison is useful for understanding provider-specific differences.
- Karpathy, Andrej. Tokenization. 2024. Comprehensive breakdown of how tokenizers work; essential for understanding why character-to-token ratios vary and why certain prompt patterns are expensive.