Layer 1: Surface
Models are not interchangeable. Every model sits somewhere on three axes: capability, cost, and latency. Picking the right model is one of the highest-leverage decisions in an AI system. Not because cheaper is always better, but because over-capability is real: using a frontier model for a task a smaller one handles just as well is both slower and more expensive.
The rough hierarchy that all the major providers offer:
| Tier | When to use | Examples |
|---|---|---|
| Frontier | Complex reasoning, multi-step planning, novel tasks with no clear template | claude-opus-4-6, gpt-4.5, gemini-ultra |
| Balanced | Production workloads where quality and cost both matter | claude-sonnet-4-6, gpt-4o, gemini-1.5-pro |
| Fast / small | High-volume, low-complexity tasks: classification, routing, extraction | claude-haiku-4-5, gpt-4o-mini, gemini-1.5-flash |
A practical starting point: prototype with the frontier model, then work downward until quality degrades past acceptable. You often find the balanced tier handles 80–90% of tasks at a fraction of the cost.
Production Gotcha
Model aliases move when providers ship improvements. `gpt-4o`, `claude-sonnet-4-6`, and similar names are convenient but unpredictable in production: a silent update can change your output distribution overnight. Pin versioned model IDs in production and CI; discover changes on your schedule, not when a user reports degraded output.
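One low-effort way to enforce pinning is to route every call through a single config module, so a model upgrade becomes a one-line, code-reviewed change. A minimal sketch; the IDs below are placeholders to verify against your provider's models page:

```python
# Hypothetical config module: every model ID lives here, nowhere else.
# Prefer dated (pinned) IDs in production; aliases shown only as placeholders.
MODELS = {
    "frontier": "claude-opus-4-6",          # replace with the dated ID before deploying
    "balanced": "claude-sonnet-4-6",        # replace with the dated ID before deploying
    "fast": "claude-haiku-4-5-20251001",    # dated suffix = pinned release
}

def model_id(tier: str) -> str:
    """Resolve a tier name to its pinned model ID."""
    return MODELS[tier]
```

Application code then asks for `model_id("fast")` rather than hard-coding strings, and CI can assert that every production entry carries a date suffix.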
Layer 2: Guided
The three axes
Capability is what the model can reliably do: follow complex multi-step instructions, reason about novel problems, produce consistent structured output, handle long documents. Frontier models are better at all of these. But for well-defined tasks with clear templates, a smaller model can perform at parity.
Cost scales with model size. Fast-tier models can be 10–20× cheaper per token than frontier-tier. For high-volume features (every page load, every API request), that difference is the difference between a feature that’s economically viable and one that isn’t.
Latency correlates with model size. Smaller models return tokens faster. For interactive features (chat, autocomplete, real-time suggestions) latency matters as much as cost. A frontier model that takes 8 seconds to respond is often worse UX than a balanced model that responds in 2 seconds with slightly lower quality.
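Latency differences between tiers are easy to measure directly rather than assume. A minimal timing sketch; `fn` stands in for whatever client call your provider SDK exposes:

```python
import time

def timed(fn, *args, **kwargs):
    """Wall-clock a single call; returns (result, seconds elapsed)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

Run the same prompt many times per tier and compare p50/p95 rather than single samples; tail latency is what users actually feel.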
Routing by task type
A common production pattern routes requests to different model tiers based on complexity:
```python
# --- pseudocode ---
def classify_intent(user_message: str) -> str:
    # Fast, cheap classifier — use the small model for this
    response = llm.chat(
        model="fast",
        system=(
            "Classify the user message. Reply with only the label.\n"
            "Categories: SIMPLE_FAQ, COMPLEX_ANALYSIS, DATA_EXTRACTION"
        ),
        messages=[{"role": "user", "content": user_message}],
        max_tokens=16,
    )
    return response.text.strip()

def answer(user_message: str) -> str:
    intent = classify_intent(user_message)
    model = {
        "COMPLEX_ANALYSIS": "frontier",  # hard tasks need the big model
        "DATA_EXTRACTION": "balanced",   # structured work — balanced is fine
    }.get(intent, "fast")                # simple FAQs — stay cheap
    response = llm.chat(
        model=model,
        messages=[{"role": "user", "content": user_message}],
        max_tokens=1024,
    )
    return response.text
```
In practice, with the Anthropic SDK:
```python
import anthropic

client = anthropic.Anthropic()

def classify_intent(user_message: str) -> str:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=16,
        system=(
            "Classify the user message into one of these categories. "
            "Reply with only the category label.\n\n"
            "Categories: SIMPLE_FAQ, COMPLEX_ANALYSIS, DATA_EXTRACTION"
        ),
        messages=[{"role": "user", "content": user_message}],
    )
    return response.content[0].text.strip()

def answer(user_message: str) -> str:
    intent = classify_intent(user_message)
    if intent == "COMPLEX_ANALYSIS":
        model = "claude-opus-4-6"    # pin to versioned ID in production
    elif intent == "DATA_EXTRACTION":
        model = "claude-sonnet-4-6"  # pin to versioned ID in production
    else:
        model = "claude-haiku-4-5-20251001"
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": user_message}],
    )
    return response.content[0].text
```
The classifier call itself costs almost nothing; the savings on routing simple requests away from expensive models compound over millions of calls. The same pattern works with OpenAI (gpt-4o-mini → gpt-4o → gpt-4.5) or any provider that offers multiple tiers.
Evaluating model suitability
Don’t guess: measure. Build a small evaluation set (50–200 representative inputs with expected outputs) and score each model tier against it. The right questions are:
- Does the output meet the quality bar at this tier?
- What is the cost per 1,000 requests at this tier?
- What is the p50 / p95 latency at this tier?
A model that achieves 95% of the quality at 15% of the cost is almost always the right choice for a production feature, especially at scale.
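The three questions above can be answered by one small harness per tier. A sketch, assuming you have already collected per-request outputs, reference answers, latencies, and a per-request cost; exact-match scoring is used here, and you would substitute a task-specific scorer for real work:

```python
import statistics

def evaluate_tier(outputs, expected, latencies_s, cost_per_request):
    """Score one model tier on an eval set.

    outputs / expected: model answers vs. reference answers (exact match).
    latencies_s: per-request latencies in seconds (needs at least 2 samples).
    cost_per_request: dollars per request at this tier.
    """
    quality = sum(o == e for o, e in zip(outputs, expected)) / len(expected)
    cuts = statistics.quantiles(latencies_s, n=100)  # 99 percentile cut points
    return {
        "quality": quality,
        "cost_per_1k": cost_per_request * 1000,
        "p50_s": cuts[49],
        "p95_s": cuts[94],
    }
```

Running this for each tier over the same 50–200 inputs turns "which model?" into a three-number comparison instead of a debate.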
Before vs After
Over-capability: using frontier for simple routing:
```python
# BAD: paying frontier prices to classify support tickets into 5 buckets
response = llm.chat(
    model="frontier",  # massive overkill for a classification task
    messages=[{"role": "user", "content": ticket_text}],
    system="Classify this ticket: BUG, BILLING, FEATURE, ACCOUNT, OTHER",
    max_tokens=8,
)
```
Right-sized: fast model for classification:
```python
# GOOD: same task, 10–20x cheaper, faster response
response = llm.chat(
    model="fast",
    messages=[{"role": "user", "content": ticket_text}],
    system="Classify this ticket: BUG, BILLING, FEATURE, ACCOUNT, OTHER",
    max_tokens=8,
)
```
Common mistakes
- Using frontier for everything: Simplest to start with, but costly and slow at scale. Graduate tasks to smaller models once you have an eval.
- Using floating model aliases in production: Aliases like `gpt-4o` or `claude-sonnet-4-6` can update silently. Pin versioned IDs.
- Comparing models without an eval set: Vibes-based model selection. Build even a simple eval before committing to a model for a production feature.
- Ignoring output token costs: Smaller models are cheaper on input; the ratio differs on output. Check both.
- One model for all tasks in an application: Most non-trivial applications have multiple task types with different quality requirements. A single model choice is usually a compromise in the wrong direction for at least one task.
Layer 3: Deep Dive
How capability differences manifest
Frontier models outperform smaller models most visibly on:
- Instruction-following on complex, multi-constraint prompts: When the system prompt has many rules and the request has unusual edge cases, smaller models drop constraints more often.
- Long-context coherence: Over long documents (100K+ tokens), frontier models maintain more consistent reasoning and miss fewer details.
- Novel task generalisation: For tasks outside the common distribution of training data, frontier models hallucinate less and request clarification more reliably.
- Code generation on complex tasks: Multi-file refactors, algorithmic problems, language interop. Smaller models produce more plausible-but-wrong code.
For well-defined, high-volume tasks with clear templates (classification, extraction, summarisation against a fixed schema), the quality gaps narrow substantially.
Context window differences
Different models within the same provider family may have different context windows. Always check current documentation: context limits change with model updates. For tasks requiring very long context (full codebase analysis, long document Q&A), verify the target model’s limit before committing to an architecture.
Versioning and stability
Model providers use two naming conventions:
- Alias (e.g. `claude-opus-4-6`, `gpt-4o`, `gemini-1.5-pro`): always points to the current recommended version. Updates when the provider ships improvements. Simple to use; unpredictable in production.
- Pinned version (e.g. `claude-haiku-4-5-20251001`, `gpt-4o-2024-11-20`): fixed to a specific release. Output distribution only changes when you change it. Required for production. Check your provider's models page for the current versioned IDs.
A robust model management strategy:
- Use pinned versions in production and CI
- Subscribe to your provider’s changelog and model deprecation notices
- When a new version ships, run your eval set against it before promoting
- Treat model upgrades as software deployments: tested, staged, rollback-capable
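The "run your eval set before promoting" step reduces to a simple gate. A sketch, where `tolerance` is an assumed quality budget you set per feature:

```python
def should_promote(current_score: float, candidate_score: float,
                   tolerance: float = 0.01) -> bool:
    """Gate a model upgrade: promote the candidate version only if its
    eval score is no more than `tolerance` below the current pinned one."""
    return candidate_score >= current_score - tolerance
```

Wired into CI, this makes a model upgrade fail the build the same way a regression in any other dependency would.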
The cost of capability at scale
To make the economics concrete, here is a rough comparison using Anthropic's March 2026 pricing. Other providers are in a similar range, and the ratio between tiers matters more than the absolute numbers:
| Model tier | Input cost (Anthropic, March 2026) | 1M requests × 500 tokens |
|---|---|---|
| Frontier | $5 / MTok | ~$2,500 |
| Balanced | $3 / MTok | ~$1,500 |
| Fast | $1 / MTok | ~$500 |
Always verify pricing directly with your provider: rates change. The key takeaway is the 5× gap between frontier and fast tiers, which applies across most providers.
At 1M requests per day (a moderate consumer product), the difference between fast and frontier for a classification task is ~$730K annually. Model selection is an engineering decision with real P&L impact, whatever provider you use.
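The arithmetic behind these figures, as a quick sanity check (illustrative prices from the table above, input tokens only):

```python
def annual_cost(price_per_mtok: float, requests_per_day: int,
                tokens_per_request: int) -> float:
    """Annual input-token cost in dollars (output tokens excluded)."""
    mtok_per_day = requests_per_day * tokens_per_request / 1_000_000
    return price_per_mtok * mtok_per_day * 365

frontier = annual_cost(5.0, 1_000_000, 500)  # 500 MTok/day at $5/MTok
fast = annual_cost(1.0, 1_000_000, 500)      # same volume at $1/MTok
savings = frontier - fast                    # ≈ $730K per year
```

The same function makes it easy to re-run the comparison whenever your provider's prices or your traffic profile change.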
Fine-tuning vs prompt engineering vs model selection
When a model underperforms on your task, there are three levers:
| Lever | When to reach for it | Cost and complexity |
|---|---|---|
| Better prompt | Almost always: try this first | Cheapest; often sufficient |
| Different model tier | When the task is genuinely hard or genuinely simple | Medium; requires eval |
| Fine-tuning | When you have hundreds of curated examples and need consistent style or domain knowledge the base model lacks | Highest; significant data and ops investment |
Fine-tuning is often reached for too early. A well-constructed prompt with few-shot examples frequently matches fine-tuned performance on classification and extraction tasks, with none of the data collection or retraining overhead.
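As an illustration of the "better prompt first" lever: a hypothetical few-shot classification prompt (the labels and examples here are invented, not from any real system):

```python
# Hypothetical few-shot prompt: for label-style tasks this is often
# competitive with fine-tuning, with zero data-collection or training cost.
FEW_SHOT_SYSTEM = """Classify the support ticket. Reply with only the label.
Labels: BUG, BILLING, FEATURE, ACCOUNT, OTHER

Examples:
Ticket: "I was charged twice this month" -> BILLING
Ticket: "The export button crashes the app" -> BUG
Ticket: "Please add dark mode" -> FEATURE"""

def build_messages(ticket_text: str) -> list:
    """Format a ticket in the same shape as the few-shot examples."""
    return [{"role": "user", "content": f'Ticket: "{ticket_text}" ->'}]
```

Pair `FEW_SHOT_SYSTEM` with a fast-tier model and score it on your eval set before concluding that fine-tuning is necessary.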
Further reading
- Scaling Laws for Neural Language Models; Kaplan et al., 2020. The foundational paper on how capability scales with model size, data, and compute.
- Chinchilla: Training Compute-Optimal Large Language Models; Hoffmann et al., 2022. Revised the scaling laws and influenced how modern models are trained.
- Model overviews: each provider maintains a current list of models, versioned IDs, and context windows: Anthropic · OpenAI · Google.