
Models and Model Selection

Not every task needs the most capable model. Understanding the capability-cost-latency tradeoff lets you pick the right model for each job, and avoid paying frontier prices for work a smaller model handles just as well.

Layer 1: Surface

Models are not interchangeable. Every model sits somewhere on three axes: capability, cost, and latency. Picking the right model is one of the highest-leverage decisions in an AI system. Not because cheaper is always better, but because over-capability is real: using a frontier model for a task a smaller one handles just as well is both slower and more expensive.

The rough hierarchy all the major providers offer:

| Tier | When to use | Examples |
|------|-------------|----------|
| Frontier | Complex reasoning, multi-step planning, novel tasks with no clear template | claude-opus-4-6, gpt-4.5, gemini-ultra |
| Balanced | Production workloads where quality and cost both matter | claude-sonnet-4-6, gpt-4o, gemini-1.5-pro |
| Fast / small | High-volume, low-complexity tasks: classification, routing, extraction | claude-haiku-4-5, gpt-4o-mini, gemini-1.5-flash |

A practical starting point: prototype with the frontier model, then work downward until quality drops below acceptable. You will often find the balanced tier handles 80–90% of tasks at a fraction of the cost.

Production Gotcha

Model aliases move when providers ship improvements. gpt-4o, claude-sonnet-4-6, and similar names are convenient but unpredictable in production: a silent update can change your output distribution overnight. Pin versioned model IDs in production and CI so you discover changes on your schedule, not when a user reports degraded output.


Layer 2: Guided

The three axes

Capability is what the model can reliably do: follow complex multi-step instructions, reason about novel problems, produce consistent structured output, handle long documents. Frontier models are better at all of these. But for well-defined tasks with clear templates, a smaller model can perform at parity.

Cost scales with model size. Fast-tier models can be 10–20× cheaper per token than frontier-tier. For high-volume features (every page load, every API request), that difference is the difference between a feature that’s economically viable and one that isn’t.

Latency correlates with model size. Smaller models return tokens faster. For interactive features (chat, autocomplete, real-time suggestions) latency matters as much as cost. A frontier model that takes 8 seconds to respond is often worse UX than a balanced model that responds in 2 seconds with slightly lower quality.

Routing by task type

A common production pattern routes requests to different model tiers based on complexity:

# --- pseudocode ---
def classify_intent(user_message: str) -> str:
    # Fast, cheap classifier — use the small model for this
    response = llm.chat(
        model="fast",
        system="Classify the user message. Reply with only the label.\nCategories: SIMPLE_FAQ, COMPLEX_ANALYSIS, DATA_EXTRACTION",
        messages=[{"role": "user", "content": user_message}],
        max_tokens=16,
    )
    return response.text.strip()

def answer(user_message: str) -> str:
    intent = classify_intent(user_message)

    model = {
        "COMPLEX_ANALYSIS": "frontier",   # hard tasks need the big model
        "DATA_EXTRACTION":  "balanced",   # structured work — balanced is fine
    }.get(intent, "fast")                 # simple FAQs — stay cheap

    response = llm.chat(model=model, messages=[{"role": "user", "content": user_message}], max_tokens=1024)
    return response.text

In practice, the same pattern with the Anthropic SDK:

import anthropic

client = anthropic.Anthropic()

def classify_intent(user_message: str) -> str:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=16,
        system="Classify the user message into one of these categories. Reply with only the category label.\n\nCategories: SIMPLE_FAQ, COMPLEX_ANALYSIS, DATA_EXTRACTION",
        messages=[{"role": "user", "content": user_message}]
    )
    return response.content[0].text.strip()

def answer(user_message: str) -> str:
    intent = classify_intent(user_message)

    if intent == "COMPLEX_ANALYSIS":
        model = "claude-opus-4-6"            # pin to versioned ID in production
    elif intent == "DATA_EXTRACTION":
        model = "claude-sonnet-4-6"          # pin to versioned ID in production
    else:
        model = "claude-haiku-4-5-20251001"

    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": user_message}]
    )
    return response.content[0].text

The classifier call itself costs almost nothing; the savings on routing simple requests away from expensive models compound over millions of calls. The same pattern works with OpenAI (gpt-4o-mini, gpt-4o, gpt-4.5) or any provider that offers multiple tiers.

Evaluating model suitability

Don’t guess: measure. Build a small evaluation set (50–200 representative inputs with expected outputs) and score each model tier against it. The right questions are:

  1. Does the output meet the quality bar at this tier?
  2. What is the cost per 1,000 requests at this tier?
  3. What is the p50 / p95 latency at this tier?

A model that achieves 95% of the quality at 15% of the cost is almost always the right choice for a production feature, especially at scale.
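One lightweight way to run such an eval is to collect each tier's outputs, latencies, and per-call costs, then score them offline. A minimal sketch; the dict shapes and field names here are illustrative, not any provider's API:

```python
import statistics

def score_eval(cases, results):
    """Score one model tier against an eval set.

    cases:   [{"input": ..., "expected": ...}, ...]
    results: [{"output": ..., "latency_s": ..., "cost_usd": ...}, ...]
             one entry per case, produced by calling the tier under test
    """
    correct = sum(
        1 for case, res in zip(cases, results)
        if res["output"].strip() == case["expected"]
    )
    latencies = sorted(r["latency_s"] for r in results)
    # Nearest-rank p95; for small eval sets this is a coarse estimate.
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return {
        "accuracy": correct / len(cases),
        "cost_per_1k": 1000 * sum(r["cost_usd"] for r in results) / len(results),
        "p50_latency_s": statistics.median(latencies),
        "p95_latency_s": latencies[p95_index],
    }
```

Running this once per tier over the same 50–200 cases gives you the three numbers above side by side, instead of a vibes-based comparison.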

Before vs After

Over-capability: using frontier for simple routing:

# BAD: Paying frontier prices to classify support tickets into 5 buckets
response = llm.chat(
    model="frontier",   # massive overkill for a classification task
    messages=[{"role": "user", "content": ticket_text}],
    system="Classify this ticket: BUG, BILLING, FEATURE, ACCOUNT, OTHER",
    max_tokens=8,
)

Right-sized: fast model for classification:

# GOOD: Same task, 10–20x cheaper, faster response
response = llm.chat(
    model="fast",
    messages=[{"role": "user", "content": ticket_text}],
    system="Classify this ticket: BUG, BILLING, FEATURE, ACCOUNT, OTHER",
    max_tokens=8,
)

Common mistakes

  1. Using frontier for everything: Simplest to start with, but costly and slow at scale. Graduate tasks to smaller models once you have an eval.
  2. Using floating model aliases in production: Aliases like gpt-4o or claude-sonnet-4-6 can update silently. Pin versioned IDs.
  3. Comparing models without an eval set: Vibes-based model selection. Build even a simple eval before committing to a model for a production feature.
  4. Ignoring output token costs: Quoted comparisons often cover input rates only, but output tokens are typically priced higher and the input/output ratio varies by model. Check both rates.
  5. One model for all tasks in an application: Most non-trivial applications have multiple task types with different quality requirements. A single model choice is usually a compromise in the wrong direction for at least one task.

Layer 3: Deep Dive

How capability differences manifest

Frontier models outperform smaller models most visibly on:

  • Instruction-following on complex, multi-constraint prompts: When the system prompt has many rules and the request has unusual edge cases, smaller models drop constraints more often.
  • Long-context coherence: Over long documents (100K+ tokens), frontier models maintain more consistent reasoning and miss fewer details.
  • Novel task generalisation: For tasks outside the common distribution of training data, frontier models hallucinate less and request clarification more reliably.
  • Code generation on complex tasks: Multi-file refactors, algorithmic problems, language interop. Smaller models produce more plausible-but-wrong code.

For well-defined, high-volume tasks with clear templates (classification, extraction, summarisation against a fixed schema), the quality gap narrows substantially.

Context window differences

Different models within the same provider family may have different context windows. Always check current documentation: context limits change with model updates. For tasks requiring very long context (full codebase analysis, long document Q&A), verify the target model’s limit before committing to an architecture.
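A cheap pre-flight guard can catch oversized inputs before an API call fails. This sketch uses the rough heuristic of ~4 characters per token for English prose; real counts vary by tokenizer, so use your provider's token-counting API when accuracy matters:

```python
def fits_in_context(text: str, context_window_tokens: int,
                    reserved_output_tokens: int = 1024) -> bool:
    """Rough pre-flight check before sending a long document to a model."""
    # Heuristic: ~4 characters per token for English prose. This is a
    # coarse estimate, not a tokenizer.
    estimated_input_tokens = len(text) // 4
    return estimated_input_tokens + reserved_output_tokens <= context_window_tokens
```

The `reserved_output_tokens` margin matters: the context window bounds input plus output, so a document that "just fits" leaves no room for the response.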

Versioning and stability

Model providers use two naming conventions:

  • Alias (e.g. claude-opus-4-6, gpt-4o, gemini-1.5-pro): always points to the current recommended version and updates when the provider ships improvements. Simple to use; unpredictable in production.
  • Pinned version (e.g. claude-haiku-4-5-20251001, gpt-4o-2024-11-20): fixed to a specific release, so the output distribution only changes when you change it. Required for production. Check your provider's models page for the current versioned IDs.

A robust model management strategy:

  1. Use pinned versions in production and CI
  2. Subscribe to your provider’s changelog and model deprecation notices
  3. When a new version ships, run your eval set against it before promoting
  4. Treat model upgrades as software deployments: tested, staged, rollback-capable
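One way to implement steps 1 and 4 is to centralise the tier-to-version mapping, so a model upgrade is a single reviewable change. A minimal sketch; the opus and sonnet date suffixes below are invented placeholders (only the haiku ID comes from the examples above), so substitute your provider's current versioned IDs:

```python
PINNED_MODELS = {
    "frontier": "claude-opus-4-6-20260101",    # hypothetical pinned ID
    "balanced": "claude-sonnet-4-6-20251101",  # hypothetical pinned ID
    "fast": "claude-haiku-4-5-20251001",
}

def model_for(tier: str) -> str:
    """Resolve a tier name to its pinned model ID; fail loudly on typos."""
    try:
        return PINNED_MODELS[tier]
    except KeyError:
        raise ValueError(f"unknown model tier: {tier!r}") from None
```

Promoting a new model version then means editing one dictionary entry in a pull request, after the eval set has passed against it.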

The cost of capability at scale

To make the economics concrete, here’s a rough comparison using Anthropic’s March 2026 pricing. Other providers are in a similar range, and the ratio between tiers matters more than the absolute numbers:

| Model tier | Input cost (Anthropic, March 2026) | 1M requests × 500 tokens |
|------------|------------------------------------|--------------------------|
| Frontier | $5 / MTok | ~$2,500 |
| Balanced | $3 / MTok | ~$1,500 |
| Fast | $1 / MTok | ~$500 |

Always verify pricing directly with your provider: rates change. The key takeaway is the 5× gap between frontier and fast tiers, which applies across most providers.

At 1M requests per day (a moderate consumer product), the difference between fast and frontier for a classification task is ~$730K annually. Model selection is an engineering decision with real P&L impact, whatever provider you use.
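That annual figure follows from straightforward arithmetic, which is worth sanity-checking against your provider's current rates (the per-MTok prices here are the illustrative table values, not a quote):

```python
def annual_cost_usd(requests_per_day: int, tokens_per_request: int,
                    usd_per_mtok: float) -> float:
    """Annual spend for a fixed-size workload at a given per-MTok rate."""
    mtok_per_day = requests_per_day * tokens_per_request / 1_000_000
    return mtok_per_day * usd_per_mtok * 365

frontier = annual_cost_usd(1_000_000, 500, 5.0)  # 912_500.0
fast = annual_cost_usd(1_000_000, 500, 1.0)      # 182_500.0
savings = frontier - fast                        # 730_000.0
```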

Fine-tuning vs prompt engineering vs model selection

When a model underperforms on your task, there are three levers:

| Lever | When to reach for it | Cost and complexity |
|-------|----------------------|---------------------|
| Better prompt | Almost always: try this first | Cheapest; often sufficient |
| Different model tier | When the task is genuinely hard or genuinely simple | Medium; requires an eval |
| Fine-tuning | When you have hundreds of curated examples and need consistent style or domain knowledge the base model lacks | Highest; significant data and ops investment |

Fine-tuning is often reached for too early. A well-constructed prompt with few-shot examples frequently matches fine-tuned performance on classification and extraction tasks, with none of the data collection or retraining overhead.
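As an illustration, a few-shot prompt for the ticket-classification task from earlier can be assembled in plain code before any fine-tuning is considered. The example tickets and labels below are invented for the sketch:

```python
# Example tickets and labels are invented for illustration.
FEW_SHOT_EXAMPLES = [
    ("I was charged twice this month", "BILLING"),
    ("The export button crashes the app", "BUG"),
    ("Please add a dark mode", "FEATURE"),
]

def build_classification_prompt(ticket_text: str) -> str:
    """Assemble a few-shot classification prompt for a small/fast model."""
    lines = [
        "Classify the ticket: BUG, BILLING, FEATURE, ACCOUNT, OTHER.",
        "Reply with only the label.",
        "",
    ]
    for example_text, label in FEW_SHOT_EXAMPLES:
        lines += [f"Ticket: {example_text}", f"Label: {label}", ""]
    lines += [f"Ticket: {ticket_text}", "Label:"]
    return "\n".join(lines)
```

If a fast model with a prompt like this meets the quality bar on your eval set, the fine-tuning investment is unnecessary.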


Models and Model Selection: Check your understanding

Q1

Your application classifies incoming support tickets into 5 categories. It processes 2 million tickets per day. Which model selection strategy is most appropriate?

Q2

You deploy a feature using a floating model alias, for example 'gpt-4o' or 'claude-opus-4-6', without pinning a specific version. Three months later, users report the output style has changed. What most likely happened?

Q3

Which task is most likely to show a meaningful quality gap between a frontier model and a fast/small model?

Q4

A model performs poorly on your classification task. In what order should you try these interventions?

Q5

What is the primary purpose of building an evaluation set before selecting a model for a production feature?