Layer 1: Surface
Models are not interchangeable. Every model sits somewhere on three axes: capability, cost, and latency. Picking the right model is one of the highest-leverage decisions in an AI system. Not because cheaper is always better, but because over-capability is real: using a frontier model for a task a smaller one handles just as well is both slower and more expensive.
The rough hierarchy that all the major providers offer:
| Tier | When to use | Examples |
|---|---|---|
| Frontier | Complex reasoning, multi-step planning, novel tasks with no clear template | claude-opus-4-6, gpt-4.5, gemini-ultra |
| Balanced | Production workloads where quality and cost both matter | claude-sonnet-4-6, gpt-4o, gemini-1.5-pro |
| Fast / small | High-volume, low-complexity tasks: classification, routing, extraction | claude-haiku-4-5, gpt-4o-mini, gemini-1.5-flash |
A practical starting point: prototype with the frontier model, then work downward until quality degrades past acceptable. You often find the balanced tier handles 80–90% of tasks at a fraction of the cost.
Production Gotcha
Model aliases move when providers ship improvements. `gpt-4o`, `claude-sonnet-4-6`, and similar names are convenient but unpredictable in production: a silent update can change your output distribution overnight. Pin versioned model IDs in production and CI; discover changes on your schedule, not when a user reports degraded output.
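One low-effort way to enforce pinning is to route every call through a single config module, so a model upgrade becomes a one-line, code-reviewed change. A minimal sketch; the IDs below are placeholders to verify against your provider's models page:

```python
# Hypothetical config module: every model ID lives here, nowhere else.
# Prefer dated (pinned) IDs in production; aliases shown only as placeholders.
MODELS = {
    "frontier": "claude-opus-4-6",          # replace with the dated ID before deploying
    "balanced": "claude-sonnet-4-6",        # replace with the dated ID before deploying
    "fast": "claude-haiku-4-5-20251001",    # dated suffix = pinned release
}

def model_id(tier: str) -> str:
    """Resolve a tier name to its pinned model ID."""
    return MODELS[tier]
```

Application code then asks for `model_id("fast")` rather than hard-coding strings, and CI can assert that every production entry carries a date suffix.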
Layer 2: Guided
The three axes
Capability is what the model can reliably do: follow complex multi-step instructions, reason about novel problems, produce consistent structured output, handle long documents. Frontier models are better at all of these. But for well-defined tasks with clear templates, a smaller model can perform at parity.
Cost scales with model size. Fast-tier models can be 10–20× cheaper per token than frontier-tier. For high-volume features (every page load, every API request), that difference is the difference between a feature that’s economically viable and one that isn’t.
Latency correlates with model size. Smaller models return tokens faster. For interactive features (chat, autocomplete, real-time suggestions) latency matters as much as cost. A frontier model that takes 8 seconds to respond is often worse UX than a balanced model that responds in 2 seconds with slightly lower quality.
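Latency differences between tiers are easy to measure directly rather than assume. A minimal timing sketch; `fn` stands in for whatever client call your provider SDK exposes:

```python
import time

def timed(fn, *args, **kwargs):
    """Wall-clock a single call; returns (result, seconds elapsed)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

Run the same prompt many times per tier and compare p50/p95 rather than single samples; tail latency is what users actually feel.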
Routing by task type
A common production pattern routes requests to different model tiers based on complexity:
```python
# --- pseudocode ---
def classify_intent(user_message: str) -> str:
    # Fast, cheap classifier — use the small model for this
    response = llm.chat(
        model="fast",
        system=(
            "Classify the user message. Reply with only the label.\n"
            "Categories: SIMPLE_FAQ, COMPLEX_ANALYSIS, DATA_EXTRACTION"
        ),
        messages=[{"role": "user", "content": user_message}],
        max_tokens=16,
    )
    return response.text.strip()

def answer(user_message: str) -> str:
    intent = classify_intent(user_message)
    model = {
        "COMPLEX_ANALYSIS": "frontier",  # hard tasks need the big model
        "DATA_EXTRACTION": "balanced",   # structured work — balanced is fine
    }.get(intent, "fast")                # simple FAQs — stay cheap
    response = llm.chat(
        model=model,
        messages=[{"role": "user", "content": user_message}],
        max_tokens=1024,
    )
    return response.text
```
In practice, with the Anthropic SDK:
```python
import anthropic

client = anthropic.Anthropic()

def classify_intent(user_message: str) -> str:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=16,
        system=(
            "Classify the user message into one of these categories. "
            "Reply with only the category label.\n\n"
            "Categories: SIMPLE_FAQ, COMPLEX_ANALYSIS, DATA_EXTRACTION"
        ),
        messages=[{"role": "user", "content": user_message}],
    )
    return response.content[0].text.strip()

def answer(user_message: str) -> str:
    intent = classify_intent(user_message)
    if intent == "COMPLEX_ANALYSIS":
        model = "claude-opus-4-6"    # pin to versioned ID in production
    elif intent == "DATA_EXTRACTION":
        model = "claude-sonnet-4-6"  # pin to versioned ID in production
    else:
        model = "claude-haiku-4-5-20251001"
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": user_message}],
    )
    return response.content[0].text
```
The classifier call itself costs almost nothing; the savings on routing simple requests away from expensive models compound over millions of calls. The same pattern works with OpenAI (gpt-4o-mini → gpt-4o → gpt-4.5) or any provider that offers multiple tiers.
Evaluating model suitability
Don’t guess: measure. Build a small evaluation set (50–200 representative inputs with expected outputs) and score each model tier against it. The right questions are:
- Does the output meet the quality bar at this tier?
- What is the cost per 1,000 requests at this tier?
- What is the p50 / p95 latency at this tier?
A model that achieves 95% of the quality at 15% of the cost is almost always the right choice for a production feature, especially at scale.
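The three questions above can be answered by one small harness per tier. A sketch, assuming you have already collected per-request outputs, reference answers, latencies, and a per-request cost; exact-match scoring is used here, and you would substitute a task-specific scorer for real work:

```python
import statistics

def evaluate_tier(outputs, expected, latencies_s, cost_per_request):
    """Score one model tier on an eval set.

    outputs / expected: model answers vs. reference answers (exact match).
    latencies_s: per-request latencies in seconds (needs at least 2 samples).
    cost_per_request: dollars per request at this tier.
    """
    quality = sum(o == e for o, e in zip(outputs, expected)) / len(expected)
    cuts = statistics.quantiles(latencies_s, n=100)  # 99 percentile cut points
    return {
        "quality": quality,
        "cost_per_1k": cost_per_request * 1000,
        "p50_s": cuts[49],
        "p95_s": cuts[94],
    }
```

Running this for each tier over the same 50–200 inputs turns "which model?" into a three-number comparison instead of a debate.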
Before vs After
Over-capability: using frontier for simple routing:
```python
# BAD: paying frontier prices to classify support tickets into 5 buckets
response = llm.chat(
    model="frontier",  # massive overkill for a classification task
    messages=[{"role": "user", "content": ticket_text}],
    system="Classify this ticket: BUG, BILLING, FEATURE, ACCOUNT, OTHER",
    max_tokens=8,
)
```
Right-sized: fast model for classification:
```python
# GOOD: same task, 10–20x cheaper, faster response
response = llm.chat(
    model="fast",
    messages=[{"role": "user", "content": ticket_text}],
    system="Classify this ticket: BUG, BILLING, FEATURE, ACCOUNT, OTHER",
    max_tokens=8,
)
```
Common mistakes
- Using frontier for everything: Simplest to start with, but costly and slow at scale. Graduate tasks to smaller models once you have an eval.
- Using floating model aliases in production: Aliases like `gpt-4o` or `claude-sonnet-4-6` can update silently. Pin versioned IDs.
- Comparing models without an eval set: Vibes-based model selection. Build even a simple eval before committing to a model for a production feature.
- Ignoring output token costs: Smaller models are cheaper on input; the ratio differs on output. Check both.
- One model for all tasks in an application: Most non-trivial applications have multiple task types with different quality requirements. A single model choice is usually a compromise in the wrong direction for at least one task.
Layer 3: Deep Dive
How capability differences manifest
Frontier models outperform smaller models most visibly on:
- Instruction-following on complex, multi-constraint prompts: When the system prompt has many rules and the request has unusual edge cases, smaller models drop constraints more often.
- Long-context coherence: Over long documents (100K+ tokens), frontier models maintain more consistent reasoning and miss fewer details.
- Novel task generalisation: For tasks outside the common distribution of training data, frontier models hallucinate less and request clarification more reliably.
- Code generation on complex tasks: Multi-file refactors, algorithmic problems, language interop. Smaller models produce more plausible-but-wrong code.
For well-defined, high-volume tasks with clear templates (classification, extraction, summarisation against a fixed schema), the quality gaps narrow substantially.
Context window differences
Different models within the same provider family may have different context windows. Always check current documentation: context limits change with model updates. For tasks requiring very long context (full codebase analysis, long document Q&A), verify the target model’s limit before committing to an architecture.
Versioning and stability
Model providers use two naming conventions:
- Alias (e.g. `claude-opus-4-6`, `gpt-4o`, `gemini-1.5-pro`): always points to the current recommended version. Updates when the provider ships improvements. Simple to use; unpredictable in production.
- Pinned version (e.g. `claude-haiku-4-5-20251001`, `gpt-4o-2024-11-20`): fixed to a specific release. Output distribution only changes when you change it. Required for production. Check your provider's models page for the current versioned IDs.
A robust model management strategy:
- Use pinned versions in production and CI
- Subscribe to your provider’s changelog and model deprecation notices
- When a new version ships, run your eval set against it before promoting
- Treat model upgrades as software deployments: tested, staged, rollback-capable
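The "run your eval set before promoting" step reduces to a simple gate. A sketch, where `tolerance` is an assumed quality budget you set per feature:

```python
def should_promote(current_score: float, candidate_score: float,
                   tolerance: float = 0.01) -> bool:
    """Gate a model upgrade: promote the candidate version only if its
    eval score is no more than `tolerance` below the current pinned one."""
    return candidate_score >= current_score - tolerance
```

Wired into CI, this makes a model upgrade fail the build the same way a regression in any other dependency would.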
The cost of capability at scale
To make the economics concrete, here is a rough comparison using Anthropic's March 2026 pricing. Other providers are in a similar range, and the ratio between tiers matters more than the absolute numbers:
| Model tier | Input cost (Anthropic, March 2026) | 1M requests × 500 tokens |
|---|---|---|
| Frontier | $5 / MTok | ~$2,500 |
| Balanced | $3 / MTok | ~$1,500 |
| Fast | $1 / MTok | ~$500 |
Always verify pricing directly with your provider: rates change. The key takeaway is the 5× gap between frontier and fast tiers, which applies across most providers.
At 1M requests per day (a moderate consumer product), the difference between fast and frontier for a classification task is ~$730K annually. Model selection is an engineering decision with real P&L impact, whatever provider you use.
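The arithmetic behind these figures, as a quick sanity check (illustrative prices from the table above, input tokens only):

```python
def annual_cost(price_per_mtok: float, requests_per_day: int,
                tokens_per_request: int) -> float:
    """Annual input-token cost in dollars (output tokens excluded)."""
    mtok_per_day = requests_per_day * tokens_per_request / 1_000_000
    return price_per_mtok * mtok_per_day * 365

frontier = annual_cost(5.0, 1_000_000, 500)  # 500 MTok/day at $5/MTok
fast = annual_cost(1.0, 1_000_000, 500)      # same volume at $1/MTok
savings = frontier - fast                    # ≈ $730K per year
```

The same function makes it easy to re-run the comparison whenever your provider's prices or your traffic profile change.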
Fine-tuning vs prompt engineering vs model selection
When a model underperforms on your task, there are three levers:
| Lever | When to reach for it | Cost and complexity |
|---|---|---|
| Better prompt | Almost always: try this first | Cheapest; often sufficient |
| Different model tier | When the task is genuinely hard or genuinely simple | Medium; requires eval |
| Fine-tuning | When you have hundreds of curated examples and need consistent style or domain knowledge the base model lacks | Highest; significant data and ops investment |
Fine-tuning is often reached for too early. A well-constructed prompt with few-shot examples frequently matches fine-tuned performance on classification and extraction tasks, with none of the data collection or retraining overhead.
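As an illustration of the "better prompt first" lever: a hypothetical few-shot classification prompt (the labels and examples here are invented, not from any real system):

```python
# Hypothetical few-shot prompt: for label-style tasks this is often
# competitive with fine-tuning, with zero data-collection or training cost.
FEW_SHOT_SYSTEM = """Classify the support ticket. Reply with only the label.
Labels: BUG, BILLING, FEATURE, ACCOUNT, OTHER

Examples:
Ticket: "I was charged twice this month" -> BILLING
Ticket: "The export button crashes the app" -> BUG
Ticket: "Please add dark mode" -> FEATURE"""

def build_messages(ticket_text: str) -> list:
    """Format a ticket in the same shape as the few-shot examples."""
    return [{"role": "user", "content": f'Ticket: "{ticket_text}" ->'}]
```

Pair `FEW_SHOT_SYSTEM` with a fast-tier model and score it on your eval set before concluding that fine-tuning is necessary.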
Further reading
- Scaling Laws for Neural Language Models; Kaplan et al., 2020. The foundational paper on how capability scales with model size, data, and compute.
- Chinchilla: Training Compute-Optimal Large Language Models; Hoffmann et al., 2022. Revised the scaling laws and influenced how modern models are trained.
- Model overviews: each provider maintains a current list of models, versioned IDs, and context windows: Anthropic · OpenAI · Google.