🤖 AI Explained
Emerging area · 5 min read

Hosting Options

Choosing where to run your model determines your cost structure, latency floor, and operational burden. Understanding the tradeoffs between API inference, self-hosted, and cloud-managed endpoints lets you pick the right option for each workload rather than defaulting to whatever is easiest to start with.

Layer 1: Surface

There are three fundamentally different ways to run an LLM in production: pay a provider per token (API inference), run the model on your own hardware (self-hosted), or use a cloud service that manages the infrastructure but lets you bring your own model (cloud-managed endpoints; AWS SageMaker and Google Vertex AI are examples of the category, not the only options).

What each actually means:

  • API inference: you send requests to a hosted endpoint (OpenAI, Anthropic, Cohere, etc.) and pay per token. Zero infrastructure to manage. The provider handles availability, scaling, and model updates.
  • Self-hosted: you run the model on GPU instances you control. You pay for compute regardless of utilization and own every failure. In exchange, you get full control over the model, its weights, and the serving configuration.
  • Cloud-managed endpoints: a hybrid. The cloud provider manages the compute layer (GPU provisioning, auto-scaling, health checks) while you supply the model weights and serving container. More control than API inference, less operational burden than pure self-hosted.

Why it matters

The wrong choice doesn’t show up immediately: it shows up at scale. Teams that start with API inference sometimes stay there too long and overpay by 10x. Teams that jump to self-hosted before they have the volume to justify it waste engineering effort managing infrastructure that adds no product value. The decision changes as your workload grows.

Production Gotcha

API inference costs look cheap per call but become expensive at scale before self-hosted amortizes. The crossover point depends on request volume and model size; most teams hit it later than expected and over-engineer too early. Run the math at your actual projected volume before committing to self-hosted infrastructure: the operational overhead of running GPU instances 24/7 is real and often underestimated.

The mistake is modeling cost at current volume, not projected volume. A team processing 100K requests/day may see self-hosted break even at 500K requests/day, which sounds close. But getting from 100K to 500K takes longer than expected, and in the meantime the self-hosted infrastructure needs to be built, operated, and maintained. Most teams should stay on API inference longer than feels comfortable.
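A hedged sketch of that projection, assuming a blended $0.50 per million tokens (an illustrative figure, not any provider's actual price):

```python
# Sketch: monthly API-inference bill at current vs. projected volume.
# The blended price below is an illustrative assumption, not a quote.

BLENDED_PRICE_PER_MTOK = 0.50  # assumed $/M tokens, input + output combined

def monthly_api_cost(daily_requests: int, tokens_per_request: int,
                     price_per_mtok: float = BLENDED_PRICE_PER_MTOK) -> float:
    monthly_tokens = daily_requests * tokens_per_request * 30
    return monthly_tokens / 1_000_000 * price_per_mtok

# The scenario from the text: 100K requests/day today, 500K at break-even.
current = monthly_api_cost(100_000, 1_500)
projected = monthly_api_cost(500_000, 1_500)
print(f"current: ${current:,.0f}/month, at break-even volume: ${projected:,.0f}/month")
# current: $2,250/month, at break-even volume: $11,250/month
```

The gap between those two numbers is the budget available for building and operating self-hosted infrastructure; if the projected savings are smaller than the engineering cost, staying on API inference is the cheaper path.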


Layer 2: Guided

Decision matrix

The five variables that drive the hosting decision:

| Factor | API Inference | Self-Hosted | Cloud-Managed |
|---|---|---|---|
| Latency SLA | Provider-controlled; typically 100–500ms TTFT (time to first token) for frontier models | You control serving config; can be sub-50ms for smaller models | Provider manages infra; latency similar to self-hosted |
| Data residency | Data leaves your perimeter; provider processes it | Data stays on your hardware; full control | Data stays in your cloud account; check provider’s data handling |
| Cost at scale | Linear with tokens; expensive at high volume | Fixed GPU costs; cheap per-token at high utilization | Fixed + managed overhead; typically 20–40% more than raw self-hosted |
| Operational burden | Near zero; provider handles all infrastructure | High; you own availability, scaling, updates, driver management | Moderate; provider handles compute, you handle model/container |
| Model flexibility | Limited to provider’s model catalog | Any model with available weights | Any model you can containerize |

When each makes sense

from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    daily_requests: int
    avg_tokens_per_request: int   # input + output combined
    latency_sla_ms: int           # max acceptable TTFT
    data_sensitivity: str         # "public", "internal", "regulated"
    team_ml_ops_capacity: str     # "none", "limited", "dedicated"

def recommend_hosting(profile: WorkloadProfile) -> str:
    """
    Rough decision logic — real decisions require cost modeling at your specific prices.
    """
    # Hard constraints first
    if profile.data_sensitivity == "regulated":
        # HIPAA, FedRAMP, etc. — data cannot leave your boundary
        if profile.team_ml_ops_capacity == "none":
            return "cloud-managed"  # need the control without the ops burden
        return "self-hosted"

    # Volume-based cost crossover (rough heuristic)
    monthly_tokens = profile.daily_requests * profile.avg_tokens_per_request * 30

    if monthly_tokens < 500_000_000:  # under ~500M tokens/month
        return "api-inference"        # ops savings outweigh cost premium

    if monthly_tokens > 5_000_000_000:  # over ~5B tokens/month
        if profile.team_ml_ops_capacity == "dedicated":
            return "self-hosted"      # volume justifies ops investment
        return "cloud-managed"        # volume justifies it, but need managed ops

    # Mid-range: depends on ops capacity
    if profile.team_ml_ops_capacity == "dedicated":
        return "self-hosted"
    return "api-inference"            # default: avoid ops complexity until forced

# Example: early-stage product
startup_profile = WorkloadProfile(
    daily_requests=10_000,
    avg_tokens_per_request=2_000,
    latency_sla_ms=500,
    data_sensitivity="internal",
    team_ml_ops_capacity="limited",
)
print(recommend_hosting(startup_profile))  # "api-inference"

# Example: high-volume internal tool, sensitive data
enterprise_profile = WorkloadProfile(
    daily_requests=500_000,
    avg_tokens_per_request=3_000,
    latency_sla_ms=200,
    data_sensitivity="regulated",
    team_ml_ops_capacity="dedicated",
)
print(recommend_hosting(enterprise_profile))  # "self-hosted"

Cost modeling: the crossover calculation

Before committing to self-hosted, model the actual crossover:

def calculate_crossover_volume(
    api_price_per_mtok: float,          # e.g., 0.15 (input) to 0.60 (output) $/M tokens
    self_hosted_gpu_cost_per_hour: float,  # e.g., $2.50/hr for an A10G
    tokens_per_hour_per_gpu: int,       # throughput you can achieve
    num_gpus: int,                      # minimum cluster size for your model
) -> dict:
    """
    Find the monthly token volume where self-hosted becomes cheaper than API inference.
    """
    monthly_gpu_cost = self_hosted_gpu_cost_per_hour * 24 * 30 * num_gpus
    cost_per_mtok_self_hosted = (monthly_gpu_cost / (tokens_per_hour_per_gpu * 24 * 30 * num_gpus)) * 1_000_000

    # Crossover: monthly_tokens * api_price_per_mtok == monthly_gpu_cost
    crossover_mtok = monthly_gpu_cost / api_price_per_mtok
    crossover_monthly_tokens = crossover_mtok * 1_000_000

    return {
        "monthly_gpu_cost_usd": monthly_gpu_cost,
        "self_hosted_cost_per_mtok": round(cost_per_mtok_self_hosted, 4),
        "api_cost_per_mtok": api_price_per_mtok,
        "crossover_monthly_tokens": int(crossover_monthly_tokens),
        "crossover_daily_tokens": int(crossover_monthly_tokens / 30),
    }

# Example: 7B model, single A10G
result = calculate_crossover_volume(
    api_price_per_mtok=0.30,
    self_hosted_gpu_cost_per_hour=2.50,
    tokens_per_hour_per_gpu=4_000_000,
    num_gpus=1,
)
# With these inputs: ~$1,800/month fixed cost, self-hosted ≈ $0.625/MTok,
# crossover at ~6B tokens/month (~200M tokens/day). Note that one A10G at
# 4M tokens/hour tops out near 2.9B tokens/month, below the crossover, so
# at these prices self-hosted never beats $0.30/MTok API pricing; it only
# wins with higher throughput, cheaper GPUs, or a pricier API.

Data residency and compliance

For regulated workloads (healthcare, finance, government), data residency is not a tradeoff: it is a hard constraint. API inference means your data transits a third-party provider’s infrastructure. Even with a DPA (Data Processing Agreement) and SOC 2 certification, some regulatory frameworks require data to remain under your direct control.

Cloud-managed endpoints (your model, deployed in your cloud account) often satisfy these requirements while avoiding the full operational burden of self-hosted. Check your specific regulatory requirements: “cloud” does not automatically mean non-compliant, but it requires due diligence.


Layer 3: Deep Dive

The hidden costs of self-hosted

The GPU cost is visible. These costs often are not:

| Hidden cost | Typical impact |
|---|---|
| Engineering time | Model serving, autoscaling, monitoring: typically 1–2 engineer-months to get right |
| GPU idle time | A cluster that sits at 20% utilization overnight still pays full GPU-hour rates |
| Driver and CUDA maintenance | NVIDIA driver updates can break serving stacks; requires validation before rolling out |
| Model update overhead | Switching to a new model version requires re-testing serving config, quantization, and performance |
| On-call burden | GPU instance failures, OOM crashes, serving process hangs: someone has to respond |

For teams without dedicated MLOps capacity, these costs frequently exceed the savings from lower per-token prices.
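The idle-time cost deserves special attention, because effective per-token cost scales inversely with utilization. A rough sketch, reusing the A10G figures from the crossover example (price and throughput are illustrative assumptions):

```python
# Effective self-hosted cost per M tokens as a function of utilization.
# GPU price and peak throughput are illustrative assumptions.

def effective_cost_per_mtok(gpu_cost_per_hour: float,
                            peak_tokens_per_hour: float,
                            utilization: float) -> float:
    # You pay for every GPU-hour, but only generate tokens in proportion
    # to utilization, so cost per token scales as 1 / utilization.
    tokens_per_hour = peak_tokens_per_hour * utilization
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

busy = effective_cost_per_mtok(2.50, 4_000_000, 0.90)  # well-utilized cluster
idle = effective_cost_per_mtok(2.50, 4_000_000, 0.20)  # mostly-idle cluster
print(f"90% utilization: ${busy:.3f}/MTok, 20% utilization: ${idle:.3f}/MTok")
# 90% utilization: $0.694/MTok, 20% utilization: $3.125/MTok
```

A cluster sized for daytime peak but idle overnight can end up costing several times more per token than the headline GPU price suggests.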

Hybrid architectures

Most production systems at scale end up with a hybrid:

| Workload type | Recommended hosting |
|---|---|
| Interactive user-facing queries (latency-sensitive) | API inference or cloud-managed |
| High-volume batch processing (cost-sensitive) | Self-hosted with spot instances |
| Regulated data processing | Self-hosted or cloud-managed in compliant region |
| Experimental / fine-tuned models | Self-hosted (model not available via API) |
| Fallback / overflow capacity | API inference (no minimum commitment) |

The pattern: use API inference for the real-time tier where latency and availability matter most, and self-hosted spot instances for background batch jobs where cost matters more than tail latency.
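That split can be expressed as a small routing rule. A minimal sketch; the tier names and request fields here are hypothetical, and a production router would also track queue depth and fallback health:

```python
from dataclasses import dataclass

# Hypothetical request shape; field names are illustrative.
@dataclass
class InferenceRequest:
    latency_sensitive: bool   # real-time user-facing vs. background batch
    regulated_data: bool      # data that must stay inside your boundary

def route(request: InferenceRequest) -> str:
    """Map a request to a hosting tier, following the hybrid pattern above."""
    if request.regulated_data:
        return "self-hosted"       # hard constraint: data residency first
    if request.latency_sensitive:
        return "api-inference"     # real-time tier: provider availability
    return "self-hosted-spot"      # batch tier: cheapest compute wins

print(route(InferenceRequest(latency_sensitive=True, regulated_data=False)))   # api-inference
print(route(InferenceRequest(latency_sensitive=False, regulated_data=False)))  # self-hosted-spot
```

Checking hard constraints (residency) before cost preferences mirrors the decision logic in `recommend_hosting` above.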

Model selection flexibility tradeoff

API inference restricts you to the provider’s model catalog. This matters in two scenarios:

  1. Fine-tuned models: if your use case requires a fine-tuned model, you must host it yourself or use a provider that supports custom model deployment.
  2. Open-weight models: Llama, Mistral, and other open-weight models may outperform closed models on your specific task. Many API providers also serve popular open-weight models, but if you need a specific version, a fine-tuned variant, or a model not in any provider’s catalog, self-hosted or cloud-managed deployment is the only option.

The flexibility argument for self-hosted is strongest when you have specific model requirements that no API provider can meet.


Hosting Options: Check your understanding

Q1

A team is processing 200,000 requests per day with an average of 1,500 tokens per request. Their current API inference bill is growing rapidly. An engineer proposes switching to self-hosted on a 2-GPU cluster. What is the most important calculation to run before committing?

Q2

A healthcare company wants to use an LLM to process patient records. Their compliance team says patient data cannot leave their infrastructure. Which hosting option should they evaluate first?

Q3

A team is building a new internal tool. They have one ML engineer part-time. They estimate 20,000 requests per day initially, scaling to 100,000 in 6 months. They want the fastest time to production. What is the most appropriate starting point?

Q4

A team runs a nightly batch pipeline (latency-insensitive) and a real-time chat product (latency-sensitive). Their budget is constrained. Which hosting pattern best serves both workloads?

Q5

A team reports that switching from API inference to self-hosted cut their per-token cost by 70%. However, three months later, their total monthly AI infrastructure spend is actually higher than before. What most likely explains this?