Cyberax AI Playbook
cyberax.com
Explainer · Foundations · Local-OK

The economics of self-hosting AI

**Self-hosting AI** means running models on hardware you own (or rent), instead of paying a vendor like OpenAI or Anthropic per request through their API (the way one piece of software calls another). This page covers when the API math stops working and self-hosting starts paying off — with honest numbers on hardware, electricity, ops cost, and the engineering investment most teams underestimate.

At a glance
Last verified: May 2026
Problem solved: Run the honest economics on whether to self-host AI for a specific workload — comparing hardware, electricity, ops cost, and engineering investment against API costs at projected volume
Best for: Engineering leaders, infrastructure ops, founders evaluating the cloud-vs-self-hosted decision, and CFOs facing growing AI infrastructure costs
Tools: Llama, Mistral, Qwen, vLLM, Ollama, AWS, Azure, Google Cloud
Difficulty: Intermediate
Cost: $5,000 (entry workstation) → $20,000–$80,000 (production GPU rig) → $50,000+/month (multi-GPU cloud rental) → $200,000+/year (dedicated infrastructure team)

Open-source models like Llama, Mistral, and Qwen have made self-hosting a viable option for plenty of workloads.

The “we should self-host to save money” decision has produced more under-cooked architectural choices than perhaps any other in modern infrastructure. The headline savings look obvious: per-token API costs add up at volume, GPUs are buyable, open-source models are competitive. The hidden costs — engineering time, ops overhead, capacity planning, model upgrades, electricity, depreciation — often eat the savings and produce a worse product at the same total cost. The opposite mistake is just as common: teams stay committed to cloud APIs at scales where self-hosting would cut their infrastructure cost five- to ten-fold.

The honest answer is workload-by-workload, calculated on real numbers. This piece is the framework.

The mental model

The four cost components nobody fully accounts for

Self-hosting AI has four cost dimensions:

  • Capital cost. Hardware: GPUs, server, networking, storage. For meaningful AI inference at scale, this is $5,000–$80,000+ depending on capability. A single H100 server with proper supporting infrastructure is $50,000+; a consumer-grade workstation built around an RTX 4090 can handle smaller models at $5,000–$10,000.

  • Operating cost. Electricity, cooling, hosting (if colocated rather than on-prem). A GPU server under load draws 400–700W continuously; at $0.10–$0.30 per kWh, the electricity is $300–$1,500/year per GPU. Colocation adds another $200–$500/month for proper datacenter conditions.

  • Engineering cost. The biggest hidden cost. Building and maintaining production inference infrastructure — deployment, monitoring, fail-over, load balancing, model upgrades, performance tuning — is a real engineering function. Conservatively 0.25–1 FTE for a small operation; full team for serious scale. At $150,000–$300,000/year per engineer, this dominates the math for small deployments.

  • Capacity-planning cost. Cloud APIs scale automatically; self-hosted infrastructure does not. Over-provisioning is expensive (idle GPUs that still cost money); under-provisioning produces slow or failed inference under load. Capacity planning is ongoing work, and getting it wrong has visible product consequences.

The first two are easy to calculate. The last two are where most break-even analyses go wrong: they’re treated as zero in the optimistic self-host case and as infinite in the optimistic API case. Both treatments are wrong; the real numbers sit in between.
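The four components above can be rolled into one monthly figure. A minimal sketch in Python, where every input is an illustrative placeholder you would replace with your own numbers:

```python
# Illustrative monthly total cost of self-hosting; all figures are placeholders.

def self_host_monthly_cost(
    hardware_capital: float,        # up-front hardware spend ($)
    useful_life_years: float,       # amortisation period (GPUs: ~3 years)
    operating_monthly: float,       # electricity + colocation + monitoring ($/month)
    engineer_loaded_salary: float,  # loaded annual cost per engineer ($)
    engineer_fte_fraction: float,   # realistic ongoing allocation (0.25–1.0)
) -> float:
    capital_monthly = hardware_capital / (useful_life_years * 12)
    engineering_monthly = engineer_loaded_salary * engineer_fte_fraction / 12
    return capital_monthly + operating_monthly + engineering_monthly

# Example: $40k H100 server, $800/month power + colo, 0.5 FTE at $200k loaded.
total = self_host_monthly_cost(40_000, 3, 800, 200_000, 0.5)
print(round(total))  # roughly $10,244/month; engineering dominates
```

Note how the hardware line (about $1,100/month amortised) is a minority of the total; the engineering allocation is the term that decides the outcome at small scale.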

The break-even framework

When self-hosting actually pays off

For a specific workload, calculate:

  • API cost. Tokens per request × requests per month × API price per token. Be honest about the model tier you’d need (flagship vs cheap).
  • Self-host capacity needed. The hardware required to handle peak load, not average load (or accept slow responses at peak).
  • Self-host capital amortised. Hardware cost / expected useful life (typically 3 years for GPUs).
  • Self-host operating cost. Electricity, hosting, monitoring infrastructure.
  • Self-host engineering allocation. Realistic FTE percentage × loaded salary.

Self-hosting wins when the self-host total is below the API total. Typical break-even thresholds:

  • Below $1,000–$2,000/month in API costs: Cloud almost always wins. Engineering overhead alone exceeds the saving.
  • $5,000–$15,000/month API costs: Borderline. Depends on workload predictability (high variance favours cloud), engineering capacity, and how much the workload would benefit from customisation that is only possible when self-hosting.
  • Above $25,000/month API costs: Self-hosting often wins for predictable workloads. The capital cost amortises against the API spend within months.
  • Above $100,000/month API costs: Self-hosting almost always wins for the predictable bulk of the workload. Hybrid is common (self-host for the bulk, API for spike capacity).

The break-even isn’t a single number; it depends on workload characteristics. Run it on your specific numbers.
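A minimal version of the comparison, with hypothetical volumes and prices (every figure below is a placeholder, not a benchmark):

```python
# Compare monthly API spend against self-host total; illustrative numbers throughout.

def api_monthly_cost(requests_per_month, tokens_in, tokens_out,
                     price_in_per_m, price_out_per_m):
    """Tokens per request × requests per month × price per million tokens."""
    return requests_per_month * (
        tokens_in * price_in_per_m + tokens_out * price_out_per_m
    ) / 1_000_000

def self_host_monthly_cost(hardware_capital, useful_life_years,
                           operating_monthly, engineering_monthly):
    """Amortised capital + operating + engineering allocation, per month."""
    return (hardware_capital / (useful_life_years * 12)
            + operating_monthly + engineering_monthly)

# 2M requests/month, 1,500 input / 500 output tokens, flagship pricing ($5 / $25 per M).
api = api_monthly_cost(2_000_000, 1_500, 500, 5.0, 25.0)   # $40,000/month
# $40k server over 3 years, $800/month opex, 0.5 FTE at $200k loaded (~$8,333/month).
selfhost = self_host_monthly_cost(40_000, 3, 800, 8_333)   # ~$10,244/month
print(api > selfhost)  # True: at this volume, self-hosting wins on raw cost
```

At a tenth of that volume the API total drops to $4,000/month while the self-host total barely moves, which is exactly the "cloud almost always wins" regime above.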

The numbers

Realistic component costs

Entry-tier GPU workstation (RTX 4090; runs ~70B-class models only with aggressive quantisation): $5,000–$10,000 capital, 3-year useful life
Production single-GPU server (A100 / H100): $25,000–$50,000 capital plus rack, networking, storage
Multi-GPU production rig (8x H100 / H200): $200,000–$500,000+ capital
Cloud GPU rental (H100 instance): $3–$8 per hour ($2,200–$5,800 per month at continuous use)
Electricity per GPU server (continuous operation): $30–$100 per month
Colocation (small rack with GPU server): $500–$2,000 per month
Engineering time (small self-host operation): 0.25–1 FTE; $40,000–$300,000 per year loaded
API inference cost (small/mid-tier model): $0.10–$1 per million input tokens
API inference cost (flagship model): $3–$15 per million input tokens; $15–$75 per million output
Break-even where self-hosting typically wins: above ~$25,000/month in flagship-API spend for predictable workloads (borderline from ~$5,000–$15,000/month)
Engineering effort to deploy production self-hosted inference: 2–6 weeks initial setup; 0.25–1 FTE ongoing

The capital cost amortised over 3 years is often smaller than expected; the ongoing engineering cost is often larger than expected. Calculate both honestly.
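As a sanity check on the amortisation claim, spreading the table's $25,000–$50,000 single-GPU server over three years of continuous use:

```python
# Amortised hourly cost of an owned GPU server vs cloud rental; figures illustrative.
HOURS_PER_YEAR = 24 * 365  # 8,760

def owned_hourly(capital, useful_life_years=3, utilisation=1.0):
    """Capital spread over useful life, scaled by the fraction of hours actually used."""
    return capital / (useful_life_years * HOURS_PER_YEAR * utilisation)

low, high = owned_hourly(25_000), owned_hourly(50_000)
print(f"${low:.2f}-${high:.2f}/hour")  # ~$0.95-$1.90/hour owned, vs $3-$8/hour rented
```

The `utilisation` parameter matters: at 25% utilisation the owned hourly cost quadruples, which is why rental beats ownership for bursty or experimental workloads.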

Common mistakes

Where the calculation typically goes wrong

Ignoring engineering time entirely. The most common error. Self-hosting requires ongoing engineering investment; pretending it doesn’t produces a calculation that favours self-hosting at scales where it shouldn’t.

Comparing against the wrong API tier. Some workloads need flagship models; some don’t. Compare against the realistic API tier for your workload, not the cheapest available.

Over-provisioning for hypothetical peak. Provisioning self-hosted infrastructure for “what if usage 10x’s overnight” produces expensive idle capacity. Plan for realistic growth, not imagined spikes.

Underestimating capability gaps. Open-source models are catching up to flagship proprietary but still trail on the hardest reasoning. Self-hosting saves money on routine work; using self-hosted for the workloads that genuinely need flagship capability produces a worse product.

Treating it as a permanent decision. The math changes. Re-run it annually; capability and pricing both shift faster than self-hosting infrastructure decisions tend to be revisited.

What's next

Related work

For the broader open-source-vs-proprietary framework, see Open-source vs proprietary AI — practical tradeoffs. For the local-vs-cloud architectural choice, see When to run AI locally vs in the cloud. For the underlying token-and-cost mechanics, see Tokens, context windows, and what they cost. For the broader vendor-lock-in framework that interacts with self-hosting, see Vendor lock-in risks with AI.

Common questions

FAQ

What about renting GPUs on the cloud instead of buying?

Cloud GPU rental is the middle path. Higher per-hour cost than owning ($3–$8/hour for H100 vs amortised $1–$2/hour for owned); lower commitment than buying; better-than-API economics at high volume. Useful for: testing self-hosting before committing to hardware, handling variable workloads, accessing GPU capacity for short-term needs. Less economical than owning for steady high-volume use.

How do we handle model upgrades when self-hosting?

Manually. Open-source models release new versions on the lab's cadence (Llama every 6–12 months, Mistral similar). You update your deployed model, re-test on your workload, switch over. This is engineering work that's invisible when using APIs; the version-management overhead is one of the hidden costs of self-hosting.

Should we use specialised inference providers (Replicate, Together, Fireworks) instead of self-hosting?

Reasonable middle path. These providers host open-source models for you, charging per-call rather than per-hour. Cheaper than running idle GPU capacity; more expensive than self-hosting at steady high volume. Often the right answer for teams that want open-source models without operating the infrastructure.

What about using inference-optimisation tools (vLLM, TensorRT, llama.cpp)?

Materially affects the math. vLLM and similar tools improve inference throughput 2–4x on the same hardware, effectively making each GPU several times more productive. Include them in your capacity-planning calculation; without them, you would provision far more hardware than the optimised case actually requires.
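A rough sketch of how a throughput multiplier feeds into the GPU count you provision; the request rates here are hypothetical:

```python
import math

def gpus_needed(peak_requests_per_s, base_requests_per_s_per_gpu, speedup=1.0):
    """GPUs required at peak load; rounds up because you can't buy a fraction of a GPU."""
    return math.ceil(peak_requests_per_s / (base_requests_per_s_per_gpu * speedup))

# Hypothetical workload: 100 req/s at peak, 5 req/s per GPU on a naive serving stack.
print(gpus_needed(100, 5))             # 20 GPUs unoptimised
print(gpus_needed(100, 5, speedup=3))  # 7 GPUs with a ~3x throughput gain (vLLM-class)
```

At $25,000–$50,000 per GPU server, that difference in provisioned count is often larger than every other line in the break-even calculation.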


Change history (1 entry)
  • 2026-05-13 Initial publication.