Cyberax AI Playbook
cyberax.com
Explainer · Foundations · Local-OK

The economics of self-hosting AI

**Self-hosting AI** means running models on hardware you own (or rent), instead of paying a vendor like OpenAI or Anthropic per request through their API (the way one piece of software calls another). This page covers when the API math stops working and self-hosting starts paying off — with honest numbers on hardware, electricity, ops cost, and the engineering investment most teams underestimate.

At a glance
Last verified: May 2026
Problem solved: Run the honest economics on whether to self-host AI for a specific workload — comparing hardware, electricity, ops cost, and engineering investment against API costs at projected volume
Best for: Engineering leaders, infrastructure ops, founders evaluating the cloud-vs-self-hosted decision, and CFOs facing growing AI infrastructure costs
Tools: Llama, Mistral, Qwen, vLLM, Ollama, AWS, Azure, Google Cloud
Difficulty: Intermediate
Cost: $5,000 (entry workstation) → $20,000–$80,000 (production GPU rig) → $50,000+/month (multi-GPU cloud rental) → $200,000+/year (dedicated infrastructure team)

Open-source models like Llama, Mistral, and Qwen have made self-hosting a viable option for plenty of workloads.

The “we should self-host to save money” decision has produced more under-cooked architectural choices than perhaps any other in modern infrastructure. The headline savings look obvious: per-token API costs add up at volume, GPUs are buyable, open-source models are competitive. The hidden costs — engineering time, ops overhead, capacity planning, model upgrades, electricity, depreciation — often eat the savings and produce a worse product at the same total cost. The opposite mistake is just as common: teams stay committed to cloud APIs at scales where self-hosting would cut their infrastructure cost five- to ten-fold.

The honest answer is workload-by-workload, calculated on real numbers. This piece is the framework.

The mental model

The four cost components nobody fully accounts for

Self-hosting AI has four cost dimensions:

  • Capital cost. Hardware: GPUs, server, networking, storage. For meaningful AI inference at scale, this is $5,000–$80,000+ depending on capability. A single H100 server with proper supporting infrastructure is $50,000+; a consumer-grade workstation built around an RTX 4090 can handle smaller models at $5,000–$10,000.

  • Operating cost. Electricity, cooling, hosting (if colocated rather than on-prem). A GPU server under load draws 400–700W continuously; at $0.10–$0.30 per kWh, the electricity is $300–$1,500/year per GPU. Colocation adds another $200–$500/month for proper datacenter conditions.

  • Engineering cost. The biggest hidden cost. Building and maintaining production inference infrastructure — deployment, monitoring, fail-over, load balancing, model upgrades, performance tuning — is a real engineering function. Conservatively 0.25–1 FTE for a small operation; full team for serious scale. At $150,000–$300,000/year per engineer, this dominates the math for small deployments.

  • Capacity-planning cost. Cloud APIs scale automatically; self-hosted infrastructure does not. Over-provisioning is expensive (idle GPUs that still cost money); under-provisioning produces slow or failed inference under load. Capacity planning is ongoing work, and getting it wrong has visible product consequences.

The first two are easy to calculate. The last two are where most break-even analyses go wrong: they’re treated as zero in the optimistic self-host case and as infinite in the optimistic API case. Both treatments are wrong; the real numbers sit in between.
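The four components above can be rolled into one monthly figure. A minimal sketch in Python, where every input is an illustrative placeholder you would replace with your own numbers:

```python
# Illustrative monthly total cost of self-hosting; all figures are placeholders.

def self_host_monthly_cost(
    hardware_capital: float,        # up-front hardware spend ($)
    useful_life_years: float,       # amortisation period (GPUs: ~3 years)
    operating_monthly: float,       # electricity + colocation + monitoring ($/month)
    engineer_loaded_salary: float,  # loaded annual cost per engineer ($)
    engineer_fte_fraction: float,   # realistic ongoing allocation (0.25–1.0)
) -> float:
    capital_monthly = hardware_capital / (useful_life_years * 12)
    engineering_monthly = engineer_loaded_salary * engineer_fte_fraction / 12
    return capital_monthly + operating_monthly + engineering_monthly

# Example: $40k H100 server, $800/month power + colo, 0.5 FTE at $200k loaded.
total = self_host_monthly_cost(40_000, 3, 800, 200_000, 0.5)
print(round(total))  # roughly $10,244/month; engineering dominates
```

Note how the hardware line (about $1,100/month amortised) is a minority of the total; the engineering allocation is the term that decides the outcome at small scale.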

The break-even framework

When self-hosting actually pays off

For a specific workload, calculate:

  • API cost. Tokens per request × requests per month × API price per token. Be honest about the model tier you’d need (flagship vs cheap).
  • Self-host capacity needed. The hardware required to handle peak load, not average load (or accept slow responses at peak).
  • Self-host capital amortised. Hardware cost / expected useful life (typically 3 years for GPUs).
  • Self-host operating cost. Electricity, hosting, monitoring infrastructure.
  • Self-host engineering allocation. Realistic FTE percentage × loaded salary.

Self-hosting wins when the self-host total is below the API total. Typical break-even thresholds:

  • Below $1,000–$2,000/month in API costs: Cloud almost always wins. Engineering overhead alone exceeds the saving.
  • $5,000–$15,000/month API costs: Borderline. Depends on workload predictability (high variance favours cloud), engineering capacity, and how much the workload would benefit from customisation that is only possible when self-hosting.
  • Above $25,000/month API costs: Self-hosting often wins for predictable workloads. The capital cost amortises against the API spend within months.
  • Above $100,000/month API costs: Self-hosting almost always wins for the predictable bulk of the workload. Hybrid is common (self-host for the bulk, API for spike capacity).

The break-even isn’t a single number; it depends on workload characteristics. Run it on your specific numbers.
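A minimal version of the comparison, with hypothetical volumes and prices (every figure below is a placeholder, not a benchmark):

```python
# Compare monthly API spend against self-host total; illustrative numbers throughout.

def api_monthly_cost(requests_per_month, tokens_in, tokens_out,
                     price_in_per_m, price_out_per_m):
    """Tokens per request × requests per month × price per million tokens."""
    return requests_per_month * (
        tokens_in * price_in_per_m + tokens_out * price_out_per_m
    ) / 1_000_000

def self_host_monthly_cost(hardware_capital, useful_life_years,
                           operating_monthly, engineering_monthly):
    """Amortised capital + operating + engineering allocation, per month."""
    return (hardware_capital / (useful_life_years * 12)
            + operating_monthly + engineering_monthly)

# 2M requests/month, 1,500 input / 500 output tokens, flagship pricing ($5 / $25 per M).
api = api_monthly_cost(2_000_000, 1_500, 500, 5.0, 25.0)   # $40,000/month
# $40k server over 3 years, $800/month opex, 0.5 FTE at $200k loaded (~$8,333/month).
selfhost = self_host_monthly_cost(40_000, 3, 800, 8_333)   # ~$10,244/month
print(api > selfhost)  # True: at this volume, self-hosting wins on raw cost
```

At a tenth of that volume the API total drops to $4,000/month while the self-host total barely moves, which is exactly the "cloud almost always wins" regime above.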

The numbers

Realistic component costs

Entry-tier GPU workstation (RTX 4090; runs ~70B-class models only with aggressive quantisation): $5,000–$10,000 capital, 3-year useful life
Production single-GPU server (A100 / H100): $25,000–$50,000 capital plus rack, networking, storage
Multi-GPU production rig (8x H100 / H200): $200,000–$500,000+ capital
Cloud GPU rental (H100 instance): $3–$8 per hour ($2,200–$5,800 per month at continuous use)
Electricity per GPU server (continuous operation): $30–$100 per month
Colocation (small rack with GPU server): $500–$2,000 per month
Engineering time (small self-host operation): 0.25–1 FTE; $40,000–$300,000 per year loaded
API inference cost (small/mid-tier model): $0.10–$1 per million input tokens
API inference cost (flagship model): $3–$15 per million input tokens; $15–$75 per million output
Break-even where self-hosting typically wins: above ~$25,000/month in flagship-API spend for predictable workloads (borderline from ~$5,000–$15,000/month)
Engineering effort to deploy production self-hosted inference: 2–6 weeks initial setup; 0.25–1 FTE ongoing

The capital cost amortised over 3 years is often smaller than expected; the ongoing engineering cost is often larger than expected. Calculate both honestly.
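As a sanity check on the amortisation claim, spreading the table's $25,000–$50,000 single-GPU server over three years of continuous use:

```python
# Amortised hourly cost of an owned GPU server vs cloud rental; figures illustrative.
HOURS_PER_YEAR = 24 * 365  # 8,760

def owned_hourly(capital, useful_life_years=3, utilisation=1.0):
    """Capital spread over useful life, scaled by the fraction of hours actually used."""
    return capital / (useful_life_years * HOURS_PER_YEAR * utilisation)

low, high = owned_hourly(25_000), owned_hourly(50_000)
print(f"${low:.2f}-${high:.2f}/hour")  # ~$0.95-$1.90/hour owned, vs $3-$8/hour rented
```

The `utilisation` parameter matters: at 25% utilisation the owned hourly cost quadruples, which is why rental beats ownership for bursty or experimental workloads.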

Common mistakes

Where the calculation typically goes wrong

Ignoring engineering time entirely. The most common error. Self-hosting requires ongoing engineering investment; pretending it doesn’t produces a calculation that favours self-hosting at scales where it shouldn’t.

Comparing against the wrong API tier. Some workloads need flagship models; some don’t. Compare against the realistic API tier for your workload, not the cheapest available.

Over-provisioning for hypothetical peak. Provisioning self-hosted infrastructure for “what if usage 10x’s overnight” produces expensive idle capacity. Plan for realistic growth, not imagined spikes.

Underestimating capability gaps. Open-source models are catching up to flagship proprietary but still trail on the hardest reasoning. Self-hosting saves money on routine work; using self-hosted for the workloads that genuinely need flagship capability produces a worse product.

Treating it as a permanent decision. The math changes. Re-run it annually; capability and pricing both shift faster than self-hosting infrastructure decisions tend to be revisited.

What's next

Related work

For the broader open-source-vs-proprietary framework, see Open-source vs proprietary AI — practical tradeoffs. For the local-vs-cloud architectural choice, see When to run AI locally vs in the cloud. For the underlying token-and-cost mechanics, see Tokens, context windows, and what they cost. For the broader vendor-lock-in framework that interacts with self-hosting, see Vendor lock-in risks with AI.

Common questions

FAQ

What about renting GPUs on the cloud instead of buying?

Cloud GPU rental is the middle path. Higher per-hour cost than owning ($3–$8/hour for H100 vs amortised $1–$2/hour for owned); lower commitment than buying; better-than-API economics at high volume. Useful for: testing self-hosting before committing to hardware, handling variable workloads, accessing GPU capacity for short-term needs. Less economical than owning for steady high-volume use.

How do we handle model upgrades when self-hosting?

Manually. Open-source models release new versions on the lab's cadence (Llama every 6–12 months, Mistral similar). You update your deployed model, re-test on your workload, switch over. This is engineering work that's invisible when using APIs; the version-management overhead is one of the hidden costs of self-hosting.

Should we use specialised inference providers (Replicate, Together, Fireworks) instead of self-hosting?

Reasonable middle path. These providers host open-source models for you, charging per-call rather than per-hour. Cheaper than running idle GPU capacity; more expensive than self-hosting at steady high volume. Often the right answer for teams that want open-source models without operating the infrastructure.

What about using inference-optimisation tools (vLLM, TensorRT, llama.cpp)?

Materially affects the math. vLLM and similar tools improve inference throughput 2–4x on the same hardware, effectively making each GPU several times more productive. Include them in your capacity-planning calculation; without them, you would provision far more hardware than the optimised case actually requires.
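A rough sketch of how a throughput multiplier feeds into the GPU count you provision; the request rates here are hypothetical:

```python
import math

def gpus_needed(peak_requests_per_s, base_requests_per_s_per_gpu, speedup=1.0):
    """GPUs required at peak load; rounds up because you can't buy a fraction of a GPU."""
    return math.ceil(peak_requests_per_s / (base_requests_per_s_per_gpu * speedup))

# Hypothetical workload: 100 req/s at peak, 5 req/s per GPU on a naive serving stack.
print(gpus_needed(100, 5))             # 20 GPUs unoptimised
print(gpus_needed(100, 5, speedup=3))  # 7 GPUs with a ~3x throughput gain (vLLM-class)
```

At $25,000–$50,000 per GPU server, that difference in provisioned count is often larger than every other line in the break-even calculation.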


Change history (1 entry)
  • 2026-05-13 Initial publication.