Cyberax AI Playbook
cyberax.com
Explainer · Tool Decisions · Local-OK

When to run AI locally vs in the cloud

If you're an engineering leader or founder whose AI bill is starting to matter, the "self-host or use the cloud?" decision rarely has one right answer. What follows is a practical framework: the break-even math, the categories of workload that belong on each side, and the hybrid pattern most teams land on in 2026.

At a glance
Last verified · May 2026
Problem solved · Decide which AI workloads to run on your own hardware vs send to a cloud API — without falling for either the "always cloud" or "always local" framing
Best for · Engineering leaders, founders making infra calls, ops leads with privacy or cost constraints, anyone whose AI bill is starting to matter
Tools · Ollama, LM Studio, Llama, OpenAI API, Anthropic API
Difficulty · Intermediate

If you’ve been weighing whether to self-host AI, the conversation usually polarises into two camps. One side sees the per-token API bill at scale and assumes self-hosting is the answer. The other points to the operational complexity and the model-quality gap and assumes the cloud is always right.

Each side wins specific categories of workload. The best architecture in 2026 is hybrid — most of the work in the cloud, specific workloads on your hardware.

This piece is a decision framework, workload by workload. It pairs naturally with Tokens, context windows, and what they cost — that piece sets up the unit economics; this one tells you where each workload should run.

The decision rule

A small decision rule, then the math

Default to cloud APIs. For most teams, most workloads, the right starting point is a cloud API (OpenAI, Anthropic, Google, or one of the open-weight hosts like Together AI / Groq / Fireworks). Reasons: you get the strongest available models, no operational burden, scaling is the vendor’s problem, and the cost is small until volume is high.

Move specific workloads local when all three apply: (a) the workload is high-volume and predictable (you can forecast tokens-per-day), (b) the per-token cost at cloud pricing is becoming a real line item (rough threshold: $1,000+/month on that one workload), and (c) you can use an open-weight model (Llama, Qwen, Mistral, DeepSeek) without unacceptable quality loss. Common fits: classification, embedding, summarisation of standard documents, simple structured extraction.
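Criterion (b) is a quick sanity-check with arithmetic. A minimal sketch — the 4M tokens/day volume and $3 per million tokens are hypothetical figures, not any vendor's actual pricing:

```python
def monthly_cloud_cost(tokens_per_day: float, usd_per_million_tokens: float) -> float:
    """Rough monthly spend for one workload at cloud per-token pricing."""
    return tokens_per_day / 1_000_000 * usd_per_million_tokens * 30

# Hypothetical: 4M tokens/day at $3 per million tokens
print(monthly_cloud_cost(4_000_000, 3.0))  # 360.0 -- under the ~$1,000/month threshold
```

At that volume the workload stays in the cloud; it only crosses the rough $1,000/month line somewhere past 11M tokens/day at that price.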

Move everything local when you have a hard data-residency or privacy requirement that makes cloud APIs unsuitable — regulated content (PHI, classified, attorney-client), specific jurisdictional constraints, customer contracts that prohibit third-party processing. The cost of self-hosting is justified by the binary nature of the constraint, not by the math.

Everything else is hybrid in practice.
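The three rules collapse into a few lines of code. A sketch only — the inputs are judgment calls rather than measurable fields, and the $1,000/month threshold is the rule of thumb above, not a law:

```python
def placement(privacy_bound: bool, predictable: bool,
              cloud_monthly_usd: float, open_weight_ok: bool) -> str:
    """Apply the decision rule: a hard privacy constraint forces local;
    otherwise go local only when all three conditions hold; default cloud."""
    if privacy_bound:
        return "local"
    if predictable and cloud_monthly_usd >= 1000 and open_weight_ok:
        return "local"
    return "cloud"

print(placement(False, True, 1500, True))   # local
print(placement(False, False, 5000, True))  # cloud -- spiky load stays on the elastic tier
```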

Side by side

Local vs cloud across the dimensions that matter

  • Best model quality available — Local: open-weight frontier (Llama 4, Qwen 3, DeepSeek-V3), strong but lags the proprietary frontier on prose and reasoning. Cloud: proprietary frontier (Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro), current best across most tasks.
  • Time to first working call — Local: hours to days (install runtime, download model, configure). Cloud: minutes (sign up, get API key, send request).
  • Per-call cost — Local: approximately electricity, fractions of a cent per call on a consumer GPU. Cloud: vendor pricing, $0.0001 to $0.05 per call typical.
  • Up-front capital — Local: $1,500–$10,000 for a workstation; $20k–$100k+ for a server-grade GPU. Cloud: $0.
  • Operational burden — Local: real (patching, monitoring, hardware failure, model updates). Cloud: none on your side.
  • Scaling to 10x volume — Local: buy more hardware, queue management, capacity planning. Cloud: update a credit card.
  • Data privacy posture — Local: strongest, nothing leaves your infrastructure. Cloud: vendor-dependent (see AI privacy explainer).
  • Latency — Local: predictable, depends on your hardware; usually faster for small models. Cloud: variable, depends on vendor load and your network; usually faster for large models.
  • Internet dependency — Local: none for inference. Cloud: required.
  • Best for — Local: privacy-bound work, high-volume predictable workloads, sub-tasks (classification, embedding) where model quality is sufficient. Cloud: new projects, frontier-model needs, variable load, anything you'd struggle to predict.
The numbers

The break-even math, in May 2026

  • RTX 4090 workstation (consumer GPU, 24GB VRAM) · $1,500–$2,500 fully built
  • Electricity cost under load (US average) · ~$25/month at $0.12/kWh, continuous use
  • Electricity cost under load (EU average) · ~$50–$60/month at $0.25–$0.30/kWh
  • Break-even vs $200/month API bill · ~8 months on consumer hardware
  • Break-even vs frontier-model cloud API at medium volume (3–5M tokens/day) · 3–8 months depending on which model is being replaced
  • Break-even vs cheapest open-weight hosted APIs (Together / Groq / Fireworks) · much longer, often 24–36 months at 15–20M tokens/day
  • Quality gap, open-weight frontier vs proprietary frontier · closed on factual / reasoning benchmarks; modest gap remains on prose, edge cases, multimodal
  • Real-world hybrid result (fintech case study, 2025) · bill cut from $47k to $8k/month (83%); predictable workloads moved to self-hosted, frontier kept on cloud
  • Server-grade GPU (NVIDIA H100, for higher-volume workloads) · $25k–$40k street; $2–$4/hour on rental
  • Apple Silicon as a serious option (M3 Ultra, M4 Max with 64GB+ unified memory) · $3k–$8k; runs Llama 70B-class models comfortably

The key insight in the numbers: break-even is fast against the most expensive cloud option (proprietary frontier on a flagship model), and slow against the cheapest (open-weight hosted via Together / Groq / Fireworks). Many teams that move workloads local discover they could have used a hosted open-weight model and skipped the operational burden. Check the cheaper hosted alternative before buying hardware.
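The break-even figures above are simple division: hardware cost over net monthly savings (the cloud bill minus electricity). A sketch using the consumer-hardware numbers from the table:

```python
def breakeven_months(hardware_usd: float, cloud_monthly_usd: float,
                     electricity_monthly_usd: float) -> float:
    """Months until the hardware pays for itself; inf if it never does."""
    net_savings = cloud_monthly_usd - electricity_monthly_usd
    if net_savings <= 0:
        return float("inf")
    return hardware_usd / net_savings

# $1,500 RTX 4090 build replacing a $200/month API bill, ~$25/month electricity
print(round(breakeven_months(1500, 200, 25), 1))  # 8.6 months
```

Run the same arithmetic against a cheap hosted open-weight price before buying anything; the net savings shrink and the break-even stretches accordingly.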

The hybrid pattern

What most teams actually land on

The modal architecture in 2026 for serious AI deployments is not “all cloud” or “all local” — it’s hybrid, with a routing layer that sends each request to the right tier:

  • Frontier-model needs (complex reasoning, novel tasks, anything user-facing where quality matters most) → cloud API to Claude Opus 4.7 / GPT-5.5 / Gemini 3.1 Pro.
  • Predictable, high-volume workloads (classification, embedding, structured extraction on known document types) → self-hosted open-weight model on your hardware.
  • Privacy-bound subset (regulated data, NDAs) → self-hosted local, regardless of cost-efficiency.
  • Variable / overflow load → cloud API as the elastic tier; local handles steady-state.
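A routing layer like this can start very small. A sketch under loud assumptions — the workload names, the `regulated` flag, and the queue-depth check are hypothetical, not any real framework's API:

```python
# Workload types assumed to be well served by a self-hosted open-weight model
STEADY_STATE = {"classify", "embed", "extract", "summarise_standard"}

def route(workload: str, regulated: bool, local_queue_depth: int,
          queue_limit: int = 100) -> str:
    """Pick a tier per the hybrid pattern: privacy first, then steady-state
    work on local hardware, overflow and everything else to the cloud."""
    if regulated:
        return "local"  # privacy-bound: local regardless of cost-efficiency
    if workload in STEADY_STATE and local_queue_depth < queue_limit:
        return "local"  # predictable high-volume work on your own hardware
    return "cloud"      # frontier quality, novel tasks, or overflow

print(route("classify", False, 10))    # local
print(route("classify", False, 500))   # cloud -- overflow spills to the elastic tier
print(route("draft_memo", True, 0))    # local -- regulated data never leaves
```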

The fintech case study from 2025 (cited in the stats above) is illustrative: by moving the predictable steady-state work — classification of incoming emails, summarisation of standard reports, embedding of internal docs — onto self-hosted infrastructure, the team cut 83% of their cloud bill. The remaining 17% kept access to frontier models for the work where quality mattered most.

Building this routing layer is non-trivial — you’re effectively running a small ML platform — but for any team where the AI bill has crossed ~$10k/month, the engineering cost pays back inside two quarters.

When local is wrong

The argument against premature self-hosting

Three patterns where teams build self-hosted AI when they shouldn’t have:

  • Self-hosting for cost savings on volume that doesn’t exist yet. A 50-call-per-day workload doesn’t justify a workstation. Build on cloud, validate that the workload is real and will scale, then migrate. The migration cost is small relative to the cost of running unused hardware.
  • Self-hosting because “models will commoditise.” They will, but the operational gap won’t. Running inference reliably at scale is a real engineering function — model availability isn’t the constraint. If you’re not already running ML infrastructure, adding it is a meaningful step up in operational complexity.
  • Self-hosting to “own our AI.” Owning the inference doesn’t make the AI yours. Owning the data, the prompts, the evaluation harness, and the product wrapped around them is what creates lock-in resistance. Self-host where it matches the workload; rely on cloud where it matches better.

Common questions

FAQ

What's the simplest local setup if I want to try this?

Ollama on a Mac with 32GB+ unified memory, or a Windows/Linux box with an RTX 4090. It installs in five minutes; pull a Llama 4 Scout or Qwen 3 model and run it from the command line. You'll have a working local LLM in under an hour. Use it for non-critical work to calibrate model quality before moving anything important.
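Once Ollama is running, it exposes a local HTTP API (default port 11434) that any language can call. A minimal Python sketch — the model name is whatever you pulled with `ollama pull`, and this assumes `ollama serve` is running locally:

```python
import json
import urllib.request

def build_payload(model: str, prompt: str) -> dict:
    # Non-streaming request body for Ollama's /api/generate endpoint
    return {"model": model, "prompt": prompt, "stream": False}

def ollama_generate(prompt: str, model: str = "qwen3",
                    host: str = "http://localhost:11434") -> str:
    """Send one prompt to a local Ollama server and return the text reply."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Nothing in this path touches the internet; the request, the model, and the response all stay on your machine.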

How does open-weight model quality actually compare to proprietary frontier?

Open-weight models have closed much of the gap on factual recall, mathematical reasoning, and code generation — modern Llama 4, Qwen 3, and DeepSeek-V3 are competitive with the proprietary mid-tier on these benchmarks. A modest gap remains on prose quality, multilingual nuance outside top-tier languages, and edge-case reasoning. For automated pipelines (classification, extraction, summarisation), open-weight is usually fine. For user-facing creative or analytic work, the proprietary frontier still wins.

Should I rent a cloud GPU instead of buying one?

Often yes — H100 rental is $2–$4/hour on multiple providers (RunPod, Lambda, Vast.ai, Hyperstack), and you only pay for actual usage. For workloads that are bursty or experimental, rental beats ownership. Buying makes sense when usage is sustained, electricity is cheap, and you have somewhere reliable to run the hardware.
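The rent-vs-buy crossover is another one-line division — a sketch that deliberately ignores electricity, depreciation, and resale value:

```python
def rent_vs_buy_gpu_hours(hardware_usd: float, rental_usd_per_hour: float) -> float:
    """GPU-hours of use at which buying becomes cheaper than renting."""
    return hardware_usd / rental_usd_per_hour

# $30k H100 purchase vs $3/hour rental
hours = rent_vs_buy_gpu_hours(30_000, 3.0)
print(hours)       # 10000.0 GPU-hours
print(hours / 24)  # ~417 days of 24/7 use before ownership wins
```

If your workload runs a few hours a day, that crossover is years away — which is why rental wins for bursty and experimental use.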

What about Mac Studio / Apple Silicon for serious local AI?

An M3 Ultra or M4 Max with 64GB+ unified memory runs Llama-70B-class models comfortably and is the quietest, lowest-electricity option for sustained inference at home or in a small office. Performance per watt is excellent. The trade-off is throughput on very high concurrent loads, where NVIDIA still wins. For a single-developer or small-team setup, Apple Silicon is often the right call.

Does running AI locally actually help with privacy?

Yes, materially — but only if your stack is fully local. Local model + cloud embedding API still sends your data to the cloud at the embedding step. For full data privacy, run open-weight embedding models (e.g. bge-large-en-v1.5, nomic-embed-text) locally as well. The combined effect is end-to-end on-prem; the cost is more setup and slightly weaker embedding quality.
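Whichever local embedding model you run, the downstream similarity search is model-agnostic and stays on your machine too. A sketch with toy two-dimensional vectors (a real embedding model produces hundreds of dimensions, but the math is identical):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```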

Is the answer different for a startup vs an established company?

Yes. Startups should default cloud almost universally — operational simplicity matters more than infrastructure cost when you're 5 people; hardware capital is wrong for an organisation that pivots quarterly; the team's time should be on the product, not on inference reliability. Established companies with predictable workloads and existing infrastructure teams have more reasons to consider hybrid earlier — the operational cost is amortised, the privacy posture often matters more, and the engineering function exists to maintain it.

Change history (1 entry)
  • 2026-05-10 Initial publication.