Cyberax AI Playbook
cyberax.com
How-to · Communications & Customer Work · Local-OK

Transcribe audio at scale on a local machine

A self-hosted transcription pipeline that turns audio into text on your own GPU — no cloud API, no per-minute charges, no audio leaving the machine. With sizing notes for hardware, batching strategy, and the parts that bite teams in production.

Updated · Apr 22, 2026
Whisper large-v3 is now the recommended model for English content, replacing the older medium.en variant. The quality bump is meaningful enough to justify the slightly higher VRAM.
At a glance · Last verified May 2026
Problem solved Transcribe audio on your own hardware without sending it to a cloud API — for privacy, cost, or both
Best for Privacy-sensitive teams, high-volume transcription workloads, cost-bound operations
Tools Whisper.cpp, ffmpeg, Python
Hardware NVIDIA GPU 8GB+ or 32GB Apple Silicon
Difficulty Intermediate
Cost $0 once hardware is owned
Time to set up ~3 hours
Throughput ~6× real-time on RTX 4090
When to use

Where this solution fits — and where it doesn't

Use this if you’re transcribing meaningful volumes of audio (interviews, podcasts, internal recordings, customer calls) and you’d rather not stream every minute of it to a third-party API. The economics tip toward local around 30 hours of audio per week, but privacy concerns can make local the right call at any volume.

Don’t use this if you have one-off transcription needs (use OpenAI’s gpt-4o-mini-transcribe at $0.003/min, gpt-4o-transcribe at $0.006/min, or Deepgram, and skip the setup), or if your team can’t keep a Linux box running reliably.

Prerequisites

What you'll need before starting

  • An NVIDIA GPU with 8GB+ VRAM, or an Apple Silicon Mac with 32GB+ unified memory.
  • Linux or macOS (Windows works via WSL2 but adds friction).
  • ~10GB disk space for model weights.
  • Comfort installing dependencies via brew / apt and running shell scripts.
The solution

Five steps to a working pipeline

  1. Install Whisper.cpp and download a model

    Build whisper.cpp from source — it’s the C++ port that runs noticeably faster than the original Python implementation. Pull the large-v3-turbo weights for faster inference at near-equivalent quality, or large-v3 for the highest accuracy. Both are on the project’s release page.
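If you'd rather script the model fetch than click through the release page, a minimal Python sketch could look like the following. It assumes the ggml weights mirrored under the project's Hugging Face repo and a models/ directory next to your build (the repo also ships a download-ggml-model.sh script that does the same job):

```python
import urllib.request
from pathlib import Path

# Assumed location -- adjust to wherever you cloned and built whisper.cpp.
MODEL_DIR = Path("whisper.cpp/models")

def model_url(name: str) -> str:
    """ggml weights for whisper.cpp are mirrored on Hugging Face."""
    return f"https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-{name}.bin"

def ensure_model(name: str = "large-v3-turbo") -> Path:
    """Download the named model once; later calls are a no-op."""
    dest = MODEL_DIR / f"ggml-{name}.bin"
    if not dest.exists():
        MODEL_DIR.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(model_url(name), dest)
    return dest
```

Swap "large-v3-turbo" for "large-v3" if you want the highest-accuracy weights instead.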

  2. Pre-process audio with ffmpeg

    Convert input audio to 16kHz mono WAV before passing to the model. The model expects this format, and skipping the conversion silently degrades quality.
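A small Python wrapper around ffmpeg covers this step. The flag set below (-ar 16000, -ac 1, pcm_s16le) is the standard recipe for 16 kHz mono 16-bit WAV; the function names are ours, not part of any library:

```python
import subprocess
from pathlib import Path

def ffmpeg_cmd(src: Path, dst: Path) -> list[str]:
    """Build the ffmpeg invocation for 16 kHz mono 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y", "-i", str(src),
        "-ar", "16000",       # 16 kHz sample rate, as the model expects
        "-ac", "1",           # downmix to mono
        "-c:a", "pcm_s16le",  # 16-bit signed PCM
        str(dst),
    ]

def to_wav16k(src: Path, dst_dir: Path) -> Path:
    """Convert any input ffmpeg can read into model-ready WAV."""
    dst = dst_dir / (src.stem + ".wav")
    subprocess.run(ffmpeg_cmd(src, dst), check=True, capture_output=True)
    return dst
```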

  3. Run a benchmark on representative audio

    Before you wire this into production, transcribe a representative sample of your actual audio. WER (word error rate) on benchmark datasets does not predict your specific use case.
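You don't need a library for the benchmark itself: a plain word-level edit distance gives you WER against your own reference transcripts. A self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(
                prev[j] + 1,              # deletion
                cur[j - 1] + 1,           # insertion
                prev[j - 1] + (r != h),   # substitution (free if words match)
            )
        prev = cur
    return prev[-1] / len(ref)
```

Run it over a few hours of your real audio with hand-checked references; that number, not a leaderboard score, tells you whether the pipeline is good enough.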

  4. Batch and queue — don’t process one file at a time

    Throughput on a single GPU is dominated by model load time and warm-up. Process files in batches of 5–10 to amortise that cost.
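One way to amortise the load cost is to hand whisper.cpp several files per invocation, since its CLI accepts multiple input files. The binary and model paths below are assumptions from step 1; adjust them to your build (older builds name the binary main rather than whisper-cli):

```python
import subprocess
from pathlib import Path

def chunked(items: list, size: int) -> list:
    """Split a list into successive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def transcribe_batches(wavs: list[Path],
                       model: str = "models/ggml-large-v3-turbo.bin",
                       batch_size: int = 8) -> None:
    # Several files per run means the model is loaded once per batch,
    # not once per file.
    for batch in chunked(wavs, batch_size):
        subprocess.run(
            ["./build/bin/whisper-cli", "-m", model, "-otxt",
             *[str(p) for p in batch]],
            check=True,
        )
```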

  5. Plan for the long tail

    Some audio files will fail or produce garbage output. Speakers with thick accents, music behind speech, two people talking over each other — Whisper has known weaknesses. Build a “needs human review” path from day one.
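The review path can start as a couple of cheap heuristics. Whisper failure modes often show up as sparse output or a looping phrase, so flagging on words-per-minute and repetition catches a lot; the thresholds below are illustrative defaults, not tuned values:

```python
def needs_review(transcript: str, audio_minutes: float,
                 min_words_per_min: float = 40.0,
                 max_repeat_ratio: float = 0.5) -> bool:
    """Cheap heuristics for routing suspect transcripts to a human."""
    words = transcript.split()
    if not words:
        return True  # empty output on non-empty audio is always suspect
    # Sparse output: far fewer words than any real speech would produce.
    if len(words) / max(audio_minutes, 0.1) < min_words_per_min:
        return True
    # Looping output: one token dominating the transcript.
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) > max_repeat_ratio
```

Anything flagged goes to the human-review queue; everything else ships automatically.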

Cost & constraints

What this actually costs to run

The hardware is the cost. Once owned, transcription is essentially free.

Hardware investment $1,800–$2,500 (RTX 4090 + workstation)
Equivalent cloud cost $0.003–$0.006/min (gpt-4o-mini-transcribe / gpt-4o-transcribe) · ~$108–$216/mo at 600hr/mo
Power draw under load ~350W (≈ $25/mo continuous)
Break-even vs cloud ~9 months at 600hr/mo audio
Practical throughput ~140 audio-hours per machine-day (24h at ~6× real-time)
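The break-even figure follows from simple arithmetic on the numbers above; parameterised, it looks like this (the ~9-month figure corresponds to the low end of the hardware range at gpt-4o-transcribe pricing):

```python
def breakeven_months(hardware_usd: float, audio_hours_per_month: float,
                     cloud_usd_per_min: float = 0.006,
                     power_usd_per_month: float = 25.0) -> float:
    """Months until owned hardware beats per-minute cloud pricing.

    Assumes monthly cloud spend exceeds the local power cost; otherwise
    cloud is simply cheaper and there is no break-even point.
    """
    cloud_monthly = audio_hours_per_month * 60 * cloud_usd_per_min
    return hardware_usd / (cloud_monthly - power_usd_per_month)
```

At 600 hr/mo the cloud bill is $216, local power is $25, so a $1,800 machine pays for itself in roughly nine and a half months; at $2,500 it is closer to thirteen.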
Alternatives

Other ways to solve this

OpenAI's transcription API (gpt-4o-mini-transcribe at $0.003/min or gpt-4o-transcribe at $0.006/min) is the right answer for low-volume teams. No setup, no hardware, predictable cost. Switch to local when you cross ~30hr/week.

Deepgram is faster and offers diarisation (speaker separation) out of the box. Worth it if speaker labels matter.

Cloud-hosted Whisper on Modal, Replicate, or Fal — middle ground between API and self-hosting at $0.50–$1.50 per hour of audio.

Common questions

FAQ

Will Apple Silicon really run this well?

The M2 Pro with 32GB and the M3 Max with 64GB+ run large-v3 at roughly half the speed of an RTX 4090 — fast enough for most teams. Smaller M-series chips will struggle with longer files.

What about non-English audio?

Whisper handles 90+ languages, with very different quality levels. English, Spanish, French, German, and Mandarin are excellent. Other languages range from good to weak. Test on your specific language before committing.

How does this compare to ElevenLabs Scribe or AssemblyAI?

Closer than you'd expect on accuracy for English; Whisper local pulls ahead on languages outside the top tier. Both commercial alternatives offer better diarisation than self-hosted Whisper today.

Sources & references

Change history (3 entries)
  • 2026-04-22 Updated to recommend Whisper large-v3 over medium.en for English. Throughput numbers updated for new model.
  • 2026-02-08 Added Apple Silicon sizing notes; clarified break-even math.
  • 2025-11-30 Initial publication.