Cyberax AI Playbook
cyberax.com
Comparison · Tool Decisions

Whisper API vs Deepgram vs AssemblyAI

Three transcription APIs that look interchangeable on the marketing page and diverge sharply on what they're actually good at. Where each one wins, where each one quietly loses, and the decision rule that fits your audio.

At a glance · Last verified May 2026
Problem solved Pick a transcription API for production audio workloads — comparing accuracy, diarisation, real-time, languages, pricing, and the audio-intelligence features that increasingly differentiate the category
Best for Product teams shipping transcription features, ops teams building voice pipelines, anyone choosing between OpenAI Whisper API, Deepgram, and AssemblyAI
Tools OpenAI Whisper API, Deepgram, AssemblyAI
Difficulty Intermediate
Cost $0.003–$0.006/min (OpenAI Whisper API) · $0.0043–$0.0077/min (Deepgram Nova-3 tiers) · $0.002–$0.0062/min (AssemblyAI Universal-2, slim to full audio-intelligence bundle)

A transcription API takes audio in and returns text out. The API (application programming interface — the way one piece of software calls another) is the same shape across vendors. The output isn’t. OpenAI’s Whisper API, Deepgram, and AssemblyAI are the three that show up on every shortlist in 2026. All three claim “high accuracy,” “real-time streaming,” and “100+ languages.” They diverge sharply once you put real audio through them.

The differences that actually matter: diarisation quality (who said what), performance on accented speech, pricing at scale, and the audio-intelligence add-ons (sentiment, topics, PII redaction, summarisation) that increasingly differentiate the category.

This piece walks through the side-by-side comparison plus the decision rules for which API fits which audio workload. Snapshot is current as of May 2026; the category moves quickly — see the change log for the freshness check.

Side by side

The comparison matrix

OpenAI Whisper API · Deepgram (Nova-3) · AssemblyAI (Universal-2)

Architecture
  • OpenAI: Transformer-based, descended from Whisper; closed weights via API. Two tiers — gpt-4o-transcribe and gpt-4o-mini-transcribe.
  • Deepgram: Proprietary end-to-end speech model; ASR-specialist company. Closed weights.
  • AssemblyAI: Proprietary multilingual speech model; ASR-specialist company. Closed weights.

English accuracy on clean audio (WER, lower is better)
  • OpenAI: ~6–8% WER on clean English benchmarks; among the strongest published numbers.
  • Deepgram: ~6–10% WER on clean English; strong, close to OpenAI on most benchmarks.
  • AssemblyAI: ~7–10% WER on clean English; close to the others under the same conditions.

Accuracy on accented or noisy speech
  • OpenAI: Strong on accented English; the broad Whisper pre-training shows.
  • Deepgram: Strong; Deepgram has invested heavily in noisy-environment performance.
  • AssemblyAI: Strong; less audited externally, but competitive in published benchmarks.

Speaker diarisation
  • OpenAI: Limited — no built-in diarisation on the API; bring your own pipeline (pyannote, NeMo).
  • Deepgram: Strong — built-in diarisation; high accuracy on 2–4 speakers, degrades past 6.
  • AssemblyAI: Strong — built-in diarisation; comparable to Deepgram on typical workloads.

Real-time / streaming transcription
  • OpenAI: Streaming supported on newer gpt-4o-transcribe; latency tuned for interactive use.
  • Deepgram: First-class streaming with low latency; a long-time strength of the platform.
  • AssemblyAI: Streaming supported; latency competitive but not always the strongest.

Languages supported
  • OpenAI: 99 languages with varying quality; English, Spanish, French, German, Mandarin, and Japanese strongest.
  • Deepgram: ~36+ languages on Nova-3; a smaller set than OpenAI but a higher quality bar per language.
  • AssemblyAI: 99+ languages on Universal-2; broad coverage matching OpenAI.

Custom vocabulary / term boost
  • OpenAI: Hint-prompts at request time; less granular than competitors.
  • Deepgram: Strong — keyword boosting and custom-vocabulary support documented and proven.
  • AssemblyAI: Strong — custom vocabulary and word-boost features competitive with Deepgram.

Audio-intelligence add-ons (sentiment, topics, summarisation, redaction)
  • OpenAI: Not bundled — pair with a separate LLM call for these.
  • Deepgram: Some bundled (intent detection, sentiment); narrower than AssemblyAI.
  • AssemblyAI: Strongest — sentiment, topics, content safety, summarisation, entity detection, PII redaction all bundled.

PII redaction
  • OpenAI: Not bundled; bring your own redaction layer.
  • Deepgram: Available as a feature flag.
  • AssemblyAI: Available — one of the differentiators on regulated workloads.

Pricing — entry tier per minute
  • OpenAI: $0.003/min (gpt-4o-mini-transcribe) · $0.006/min (gpt-4o-transcribe).
  • Deepgram: $0.0043/min (Nova-3 pre-recorded) · $0.0077/min (Nova-3 streaming).
  • AssemblyAI: $0.002/min (Slim Universal-2) · $0.0062/min (Universal-2 + audio intelligence).

Pricing — high-volume / enterprise
  • OpenAI: Volume discounts available via enterprise agreement; less transparent posted tiers.
  • Deepgram: Posted volume tiers; competitive at scale; self-hosted option for very high volume.
  • AssemblyAI: Posted volume tiers; competitive on bundled audio-intelligence workloads.

Data residency / compliance
  • OpenAI: US-based; SOC 2 + HIPAA available on enterprise tier; API data excluded from training by default.
  • Deepgram: US-based; SOC 2 + HIPAA; on-prem deployment available for regulated industries.
  • AssemblyAI: US-based; SOC 2 + HIPAA; enterprise-tier compliance posture.

API maturity / ecosystem
  • OpenAI: Excellent — broad SDK coverage, large community, well documented.
  • Deepgram: Excellent — long-time ASR specialist with mature SDKs and integrations.
  • AssemblyAI: Excellent — SDKs in major languages, strong docs, growing ecosystem.

Free credit on signup
  • OpenAI: No free tier; pay-as-you-go from the first second.
  • Deepgram: $200 free credit on signup; one-time.
  • AssemblyAI: $50 free credit on signup; one-time.
The decision

What to actually use

For pure transcription of clean English audio at low to moderate volume — OpenAI Whisper API. The gpt-4o-mini-transcribe tier at $0.003/min is the most cost-efficient flagship-quality option, the accuracy is among the best, and the integration is straightforward. The trade-off: no built-in diarisation and no audio-intelligence bundle. If you don’t need those, OpenAI is the simplest path.
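The "simplest path" claim is easy to verify: transcription on OpenAI is a single file-upload call. A minimal sketch, assuming the `requests` library and an `OPENAI_API_KEY` environment variable; the endpoint path and model names match those quoted above, but verify field names against the current API reference before shipping.

```python
# Minimal sketch of one-call transcription against OpenAI's
# /v1/audio/transcriptions endpoint. Model names are the tiers from the
# comparison matrix; treat exact field names as assumptions to verify.
import os
import requests

API_URL = "https://api.openai.com/v1/audio/transcriptions"

def build_request(model: str = "gpt-4o-mini-transcribe") -> dict:
    """Assemble headers and form fields for the transcription call."""
    return {
        "headers": {"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}"},
        "data": {"model": model, "response_format": "json"},
    }

def transcribe(audio_path: str, model: str = "gpt-4o-mini-transcribe") -> str:
    """Upload one audio file and return the transcript text."""
    req = build_request(model)
    with open(audio_path, "rb") as f:
        resp = requests.post(API_URL, files={"file": f}, **req)
    resp.raise_for_status()
    return resp.json()["text"]
```

No diarisation flags, no analytics toggles — the entire surface is the model choice.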

For multi-speaker audio (interviews, meetings, customer calls) where diarisation matters — Deepgram or AssemblyAI. OpenAI’s lack of native diarisation means you’d build a separate pipeline (pyannote.audio, NeMo); both Deepgram and AssemblyAI ship diarisation as a flag in the API. Deepgram has a slight edge on diarisation latency for real-time use; AssemblyAI has a slight edge on diarisation quality for very long recordings. Test on a representative sample before committing.
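What "diarisation as a flag" looks like in practice, sketched in Deepgram's query-parameter style (`diarize=true` on `/v1/listen` is documented): request word-level output with speaker indices, then fold consecutive same-speaker words into turns. The response shape below is an assumption based on Deepgram's word-level output and should be checked against current docs.

```python
# Sketch: diarised transcription via a query flag, then collapsing
# word-level output into speaker turns. Response shape is illustrative.
from itertools import groupby

# diarize=true turns on built-in diarisation; no separate pipeline needed.
DEEPGRAM_URL = "https://api.deepgram.com/v1/listen?model=nova-3&diarize=true"

def speaker_turns(words: list) -> list:
    """Collapse [{'word': ..., 'speaker': ...}, ...] into (speaker, text) turns."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        turns.append((speaker, " ".join(w["word"] for w in group)))
    return turns

# Toy word-level output in the assumed shape:
words = [
    {"word": "hello", "speaker": 0},
    {"word": "there", "speaker": 0},
    {"word": "hi", "speaker": 1},
]
print(speaker_turns(words))  # [(0, 'hello there'), (1, 'hi')]
```

AssemblyAI's equivalent is a request-body flag (`speaker_labels`) rather than a query parameter, but the shape of the work is the same: one flag, no extra pipeline.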

For audio-analytics workloads (sentiment, topics, summarisation, PII redaction) bundled with transcription — AssemblyAI. The audio-intelligence bundle is the broadest in the category, and bundling means one API call instead of a transcription call plus a separate LLM call for analytics. The savings on engineering integration time are real; the per-minute cost premium is modest at typical volumes. For regulated workloads where PII redaction is mandatory, AssemblyAI is often the fastest path to compliance.
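The one-API-call claim concretely, as a hedged sketch of an AssemblyAI transcript request body. The flag names follow AssemblyAI's documented parameters (`speaker_labels`, `sentiment_analysis`, `summarization`, `redact_pii`, `redact_pii_policies`); the example URL is a placeholder, and the exact parameter list should be verified against the current API reference.

```python
# Sketch: transcription plus the audio-intelligence bundle in a single
# request body — no separate LLM call for analytics or redaction.
import json

def build_payload(audio_url: str) -> dict:
    """One request body covering transcription and analytics."""
    return {
        "audio_url": audio_url,
        "speaker_labels": True,        # built-in diarisation
        "sentiment_analysis": True,    # per-sentence sentiment
        "summarization": True,         # bundled summary
        "redact_pii": True,            # redact PII in the transcript
        "redact_pii_policies": ["person_name", "phone_number", "email_address"],
    }

payload = build_payload("https://example.com/call.mp3")  # placeholder URL
print(json.dumps(payload, indent=2))
```

For regulated workloads, the `redact_pii_policies` list is the part worth studying — it is a whitelist of entity types, so compliance review reduces to auditing one array.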

For low-latency real-time transcription (live captioning, voice agents, real-time meeting transcription) — Deepgram. Long-standing strength of the platform; their streaming latency consistently benchmarks at the low end of the category. OpenAI’s streaming has improved; AssemblyAI’s is competitive. If real-time is the primary requirement, Deepgram is the safe pick.

For multilingual transcription with broad language coverage — OpenAI Whisper API. 99 languages with the Whisper architecture’s broad pre-training advantage; quality holds further down the long tail of languages than Deepgram’s narrower Nova-3 language set. AssemblyAI’s Universal-2 also supports 99+ languages and is competitive — test on your specific languages.

For very high-volume, cost-sensitive workloads (more than 1,000 audio-hours per month) — Deepgram has the best per-minute economics on the Nova-3 base tier and offers self-hosted deployment for the highest-volume customers. AssemblyAI’s Slim tier is also competitive at high volume. At this volume, also consider self-hosting Whisper — see Transcribe audio at scale on a local machine for the break-even math.

The numbers

What you'll actually pay

OpenAI gpt-4o-mini-transcribe $0.003 per minute — cheapest flagship-quality option
OpenAI gpt-4o-transcribe $0.006 per minute — slightly higher accuracy on the same audio
Deepgram Nova-3 pre-recorded $0.0043 per minute base; volume discounts available
Deepgram Nova-3 streaming $0.0077 per minute base; volume discounts available
AssemblyAI Slim (transcription only) $0.002 per minute — cheapest entry tier in the comparison
AssemblyAI Universal-2 + audio intelligence $0.0062 per minute including PII redaction, sentiment, summarisation, etc.
Typical workload cost — 600 audio-hours/month $108 (OpenAI mini) · $155 (Deepgram pre-recorded) · $72–$223 (AssemblyAI, slim to full bundle)
Break-even vs self-hosted Whisper (RTX 4090 + workstation) About 9 months at 600hr/mo audio — see Transcribe locally for full math
Free credit on signup Deepgram: $200 · AssemblyAI: $50 · OpenAI: none
Latency — pre-recorded API OpenAI / Deepgram / AssemblyAI all return within ~1.0–2.5× audio duration for typical workloads
Latency — streaming Deepgram leads at sub-300ms typical; OpenAI and AssemblyAI competitive but slightly higher

The headline pricing is close enough across the three that cost shouldn’t drive the decision below 1,000 hours/month. Above that, the per-minute differences compound — and self-hosting Whisper becomes worth re-evaluating against any of the three.
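The 600-hour figures quoted above fall straight out of the posted per-minute rates; a quick check, using only the entry-tier prices from this piece:

```python
# Reproduce the 600 audio-hour/month workload costs from the table above.
# Per-minute rates are the posted entry tiers quoted in this comparison.
RATES = {
    "openai_mini": 0.003,
    "openai_full": 0.006,
    "deepgram_prerecorded": 0.0043,
    "assemblyai_slim": 0.002,
    "assemblyai_full_bundle": 0.0062,
}

def monthly_cost(hours: float, rate_per_min: float) -> float:
    """Hours of audio per month × 60 × per-minute rate, rounded to cents."""
    return round(hours * 60 * rate_per_min, 2)

for name, rate in RATES.items():
    print(f"{name}: ${monthly_cost(600, rate)}")
# openai_mini works out to $108.0, deepgram_prerecorded to $154.8,
# assemblyai_slim to $72.0, assemblyai_full_bundle to $223.2 — matching
# the $108 / $155 / $72–$223 figures quoted above.
```

Swap in your own monthly hours to see where the per-minute gaps start to matter for your workload.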

What changes between now and the next refresh

Volatility notes

This category moves quickly. Concrete watch-list for the next two quarters:

  • OpenAI’s transcription pricing. OpenAI cut Whisper API pricing materially in late 2025 with the gpt-4o-mini-transcribe launch; further cuts likely as inference economics improve.
  • Deepgram’s Nova series. Nova-3 shipped in 2025; expect a Nova-4 announcement on a roughly annual cadence with accuracy and language improvements.
  • AssemblyAI’s audio-intelligence bundle. The most distinctive feature set is also the one competitors are most likely to replicate; expect Deepgram and OpenAI to bundle more audio-analytics features over 2026.
  • Open-source contenders. Distil-Whisper, Faster-Whisper, and the active research community around Whisper variants mean self-hosting keeps getting cheaper and faster — the break-even point moves left every six months.
  • Real-time / voice-agent integration. The transcription APIs are increasingly bundled with voice-agent platforms (OpenAI Realtime API, Deepgram Voice Agent). The “transcription-only API” market may not look the same a year from now.

Re-verify every 3–6 months. Pricing and audio-intelligence feature lists drift the fastest.

What's next

Related work

For the self-hosted Whisper alternative at high volume, see Transcribe audio at scale on a local machine. For the specific case of transcribing sales calls and customer interviews, see Voice transcription for sales calls and customer interviews. For drafting replies from the transcribed content, see Draft customer support replies that hold up to scrutiny. For multilingual content where translation matters more than the source transcription, see AI translation services compared.

Common questions

FAQ

Can I trust the WER numbers each vendor publishes?

Vendor-published WER numbers should be treated as upper bounds, not predictions. Each vendor benchmarks on the audio that flatters their model — typically clean, US-accented, single-speaker reads. Real-world WER on your audio is what matters. Run a small sample of your actual content through each candidate and measure manually — 30 minutes of representative audio is enough to spot the meaningful differences.
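Measuring WER yourself needs nothing more than word-level edit distance against a reference transcript you trust. A minimal sketch — WER is (substitutions + insertions + deletions) divided by the reference word count; production tools normalise punctuation and numerals more carefully than this does:

```python
# Minimal word-error-rate check for benchmarking your own audio sample.
# WER = word-level edit distance / reference word count.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Dynamic-programming table for edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # ≈ 0.167 (1 error / 6 words)
```

Run each candidate API's output for the same 30-minute sample through this and the meaningful gaps show up immediately.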

Should I just use OpenAI for everything since I already use their other APIs?

It's a reasonable default for transcription-only workloads on clean audio. It stops being the right default when you need built-in diarisation, audio-intelligence bundling, or PII redaction — those take real engineering work to layer on top of the OpenAI API, and you'd be reinventing what Deepgram or AssemblyAI ship out of the box. Pick based on the shape of your audio, not on existing vendor relationships.

How does this compare to running Whisper locally?

Local Whisper has the best economics above 30 hours/week of audio (see transcribe audio at scale on a local machine for the break-even math). At lower volume, the cloud APIs are simpler and the cost difference is negligible. Local also gives you data control — for privacy-sensitive audio (interviews, legal recordings, customer calls), local is sometimes the only acceptable answer regardless of cost. The Whisper open-weights model is the same architecture behind the OpenAI API, so accuracy is comparable; you trade convenience for control.

What about diarisation accuracy for 5+ speakers?

Built-in diarisation degrades past 4 speakers. Deepgram and AssemblyAI are competitive in the 2–4 range; both drop noticeably above 6. For specialised multi-speaker workloads (board meetings, multi-party interviews), consider a hybrid approach — diarise with a specialised pipeline (pyannote.audio) and transcribe with the cheapest accurate API. The split pipeline often outperforms any single API on hard diarisation cases.
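The glue in that split pipeline is the merge step: diarisation segments (speaker + time range, as pyannote.audio emits) on one side, word timestamps from any transcription API on the other, joined by word midpoint. A sketch with illustrative data shapes — real outputs will need adapting to this form:

```python
# Sketch of the split pipeline's merge step: assign each transcribed word
# to the diarisation segment containing its midpoint. Data shapes here
# are illustrative, not any library's native output format.
def label_words(segments, words):
    """segments: [(start, end, speaker)]; words: [(start, end, text)].
    Returns [(speaker, text)] per word."""
    out = []
    for w_start, w_end, text in words:
        mid = (w_start + w_end) / 2
        speaker = next(
            (spk for s, e, spk in segments if s <= mid < e),
            "unknown",  # word falls in a diarisation gap
        )
        out.append((speaker, text))
    return out

segments = [(0.0, 2.0, "SPEAKER_00"), (2.0, 4.0, "SPEAKER_01")]
words = [(0.1, 0.5, "hello"), (2.2, 2.6, "hi")]
print(label_words(segments, words))
# [('SPEAKER_00', 'hello'), ('SPEAKER_01', 'hi')]
```

Midpoint assignment is the simplest policy; overlap-weighted assignment handles cross-talk better but the sketch above covers the common case.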

Which one is best for non-English languages?

Test on your specific language; don't trust the marketing-page numbers. OpenAI Whisper has the broadest pre-training and tends to win on lower-resource languages outside the top ten. AssemblyAI Universal-2 supports a similar breadth. Deepgram Nova-3's language list is narrower, but quality per supported language tends to be high. For low-resource languages (Yoruba, Khmer, Quechua, etc.), all three drop materially — budget for human review on published content.

Does any of them support speaker identification — knowing who specifically is talking?

Deepgram and AssemblyAI ship speaker diarisation out of the box (separating speakers labelled "speaker 0, 1, 2"), and OpenAI can be paired with a diarisation pipeline — but none of the three does speaker identification, knowing that speaker 0 is specifically "Alex Chen". Identification requires a known voice profile per speaker and an additional matching layer — Deepgram has some support for this; AssemblyAI's identification features are growing; OpenAI doesn't ship it. For workflows that need named-speaker output (sales calls, interviews), the identification layer is usually built on top of whichever transcription API you choose.

Sources & references

Change history (1 entry)
  • 2026-05-13 Initial publication.