A transcription API takes audio in and returns text out. The API (application programming interface — the way one piece of software calls another) is the same shape across vendors. The output isn’t. OpenAI’s Whisper API, Deepgram, and AssemblyAI are the three that show up on every shortlist in 2026. All three claim “high accuracy,” “real-time streaming,” and “100+ languages.” They diverge sharply once you put real audio through them.
The differences that actually matter: diarisation quality (who said what), performance on accented speech, pricing at scale, and the audio-intelligence add-ons (sentiment, topics, PII redaction, summarisation) that increasingly differentiate the category.
This piece walks through the side-by-side comparison plus the decision rules for which API fits which audio workload. Snapshot is current as of May 2026; the category moves quickly — see the change log for the freshness check.
The comparison matrix
| | OpenAI Whisper API | Deepgram (Nova-3) | AssemblyAI (Universal-2) |
|---|---|---|---|
| Architecture | Transformer-based, descended from Whisper; closed weights via API. Two tiers — gpt-4o-transcribe and gpt-4o-mini-transcribe. | Proprietary end-to-end speech model; ASR-specialist company. Closed weights. | Proprietary multilingual speech model; ASR-specialist company. Closed weights. |
| English accuracy on clean audio (WER, lower is better) | ~6–8% WER on clean English benchmarks; among the strongest published numbers | ~6–10% WER on clean English; strong, close to OpenAI on most benchmarks | ~7–10% WER on clean English; close to the others on the same conditions |
| Accuracy on accented or noisy speech | Strong on accented English; the broad Whisper pre-training shows | Strong; Deepgram has invested heavily in noisy-environment performance | Strong; less audited externally, but competitive in published benchmarks |
| Speaker diarisation | Limited — no built-in diarisation on the API; bring your own pipeline (pyannote, NeMo) | Strong — built-in diarisation; high accuracy on 2–4 speakers, degrades past 6 | Strong — built-in diarisation; comparable to Deepgram on typical workloads |
| Real-time / streaming transcription | Streaming supported on newer gpt-4o-transcribe; latency tuned for interactive use | First-class streaming with low latency; long-time strength of the platform | Streaming supported; latency competitive but not always the strongest |
| Languages supported | 99 languages with varying quality; English / Spanish / French / German / Mandarin / Japanese strongest | ~36+ languages on Nova-3; smaller set than OpenAI but higher quality bar per language | 99+ languages on Universal-2; broad coverage matching OpenAI |
| Custom vocabulary / term boost | Hint-prompts at request time; less granular than competitors | Strong — keyword boosting and custom-vocabulary support documented and proven | Strong — custom vocabulary and word-boost features competitive with Deepgram |
| Audio-intelligence add-ons (sentiment, topics, summarisation, redaction) | Not bundled — pair with a separate LLM call for these | Some bundled (intent detection, sentiment); narrower than AssemblyAI | Strongest — sentiment, topics, content safety, summarisation, entity detection, PII redaction all bundled |
| PII redaction | Not bundled; bring your own redaction layer | Available as a feature flag | Available — one of the differentiators on regulated workloads |
| Pricing — entry tier per minute | $0.003/min (gpt-4o-mini-transcribe) · $0.006/min (gpt-4o-transcribe) | $0.0043/min (Nova-3 pre-recorded) · $0.0077/min (Nova-3 streaming) | $0.002/min (Slim Universal-2) · $0.0062/min (Universal-2 + audio intelligence) |
| Pricing — high-volume / enterprise | Volume discounts available via Enterprise agreement; less transparent posted tiers | Posted volume tiers; competitive at scale; self-hosted option for very-high volume | Posted volume tiers; competitive on bundled audio-intelligence workloads |
| Data residency / compliance | US-based; SOC 2 + HIPAA available on enterprise tier; data excluded from training by default on API | US-based; SOC 2 + HIPAA; on-prem deployment available for regulated industries | US-based; SOC 2 + HIPAA; enterprise-tier compliance posture |
| API maturity / ecosystem | Excellent — broad SDK coverage, large community, well-documented | Excellent — long-time ASR specialist with mature SDKs and integrations | Excellent — SDKs in major languages, strong docs, growing ecosystem |
| Free credit on signup | No free tier; pay-as-you-go from first second | $200 free credit on signup; one-time | $50 free credit on signup; one-time |
What to actually use
For pure transcription of clean English audio at low to moderate volume — OpenAI Whisper API. The gpt-4o-mini-transcribe tier at $0.003/min is the most cost-efficient flagship-quality option, the accuracy is among the best, and the integration is straightforward. The trade-off: no built-in diarisation and no audio-intelligence bundle. If you don’t need those, OpenAI is the simplest path.
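If you go this route, the integration really is minimal. A sketch using the openai Python SDK, assuming an `OPENAI_API_KEY` in the environment; the filename is a placeholder, and the model name matches the entry tier in the matrix above:

```python
# Pre-recorded transcription via the OpenAI API.
# pip install openai; assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

with open("meeting.mp3", "rb") as audio_file:  # placeholder filename
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",  # the $0.003/min tier from the matrix
        file=audio_file,
    )

print(transcript.text)
```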
For multi-speaker audio (interviews, meetings, customer calls) where diarisation matters — Deepgram or AssemblyAI. OpenAI’s lack of native diarisation means you’d build a separate pipeline (pyannote.audio, NeMo); both Deepgram and AssemblyAI ship diarisation as a flag in the API. Deepgram has a slight edge on diarisation latency for real-time use; AssemblyAI has a slight edge on diarisation quality for very long recordings. Test on a representative sample before committing.
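To make the "flag in the API" point concrete, here is a sketch against Deepgram's pre-recorded endpoint over plain HTTP. The API key and filename are placeholders, and the response fields follow Deepgram's documented shape; verify against the current docs before relying on them:

```python
# Diarisation as a request parameter on Deepgram's pre-recorded endpoint.
# pip install requests; DEEPGRAM_API_KEY and call.wav are placeholders.
import os
import requests

resp = requests.post(
    "https://api.deepgram.com/v1/listen",
    params={"model": "nova-3", "diarize": "true", "punctuate": "true"},
    headers={
        "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
        "Content-Type": "audio/wav",
    },
    data=open("call.wav", "rb"),
)
resp.raise_for_status()

# With diarize=true, each word carries a speaker index.
words = resp.json()["results"]["channels"][0]["alternatives"][0]["words"]
for w in words:
    print(f"speaker {w.get('speaker')}: {w['word']}")
```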
For audio-analytics workloads (sentiment, topics, summarisation, PII redaction) bundled with transcription — AssemblyAI. The audio-intelligence bundle is the broadest in the category, and bundling means one API call instead of a transcription call plus a separate LLM call for analytics. The savings on engineering integration time are real; the per-minute cost premium is modest at typical volumes. For regulated workloads where PII redaction is mandatory, AssemblyAI is often the fastest path to compliance.
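What "one API call" looks like in practice, sketched with the assemblyai Python SDK. The flag names follow the SDK's `TranscriptionConfig`, but treat the exact set as illustrative and check the current docs; the key and filename are placeholders:

```python
# Transcription plus the audio-intelligence bundle in a single request.
# pip install assemblyai; the API key and filename are placeholders.
import assemblyai as aai

aai.settings.api_key = "YOUR_ASSEMBLYAI_KEY"

config = aai.TranscriptionConfig(
    speaker_labels=True,       # diarisation
    sentiment_analysis=True,   # per-sentence sentiment
    entity_detection=True,     # named entities
    redact_pii=True,           # redact PII in the transcript text
    redact_pii_policies=[
        aai.PIIRedactionPolicy.person_name,
        aai.PIIRedactionPolicy.phone_number,
    ],
)

transcript = aai.Transcriber().transcribe("support-call.mp3", config=config)

print(transcript.text)  # PII already redacted
for s in transcript.sentiment_analysis:
    print(s.sentiment, "->", s.text[:60])
```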
For low-latency real-time transcription (live captioning, voice agents, real-time meeting transcription) — Deepgram. Long-standing strength of the platform; their streaming latency consistently benchmarks at the low end of the category. OpenAI’s streaming has improved; AssemblyAI’s is competitive. If real-time is the primary requirement, Deepgram is the safe pick.
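For a feel of the streaming model, a bare-bones sketch against Deepgram's websocket endpoint using the websockets library. The audio parameters, the pacing, and the `extra_headers` argument (renamed `additional_headers` in newer websockets releases) are assumptions to adapt; real microphone capture is out of scope here:

```python
# Streaming raw 16 kHz PCM to Deepgram's websocket endpoint and
# printing transcripts as they arrive. pip install websockets.
import asyncio
import json
import os

import websockets

URL = ("wss://api.deepgram.com/v1/listen"
       "?model=nova-3&encoding=linear16&sample_rate=16000")

async def stream(pcm_chunks):
    headers = {"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"}
    async with websockets.connect(URL, extra_headers=headers) as ws:

        async def sender():
            for chunk in pcm_chunks:       # raw 16-bit PCM frames
                await ws.send(chunk)
                await asyncio.sleep(0.02)  # pace roughly in real time
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receiver():
            async for msg in ws:
                alt = json.loads(msg).get("channel", {}).get("alternatives", [{}])[0]
                if alt.get("transcript"):
                    print(alt["transcript"])

        await asyncio.gather(sender(), receiver())
```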
For multilingual transcription with broad language coverage — OpenAI Whisper API. 99 languages with the Whisper architecture’s broad pre-training advantage; quality holds further down the long tail of lower-resource languages than Deepgram’s narrower Nova-3 language set. AssemblyAI’s Universal-2 also supports 99+ languages and is competitive — test on your specific languages.
For very high-volume, cost-sensitive workloads (more than 1,000 audio-hours per month) — Deepgram has the best per-minute economics on the Nova-3 base tier and offers self-hosted deployment for the highest-volume customers. AssemblyAI’s Slim tier is also competitive at high volume. At this volume, also consider self-hosting Whisper — see Transcribe audio at scale on a local machine for the break-even math.
What you'll actually pay
The headline pricing is close enough across the three that cost shouldn’t drive the decision below 1,000 hours/month. Above that, the per-minute differences compound — and self-hosting Whisper becomes worth re-evaluating against any of the three.
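A quick back-of-envelope using the entry-tier prices from the matrix above shows why:

```python
# Monthly cost at a given volume, entry-tier prices from the matrix above.
PRICE_PER_MIN = {
    "OpenAI gpt-4o-mini-transcribe": 0.003,
    "OpenAI gpt-4o-transcribe": 0.006,
    "Deepgram Nova-3 (pre-recorded)": 0.0043,
    "AssemblyAI Slim Universal-2": 0.002,
    "AssemblyAI Universal-2 + intelligence": 0.0062,
}

hours_per_month = 1_000  # the threshold where differences start to compound

for name, price in sorted(PRICE_PER_MIN.items(), key=lambda kv: kv[1]):
    print(f"{name:40s} ${hours_per_month * 60 * price:>6,.0f}/month")
```

At 1,000 hours/month the spread runs from about $120 (Slim) to about $372 (Universal-2 with the intelligence bundle): a real gap, but not one that should drive the decision until volume multiplies it.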
Volatility notes
This category moves quickly. Concrete watch-list for the next two quarters:
- OpenAI’s transcription pricing. OpenAI cut Whisper API pricing materially in late 2025 with the gpt-4o-mini-transcribe launch; further cuts likely as inference economics improve.
- Deepgram’s Nova series. Nova-3 shipped in 2025; expect a Nova-4 announcement on a roughly annual cadence with accuracy and language improvements.
- AssemblyAI’s audio-intelligence bundle. The most distinctive feature set is also the one competitors are most likely to replicate; expect Deepgram and OpenAI to bundle more audio-analytics features over 2026.
- Open-source contenders. Distil-Whisper, Faster-Whisper, and the active research community around Whisper variants mean self-hosting keeps getting cheaper and faster — the break-even point moves left every six months.
- Real-time / voice-agent integration. The transcription APIs are increasingly bundled with voice-agent platforms (OpenAI Realtime API, Deepgram Voice Agent). The “transcription-only API” market may not look the same a year from now.
Re-verify every 3–6 months. Pricing and audio-intelligence feature lists drift the fastest.
Related work
For the self-hosted Whisper alternative at high volume, see Transcribe audio at scale on a local machine. For the specific case of transcribing sales calls and customer interviews, see Voice transcription for sales calls and customer interviews. For drafting replies from the transcribed content, see Draft customer support replies that hold up to scrutiny. For multilingual content where translation matters more than the source transcription, see AI translation services compared.
FAQ
Can I trust the WER numbers each vendor publishes?
Vendor-published WER numbers should be treated as upper bounds, not predictions. Each vendor benchmarks on the audio that flatters their model — typically clean, US-accented, single-speaker reads. Real-world WER on your audio is what matters. Run a small sample of your actual content through each candidate and measure manually — 30 minutes of representative audio is enough to spot the meaningful differences.
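Measuring it yourself is a few lines with the jiwer library; the filenames are placeholders, and the reference transcript is one you have hand-corrected against the audio:

```python
# WER on your own sample instead of a vendor benchmark.
# pip install jiwer; both filenames are placeholders, with reference.txt
# a hand-corrected transcript of the same audio the vendor transcribed.
import string
import jiwer

def normalise(text: str) -> str:
    # Strip casing and punctuation so formatting doesn't inflate WER.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

reference = normalise(open("reference.txt").read())
hypothesis = normalise(open("vendor_output.txt").read())

print(f"WER: {jiwer.wer(reference, hypothesis):.1%}")
```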
Should I just use OpenAI for everything since I already use their other APIs?
It's a reasonable default for transcription-only workloads on clean audio. It stops being the right default when you need built-in diarisation, audio-intelligence bundling, or PII redaction — those are real engineering work to layer on top of the OpenAI API, and you'd be reinventing what Deepgram or AssemblyAI ship out of the box. Pick based on the audio shape, not on existing vendor relationships.
How does this compare to running Whisper locally?
Local Whisper has the best economics above 30 hours/week of audio (see Transcribe audio at scale on a local machine for the break-even math). At lower volume, the cloud APIs are simpler and the cost difference is negligible. Local also gives you data control — for privacy-sensitive audio (interviews, legal recordings, customer calls), local is sometimes the only acceptable answer regardless of cost. The open-weights Whisper model shares its lineage with the models behind the OpenAI API, so accuracy is comparable; you trade convenience for control.
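For reference, the local version is a few lines with the openai-whisper package; the checkpoint and filename are placeholders:

```python
# Local, open-weights Whisper: same family as the hosted API.
# pip install openai-whisper; requires ffmpeg on the PATH.
import whisper

model = whisper.load_model("large-v3")  # "base"/"small"/"medium" trade accuracy for speed
result = model.transcribe("interview.mp3")
print(result["text"])
```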
What about diarisation accuracy for 5+ speakers?
All three degrade past 4 speakers. Deepgram and AssemblyAI are competitive in the 2–4 range; both drop noticeably above 6. For specialised multi-speaker workloads (board meetings, multi-party interviews), consider a hybrid approach — diarise with a specialised pipeline (pyannote.audio) and transcribe with the cheapest accurate API. The split pipeline often outperforms any single API on hard diarisation cases.
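A sketch of that hybrid, assuming pyannote.audio with a Hugging Face token (the diarisation model is gated) plus word timestamps from whichever transcription API you picked; the `words` structure below is a hypothetical example of that ASR output:

```python
# Hybrid pipeline: diarise with pyannote.audio, then attribute ASR word
# timestamps to diarisation turns. pip install pyannote.audio.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # gated model: accept its terms on Hugging Face first
)
diarisation = pipeline("board-meeting.wav")  # placeholder filename

# Word timestamps from the transcription API (hypothetical sample).
words = [
    {"word": "welcome", "start": 0.4, "end": 0.9},
    {"word": "everyone", "start": 0.9, "end": 1.4},
]

def speaker_at(t: float) -> str:
    # Find the diarisation turn covering time t.
    for turn, _, speaker in diarisation.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return speaker
    return "unknown"

for w in words:
    midpoint = (w["start"] + w["end"]) / 2
    print(speaker_at(midpoint), w["word"])
```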
Which one is best for non-English languages?
Test on your specific language; don't trust the marketing-page numbers. OpenAI Whisper has the broadest pre-training and tends to win on lower-resource languages outside the top ten. AssemblyAI Universal-2 supports a similar breadth. Deepgram Nova-3's language list is narrower but quality per supported language tends to be high. For low-resource languages (Yoruba, Khmer, Quechua, etc.), all three drop materially — budget for human review on published content.
Does any of them support speaker identification — knowing who specifically is talking?
All three support speaker diarisation (separating speakers labelled "speaker 0, 1, 2"), not speaker identification (knowing that speaker 0 is specifically "Alex Chen"). Identification requires a known voice profile per speaker and an additional matching layer — Deepgram has some support for this; AssemblyAI's identification features are growing; OpenAI doesn't ship it. For workflows that need named-speaker output (sales calls, interviews), the identification layer is usually built on top of whichever transcription API you choose.
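That layer usually reduces to nearest-neighbour matching of speaker embeddings against enrolled voice profiles. A hedged sketch: `embed` is a stand-in for any speaker-embedding model (pyannote, SpeechBrain, Resemblyzer), and the names, enrolment clips, and threshold are hypothetical.

```python
# Speaker identification on top of diarisation: match each diarised
# segment's embedding to the closest enrolled voice profile.
import numpy as np

def embed(wav_path: str) -> np.ndarray:
    # Stand-in: plug in a real speaker-embedding model here
    # (pyannote, SpeechBrain, Resemblyzer, ...).
    raise NotImplementedError

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One enrolment clip per known speaker (hypothetical names and paths).
profiles = {
    "Alex Chen": embed("enrol/alex.wav"),
    "Sam Okafor": embed("enrol/sam.wav"),
}

def identify(segment_wav: str, threshold: float = 0.7) -> str:
    emb = embed(segment_wav)
    name, score = max(
        ((n, cosine(emb, p)) for n, p in profiles.items()),
        key=lambda pair: pair[1],
    )
    return name if score >= threshold else "unknown speaker"
```

A threshold around 0.7 is a starting point, not a constant; tune it on labelled segments from your own audio.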