An embedding is the way an AI represents the meaning of a piece of text as a list of numbers, so similar things can be compared mathematically. The embedding model is what produces those numbers. Pick a good one and your semantic-search, deduplication, classification, and RAG (retrieval-augmented generation — giving AI access to your specific documents) features work; pick a poor one and they don’t.
The embedding model decision quietly shapes every downstream retrieval choice. Once you’ve embedded 10 million documents in OpenAI text-embedding-3-large at 3,072 dimensions, you’re committed. Switching models means re-embedding the corpus and revalidating the vector database (a database that stores embeddings and lets you search by meaning). The “we’ll just pick one and move on” decision is the one that benefits most from honest comparison — the cost of getting it wrong is meaningful, and the marketing pages don’t surface the trade-offs that matter for your specific data.
Onward: retrieval-quality benchmarks, dimension-and-cost math, multilingual capability, self-hosting trade-offs, and the testing pattern that beats “pick the leader on MTEB and move on.”
The comparison matrix
| | OpenAI text-embedding-3-large | Cohere embed-v3 | Voyage AI (voyage-3) | Jina v3 | sentence-transformers (open source) |
|---|---|---|---|---|---|
| Retrieval quality (MTEB benchmarks) | Strong; leading commercial scores at large size | Strong; close to OpenAI on most tasks | Strong; competitive top-tier benchmarks | Good; close behind the top commercial three | Variable by model; top open-source options reach top-commercial level on many tasks |
| Default dimension count | 3,072 (large) / 1,536 (small) — can be reduced via Matryoshka | 1,024 (embed-v3) — fixed | 1,024 (voyage-3) — fixed | 1,024 (jina-embeddings-v3) | Varies: 384 (all-MiniLM-L6-v2), 768 (all-mpnet-base-v2), up to 1,024+ |
| Dimension reduction (Matryoshka) | Yes — reduce to 256 / 512 / 1,024 with modest quality loss | No native; static dimensions | No native | Yes — Matryoshka support for flexible truncation | Some models support it via training |
| Multilingual support | 100+ languages; strong on major languages | 100+ languages; strong on European and Asian languages | 100+ languages; documented strength on multilingual benchmarks | Strong multilingual; designed for it from v3 | Multilingual models exist; quality varies |
| Pricing (per 1M tokens) | $0.13 (3-large) / $0.02 (3-small) | $0.10 | $0.06 (voyage-3) / $0.02 (voyage-3-lite) / $0.18 (voyage-3-large) | $0.10 typical (v3); v4 pricing not disclosed publicly | $0 software; infra-cost only |
| Self-hostable | No | No | No | Some Jina models are open-weights | Yes; the primary positioning |
| Rerank companion API | No native rerank | Yes — Cohere Rerank (separate API) | Yes — voyage-rerank | Yes — Jina Rerank | Open-source rerank models available |
| API maturity | Excellent — broad SDK coverage | Excellent — well-documented | Strong; smaller ecosystem than OpenAI | Strong; growing ecosystem | Hugging Face Transformers + many wrappers |
| Best for | General-purpose; teams already on OpenAI | Search + rerank workflows; multilingual | Long-context (up to 32k input tokens) | Multilingual; cost-conscious commercial | Self-hosted, privacy-sensitive, custom-trained |
| Max input context | 8,192 tokens | 512 tokens (embed-v3); 128,000 tokens (embed-v4 — current flagship) | Up to 32,000 tokens (voyage-3 family) | 8,192 (v3); 32,768 (v4 — current flagship) | Typically 512 tokens; depends on model |
What to actually use
For general-purpose RAG with broad ecosystem support — OpenAI text-embedding-3-large or text-embedding-3-small. The ecosystem is the widest on offer: every major vector store ships native OpenAI integration, and the Matryoshka reduction lets you trade dimensions for cost. Trade-off: not the cheapest at high volume, and no native rerank.
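A minimal sketch of what the Matryoshka reduction looks like in practice, assuming the official openai Python SDK (v1+) and an API key in the environment; the dimensions parameter does the truncation server-side:

```python
# Minimal sketch: OpenAI embeddings with Matryoshka dimension reduction.
# Assumes the official openai Python SDK (v1+) and OPENAI_API_KEY in the env.
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-large",
    input=["How do I rotate my API keys?"],
    dimensions=1024,  # truncate from the native 3,072 to cut storage ~3x
)
vector = response.data[0].embedding
print(len(vector))  # 1024
```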
For semantic-search workflows that benefit from a paired rerank model — Cohere embed-v3 with Cohere Rerank. The two-stage retrieve-then-rerank pattern with both from one vendor is well-documented; quality is competitive with OpenAI. Right for production search systems where rerank materially improves precision.
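A rough sketch of the two-stage pattern, assuming the cohere Python SDK; the model names shown were current at time of writing and may have newer versions:

```python
# Minimal sketch: two-stage retrieve-then-rerank with Cohere.
# Assumes the cohere Python SDK and CO_API_KEY in the env.
import cohere

co = cohere.Client()
docs = [
    "Reset your password from the account page.",
    "Invoices are emailed on the 1st of each month.",
    "API keys can be rotated in the developer console.",
]

# Stage 1: embed documents at index time (vector search itself not shown).
doc_vectors = co.embed(
    texts=docs, model="embed-english-v3.0", input_type="search_document"
).embeddings

# Stage 2: rerank the retrieved candidate set at query time.
reranked = co.rerank(
    model="rerank-english-v3.0",
    query="how do I rotate my API keys?",
    documents=docs,
    top_n=2,
)
for hit in reranked.results:
    print(hit.index, hit.relevance_score)
```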
For long-context input (entire articles or documents per embedding) — Voyage AI’s voyage-3 with 32k input. Most other models cap at 512 or 8,192 tokens; voyage handles much longer inputs as a single embedding, which simplifies architectures that would otherwise need chunking. Right for whole-document retrieval workflows.
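A minimal sketch of whole-document embedding, assuming the voyageai Python SDK and a hypothetical whitepaper.txt that fits inside the 32k window:

```python
# Minimal sketch: embedding a whole document with Voyage AI's long context.
# Assumes the voyageai Python SDK and VOYAGE_API_KEY in the env;
# whitepaper.txt is a hypothetical local file.
import voyageai

vo = voyageai.Client()
long_document = open("whitepaper.txt").read()  # fits in voyage-3's 32k window

result = vo.embed([long_document], model="voyage-3", input_type="document")
vector = result.embeddings[0]  # one vector for the whole document, no chunking
```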
For multilingual workloads at moderate cost — Jina v3 or Cohere embed-v3. Both have documented strength on multilingual benchmarks; Jina trades slight quality for slightly lower cost at high volume. Right when your corpus spans many languages and per-language quality matters.
For privacy-sensitive or self-hosted operation — sentence-transformers (open-source). The all-mpnet-base-v2 and BGE family are competitive on quality with the commercial mid-tier, free software, and self-hostable on modest hardware. Right when data can’t leave your infrastructure (regulated industries, on-prem requirements).
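A minimal self-hosted sketch with sentence-transformers; this runs entirely on your own hardware, so no text leaves your infrastructure:

```python
# Minimal sketch: self-hosted embeddings with sentence-transformers.
# pip install sentence-transformers; runs locally on CPU or GPU.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")  # 768-dim, permissive license

texts = ["contract renewal terms", "data retention policy"]
embeddings = model.encode(texts, normalize_embeddings=True)
print(embeddings.shape)  # (2, 768)
```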
For very-high volume cost optimisation — Either OpenAI text-embedding-3-small (good quality, low cost) or self-hosted open-source. At 100M+ embeddings, the cost differences compound significantly; rerun the math on your actual volume before defaulting.
What you'll actually pay
The per-token cost is small; the dimension-driven storage cost adds up at scale. Choose the dimension that fits your retrieval quality bar — higher isn’t always better, especially after rerank.
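A back-of-envelope sketch of the raw float32 storage math, before any index overhead:

```python
# Back-of-envelope vector storage cost: float32, before index overhead.
def storage_gb(num_vectors: int, dimensions: int, bytes_per_float: int = 4) -> float:
    return num_vectors * dimensions * bytes_per_float / 1e9

# 10M documents:
print(storage_gb(10_000_000, 3072))  # ~122.9 GB at 3-large's native size
print(storage_gb(10_000_000, 1024))  # ~41.0 GB truncated via Matryoshka
print(storage_gb(10_000_000, 768))   # ~30.7 GB with all-mpnet-base-v2
```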
Volatility notes
- New models ship frequently. OpenAI, Cohere, Voyage, and Jina all iterate; the quality leaders shift on benchmarks every few months.
- Open source is catching up. Top open-source models (BGE, GTE, E5) regularly match mid-tier commercial quality on MTEB.
- Specialised embeddings are emerging. Domain-specific models (legal, medical, code) show documented quality lifts on their domains.
Re-verify every 6 months for cost; quality benchmarks shift faster than pricing.
Related work
For the underlying embeddings mental model, see Embeddings explained without math. For the vector-database choice that stores these embeddings, see Vector databases compared. For the broader RAG pattern these power, see RAG explained without acronyms. For the internal-Q&A bot architecture that combines embeddings with generation, see Internal Q&A bot over company docs.
FAQ
How do we test embedding models on our actual data?
Build a test set of ~50 query-result pairs from your domain (queries plus the documents that should match). Embed your corpus with each candidate model; for each query, measure recall@5 and recall@10 against the ground-truth matches. The model that wins on your data may differ from the MTEB leader — your specific domain shape matters more than generic benchmarks.
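A minimal sketch of the recall@k measurement, assuming you have already embedded the queries and corpus with the candidate model into NumPy arrays, and that ground_truth holds the set of matching document indices per query (both names are placeholders for your own harness):

```python
# Minimal sketch: recall@k over a hand-built test set.
# query_vecs: (num_queries, dims); doc_vecs: (num_docs, dims);
# ground_truth: list of sets of matching doc indices, one set per query.
import numpy as np

def recall_at_k(query_vecs, doc_vecs, ground_truth, k=5):
    # Cosine similarity via normalized dot product.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T                            # (num_queries, num_docs)
    top_k = np.argsort(-sims, axis=1)[:, :k]  # best k doc indices per query
    hits = [len(set(top_k[i]) & truth) / len(truth)
            for i, truth in enumerate(ground_truth)]
    return float(np.mean(hits))
```

Run it once per candidate model with k=5 and k=10 and compare the numbers directly.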
Can we mix embedding models in one system?
Generally no — different models produce different vector spaces, and cosine similarity across models is meaningless. The exception: keep separate indexes per model and combine results at a higher level (hybrid retrieval, rerank). Most teams pick one model and commit; switching means re-embedding the corpus.
What about dimension count — is bigger better?
Not always. Higher-dimension embeddings cost more to store and compute against; the quality difference at top sizes is often small. Matryoshka-trained models let you truncate to lower dimensions with modest quality loss. Test on your data — many teams find 512–1,024 dimensions sufficient for production retrieval.
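A minimal sketch of client-side truncation for a Matryoshka-trained model; for ordinary models this degrades quality sharply rather than modestly:

```python
# Minimal sketch: truncating a Matryoshka-trained embedding client-side.
# Only valid for models trained with Matryoshka representation learning.
import numpy as np

def truncate_matryoshka(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the leading `dims` dimensions and re-normalize for cosine search."""
    short = vec[:dims]
    return short / np.linalg.norm(short)
```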
Should we self-host open-source embeddings?
If you have ML infra capacity and care about cost-at-scale or data control, yes. If you don't, the cost of running the infrastructure exceeds the API cost for most workloads. Self-host when you're at 50M+ embeddings or have strict data-residency requirements; use the API otherwise.