An embedding is the way an AI represents the meaning of a piece of text as a list of numbers, so similar things can be compared mathematically. The embedding model is what produces those numbers. Pick a good one and your semantic-search, deduplication, classification, and RAG (retrieval-augmented generation — giving AI access to your specific documents) features work; pick a poor one and they don’t.
The embedding model decision quietly shapes every downstream retrieval choice. Once you’ve embedded 10 million documents in OpenAI text-embedding-3-large at 3,072 dimensions, you’re committed. Switching models means re-embedding the corpus and revalidating the vector database (a database that stores embeddings and lets you search by meaning). The “we’ll just pick one and move on” decision is the one that benefits most from honest comparison — the cost of getting it wrong is meaningful, and the marketing pages don’t surface the trade-offs that matter for your specific data.
Onward: retrieval-quality benchmarks, dimension-and-cost math, multilingual capability, self-hosting trade-offs, and the testing pattern that beats “pick the leader on MTEB and move on.”
The comparison matrix
| | OpenAI text-embedding-3-large | Cohere embed-v3 | Voyage AI (voyage-3) | Jina v3 | sentence-transformers (open source) |
|---|---|---|---|---|---|
| Retrieval quality (MTEB benchmarks) | Strong; leading commercial scores at large size | Strong; close to OpenAI on most tasks | Strong; competitive top-tier benchmarks | Good; close behind the top commercial three | Variable by model; top open-source options reach top-commercial level on many tasks |
| Default dimension count | 3,072 (large) / 1,536 (small) — can be reduced via Matryoshka | 1,024 (embed-v3) — fixed | 1,024 (voyage-3) — fixed | 1,024 (jina-embeddings-v3) | Varies: 384 (all-MiniLM-L6-v2), 768 (all-mpnet-base-v2), up to 1,024+ |
| Dimension reduction (Matryoshka) | Yes — reduce to 256 / 512 / 1,024 with modest quality loss | No native; static dimensions | No native | Yes — Matryoshka support for flexible truncation | Some models support it via training |
| Multilingual support | 100+ languages; strong on major languages | 100+ languages; strong on European and Asian languages | 100+ languages; documented strength on multilingual benchmarks | Strong multilingual; designed for it from v3 | Multilingual models exist; quality varies |
| Pricing (per 1M tokens) | $0.13 (3-large) / $0.02 (3-small) | $0.10 | $0.06 (voyage-3) / $0.02 (voyage-3-lite) / $0.18 (voyage-3-large) | $0.10 typical (v3); v4 pricing not disclosed publicly | $0 software; infra-cost only |
| Self-hostable | No | No | No | Some Jina models are open-weights | Yes; the primary positioning |
| Rerank companion API | No native rerank | Yes — Cohere Rerank (separate API) | Yes — voyage-rerank | Yes — Jina Rerank | Open-source rerank models available |
| API maturity | Excellent — broad SDK coverage | Excellent — well-documented | Strong; smaller ecosystem than OpenAI | Strong; growing ecosystem | Hugging Face Transformers + many wrappers |
| Best for | General-purpose; teams already on OpenAI | Search + rerank workflows; multilingual | Long-context (up to 32k input tokens) | Multilingual; cost-conscious commercial | Self-hosted, privacy-sensitive, custom-trained |
| Max input context | 8,192 tokens | 512 tokens (embed-v3); 128,000 tokens (embed-v4 — current flagship) | Up to 32,000 tokens (voyage-3 family) | 8,192 (v3); 32,768 (v4 — current flagship) | Typically 512 tokens; depends on model |
What to actually use
For general-purpose RAG with broad ecosystem support — OpenAI text-embedding-3-large or text-embedding-3-small. The ecosystem is the widest on offer: every major vector store ships native OpenAI integration, and the Matryoshka reduction lets you trade dimensions for cost. Trade-off: not the cheapest at high volume, and no native rerank.
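A minimal sketch of what the Matryoshka reduction looks like in practice, assuming the official openai Python SDK (v1+) and an API key in the environment; the dimensions parameter does the truncation server-side:

```python
# Minimal sketch: OpenAI embeddings with Matryoshka dimension reduction.
# Assumes the official openai Python SDK (v1+) and OPENAI_API_KEY in the env.
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-large",
    input=["How do I rotate my API keys?"],
    dimensions=1024,  # truncate from the native 3,072 to cut storage ~3x
)
vector = response.data[0].embedding
print(len(vector))  # 1024
```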
For semantic-search workflows that benefit from a paired rerank model — Cohere embed-v3 with Cohere Rerank. The two-stage retrieve-then-rerank pattern with both from one vendor is well-documented; quality is competitive with OpenAI. Right for production search systems where rerank materially improves precision.
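A rough sketch of the two-stage pattern, assuming the cohere Python SDK; the model names shown were current at time of writing and may have newer versions:

```python
# Minimal sketch: two-stage retrieve-then-rerank with Cohere.
# Assumes the cohere Python SDK and CO_API_KEY in the env.
import cohere

co = cohere.Client()
docs = [
    "Reset your password from the account page.",
    "Invoices are emailed on the 1st of each month.",
    "API keys can be rotated in the developer console.",
]

# Stage 1: embed documents at index time (vector search itself not shown).
doc_vectors = co.embed(
    texts=docs, model="embed-english-v3.0", input_type="search_document"
).embeddings

# Stage 2: rerank the retrieved candidate set at query time.
reranked = co.rerank(
    model="rerank-english-v3.0",
    query="how do I rotate my API keys?",
    documents=docs,
    top_n=2,
)
for hit in reranked.results:
    print(hit.index, hit.relevance_score)
```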
For long-context input (entire articles or documents per embedding) — Voyage AI’s voyage-3 with 32k input. Most other models cap at 512 or 8,192 tokens; voyage handles much longer inputs as a single embedding, which simplifies architectures that would otherwise need chunking. Right for whole-document retrieval workflows.
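A minimal sketch of whole-document embedding, assuming the voyageai Python SDK and a hypothetical whitepaper.txt that fits inside the 32k window:

```python
# Minimal sketch: embedding a whole document with Voyage AI's long context.
# Assumes the voyageai Python SDK and VOYAGE_API_KEY in the env;
# whitepaper.txt is a hypothetical local file.
import voyageai

vo = voyageai.Client()
long_document = open("whitepaper.txt").read()  # fits in voyage-3's 32k window

result = vo.embed([long_document], model="voyage-3", input_type="document")
vector = result.embeddings[0]  # one vector for the whole document, no chunking
```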
For multilingual workloads at moderate cost — Jina v3 or Cohere embed-v3. Both have documented strength on multilingual benchmarks; Jina trades slight quality for slightly lower cost at high volume. Right when your corpus spans many languages and per-language quality matters.
For privacy-sensitive or self-hosted operation — sentence-transformers (open-source). The all-mpnet-base-v2 and BGE family are competitive on quality with the commercial mid-tier, free software, and self-hostable on modest hardware. Right when data can’t leave your infrastructure (regulated industries, on-prem requirements).
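A minimal self-hosted sketch with sentence-transformers; this runs entirely on your own hardware, so no text leaves your infrastructure:

```python
# Minimal sketch: self-hosted embeddings with sentence-transformers.
# pip install sentence-transformers; runs locally on CPU or GPU.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")  # 768-dim, permissive license

texts = ["contract renewal terms", "data retention policy"]
embeddings = model.encode(texts, normalize_embeddings=True)
print(embeddings.shape)  # (2, 768)
```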
For very-high volume cost optimisation — Either OpenAI text-embedding-3-small (good quality, low cost) or self-hosted open-source. At 100M+ embeddings, the cost differences compound significantly; rerun the math on your actual volume before defaulting.
What you'll actually pay
The per-token cost is small; the dimension-driven storage cost adds up at scale. Choose the dimension that fits your retrieval quality bar — higher isn’t always better, especially after rerank.
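A back-of-envelope sketch of the raw float32 storage math, before any index overhead:

```python
# Back-of-envelope vector storage cost: float32, before index overhead.
def storage_gb(num_vectors: int, dimensions: int, bytes_per_float: int = 4) -> float:
    return num_vectors * dimensions * bytes_per_float / 1e9

# 10M documents:
print(storage_gb(10_000_000, 3072))  # ~122.9 GB at 3-large's native size
print(storage_gb(10_000_000, 1024))  # ~41.0 GB truncated via Matryoshka
print(storage_gb(10_000_000, 768))   # ~30.7 GB with all-mpnet-base-v2
```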
Volatility notes
- New models ship frequently. OpenAI, Cohere, Voyage, and Jina all iterate; the quality leaders shift on benchmarks every few months.
- Open source is catching up. Top open-source models (BGE, GTE, E5) regularly match mid-tier commercial quality on MTEB.
- Specialised embeddings are emerging. Domain-specific models (legal, medical, code) show documented quality lifts on their domains.
Re-verify every 6 months for cost; quality benchmarks shift faster than pricing.
Related work
For the underlying embeddings mental model, see Embeddings explained without math. For the vector-database choice that stores these embeddings, see Vector databases compared. For the broader RAG pattern these power, see RAG explained without acronyms. For the internal-Q&A bot architecture that combines embeddings with generation, see Internal Q&A bot over company docs.
FAQ
How do we test embedding models on our actual data?
Build a test set of ~50 query-result pairs from your domain (queries plus the documents that should match). Embed your corpus with each candidate model; for each query, measure recall@5 and recall@10 against the ground-truth matches. The model that wins on your data may differ from the MTEB leader — your specific domain shape matters more than generic benchmarks.
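A minimal sketch of the recall@k measurement, assuming you have already embedded the queries and corpus with the candidate model into NumPy arrays, and that ground_truth holds the set of matching document indices per query (both names are placeholders for your own harness):

```python
# Minimal sketch: recall@k over a hand-built test set.
# query_vecs: (num_queries, dims); doc_vecs: (num_docs, dims);
# ground_truth: list of sets of matching doc indices, one set per query.
import numpy as np

def recall_at_k(query_vecs, doc_vecs, ground_truth, k=5):
    # Cosine similarity via normalized dot product.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T                            # (num_queries, num_docs)
    top_k = np.argsort(-sims, axis=1)[:, :k]  # best k doc indices per query
    hits = [len(set(top_k[i]) & truth) / len(truth)
            for i, truth in enumerate(ground_truth)]
    return float(np.mean(hits))
```

Run it once per candidate model with k=5 and k=10 and compare the numbers directly.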
Can we mix embedding models in one system?
Generally no — different models produce different vector spaces, and cosine similarity across models is meaningless. The exception: keep separate indexes per model and combine results at a higher level (hybrid retrieval, rerank). Most teams pick one model and commit; switching means re-embedding the corpus.
What about dimension count — is bigger better?
Not always. Higher-dimension embeddings cost more to store and compute against; the quality difference at top sizes is often small. Matryoshka-trained models let you truncate to lower dimensions with modest quality loss. Test on your data — many teams find 512–1,024 dimensions sufficient for production retrieval.
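A minimal sketch of client-side truncation for a Matryoshka-trained model; for ordinary models this degrades quality sharply rather than modestly:

```python
# Minimal sketch: truncating a Matryoshka-trained embedding client-side.
# Only valid for models trained with Matryoshka representation learning.
import numpy as np

def truncate_matryoshka(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the leading `dims` dimensions and re-normalize for cosine search."""
    short = vec[:dims]
    return short / np.linalg.norm(short)
```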
Should we self-host open-source embeddings?
If you have ML infra capacity and care about cost-at-scale or data control, yes. If you don't, the cost of running the infrastructure exceeds the API cost for most workloads. Self-host when you're at 50M+ embeddings or have strict data-residency requirements; use the API otherwise.