Cyberax AI Playbook
cyberax.com
Explainer · Foundations

Embeddings explained without math

How AI systems turn text into positions on an invisible map of meaning — what they're used for, when they fail, and what you actually need to know to make decisions about them.

At a glance
Last verified · May 2026
Problem solved Understand what embeddings are, what they're used for, and how to make informed decisions about embedding-related infrastructure without learning linear algebra
Best for Operators, founders, ops leads, anyone evaluating AI search / classification / similarity features
Tools OpenAI embeddings, Cohere Embed, BGE-M3, pgvector, Sentence Transformers
Difficulty Beginner

You’ve probably been told your team needs embeddings for some AI feature — semantic search, deduplication, classification, recommendations. The explanations you can find online split cleanly into two unhelpful halves: a hand-wave that skips what an embedding actually is, or a wall of linear algebra that assumes you remember what a dot product is.

The rest of this guide is the middle path. By the end of it you’ll know what an embedding represents, why “similarity becomes a number” is the unlock, what each major embedding model costs, and where embeddings quietly fail in production — without a single formula. If you’re then deciding which model to pick or where to store the things, you’ll have enough scaffolding to follow the more technical conversation.

The metaphor

A position on an invisible map of meaning

Imagine a giant invisible map. Every piece of text you’ve ever written — a sentence, a paragraph, a support ticket, a product description — gets placed somewhere on this map. The placement isn’t random and it isn’t alphabetical. It’s by meaning.

Two sentences that mean roughly the same thing — “my charger isn’t working” and “the power cable seems broken” — land next to each other on the map, even though they share almost no words. A sentence about something entirely different — “please cancel my subscription” — lands in a different neighborhood. The map has thousands of “directions” rather than the two of a paper map, which is what lets meaning be captured at all (one direction for tone, one for topic, one for register, and so on) — but you don’t need to think about that to use it.

That position on the map is the embedding. It’s produced by a small AI model called an embedder (not the same model that writes answers), which has been trained on enormous amounts of text to give similar meanings similar positions. Whenever you hear someone mention embedding a document or embedding a query, they mean: “ask the embedder where this text goes on the map.”

The whole thing is useful because once meaning is a position, similarity becomes a number. The closer two positions, the more similar the meanings. That’s the entire trick.
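
If you want to see the trick in code rather than take it on faith, here is a minimal sketch using the OpenAI Python SDK (one of the tools listed at the top; it assumes an API key in your environment). The only number involved is a built-in "how close are these two positions" score.

```python
# Minimal sketch: three sentences become positions, similarity becomes a number.
# Assumes the openai and numpy packages plus an OPENAI_API_KEY in the environment.
from openai import OpenAI
import numpy as np

client = OpenAI()

texts = [
    "my charger isn't working",
    "the power cable seems broken",
    "please cancel my subscription",
]

# Ask the embedder where each sentence sits on the map of meaning.
response = client.embeddings.create(model="text-embedding-3-small", input=texts)
positions = [np.array(item.embedding) for item in response.data]

def closeness(a, b):
    """Cosine similarity: near 1.0 means the same neighborhood, lower means further apart."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(closeness(positions[0], positions[1]))  # charger vs. power cable: high
print(closeness(positions[0], positions[2]))  # charger vs. cancellation: noticeably lower
```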

What they unlock

Why operators care about embeddings

Five things become easy once text has a position on the map. Each of these used to require either bespoke engineering or unreliable keyword matching; embeddings collapse them into the same operation.

  • Semantic search. A user types “how do I expense client dinner?” You embed the question, look up the documents whose positions are closest, and return them. The user finds what they meant, not what they typed — a customer support article titled “Travel and meal expense reimbursement” surfaces even though it shares no words with the question. This is the heart of RAG; see RAG explained without acronyms.
  • Deduplication. Two products in your catalogue might be the same item with different SKUs and slightly different descriptions. Embed each, measure the distance, and the duplicates surface as pairs with near-zero distance. The same trick finds near-duplicate articles, fraudulent ad listings, and copy-paste customer feedback.
  • Classification. Embed a handful of labelled examples per category (“billing question,” “technical issue,” “feature request”). When a new ticket comes in, embed it and route it to the nearest cluster. No training, no machine-learning pipeline: just asking “which existing examples is this most like?”
  • Clustering. Embed a few thousand customer feedback comments and group them by position. You get topic clusters without ever specifying the topics in advance. (See Find patterns in customer feedback for the practical workflow.)
  • Recommendation. Embed what a user has read, watched, or bought. Find other items whose embeddings are close. The “you might also like” panel is simply the set of items that live nearest on the meaning map.

These five use cases account for the overwhelming majority of “we need embeddings for X” projects you’ll hear about. They share the same core operation: place text on the map, then ask what’s near it?
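
Under the hood, each of those features is the same handful of lines. Here is a sketch of the semantic-search case, reusing the OpenAI embedder from the earlier snippet; the document titles are made up for illustration.

```python
# Semantic search in miniature: embed the documents once, embed the query,
# return whatever sits closest on the map. Pointed at different data, the same
# nearest-neighbor loop gives you deduplication, classification, clustering
# and recommendation.
from openai import OpenAI
import numpy as np

client = OpenAI()
MODEL = "text-embedding-3-small"

def embed(texts):
    response = client.embeddings.create(model=MODEL, input=texts)
    return [np.array(item.embedding) for item in response.data]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

docs = [
    "Travel and meal expense reimbursement",   # illustrative titles, not a real corpus
    "Resetting your VPN password",
    "Quarterly planning calendar and deadlines",
]
doc_positions = embed(docs)
query_position = embed(["how do I expense client dinner?"])[0]

ranked = sorted(zip(doc_positions, docs),
                key=lambda pair: cosine(pair[0], query_position), reverse=True)
for position, title in ranked:
    print(f"{cosine(position, query_position):.3f}  {title}")
# The expense article should rank first despite sharing almost no words with the query.
```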

What goes wrong

The failure modes that bite teams in production

Embeddings look like a free lunch — text in, position out, similarity for cheap — but four failure modes consistently surface once they’re in production.

  • Embedding-model mismatch. Every embedder produces positions on its own map. The map produced by OpenAI’s text-embedding-3-large is not the same map as the one produced by Cohere’s Embed v4 or BAAI’s BGE-M3. If you index your documents with one model and query with another, you’re asking “what’s near this position on map A?” using a position from map B — gibberish results. The rule: index and query with the same embedder, always. Switching embedders later means re-embedding everything.
  • Stale embeddings. Your documents change; their positions don’t update automatically. A pricing page edited last week still has last quarter’s embedding in the index until you re-embed it. For a small team, “re-embed on edit” is one line of code; for a large team, it’s the kind of plumbing that gets skipped during the prototype and becomes a quiet correctness bug six months in.
  • Language coverage gaps. Most embedders are trained predominantly on English. Quality on top-tier non-English languages (Spanish, French, German, Mandarin) is good and improving; quality on under-represented languages can be noticeably worse, with similarity scores that don’t track meaning the way they do in English. Multilingual embedders like Cohere Embed v4 and BGE-M3 are specifically designed for this — use them if your corpus is genuinely multilingual.
  • Domain shift. A general-purpose embedder trained on the open web has a fuzzy understanding of highly specialised domains — pharmaceutical regulatory text, derivatives trading filings, niche legal jargon. Two sentences that mean different things to a domain expert can land at the same position because the embedder hasn’t learned the distinction. Domain-tuned embedders exist; for most teams, the fix is either to evaluate before you commit (build a small ground-truth set of “these should be similar / these should not”) or to layer a reranker on top.

The single most common production failure is no evaluation harness. If you can’t measure whether retrieval is finding the right things, you’ll spend months tuning a system you can’t tell is broken. Even a small set — fifty (query, expected document) pairs you’ve labelled by hand — turns “this seems fine” into a number you can move.
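
An evaluation harness is smaller in code than the phrase suggests. A sketch, assuming a retrieve(query, k) function that returns document IDs; the function name and the example pair are placeholders for your own retrieval code and labels.

```python
# Top-k recall: for what fraction of labelled queries did the expected document
# show up anywhere in the top k results? This one number is what you watch when
# you swap embedders, change chunking, or add a reranker.

def top_k_recall(labelled_pairs, retrieve, k=5):
    """labelled_pairs: list of (query, expected_doc_id) tuples.
    retrieve(query, k): your retrieval function, returning a list of doc ids."""
    hits = sum(1 for query, expected in labelled_pairs if expected in retrieve(query, k))
    return hits / len(labelled_pairs)

# Fifty hand-labelled pairs is enough to turn "this seems fine" into a number:
# pairs = [("how do I expense client dinner?", "doc_expense_policy"), ...]
# print(top_k_recall(pairs, retrieve, k=5))
```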

The numbers

What this costs and how big it gets

OpenAI text-embedding-3-small · $0.02 per million tokens · 1,536 dimensions
OpenAI text-embedding-3-large · $0.13 per million tokens · 3,072 dimensions
Cohere Embed v4 (API) · ~$0.12 per million tokens · multilingual, multimodal
BGE-M3 (open weights, self-hosted) · $0 per call on existing inference · 1,024 dimensions
One-time embedding of a 25-million-word corpus (text-embedding-3-small) · ~$0.50–$1
Storage per million vectors (1,536 dims, 4 bytes per dim) · ~6 GB · fits in any modest Postgres
Storage per million vectors (3,072 dims, 4 bytes per dim) · ~12 GB · still fits, but plan the disk
Per-query latency, pgvector under ~1M vectors · <50 ms typical
Per-query latency, managed vector DB at scale · 10–80 ms typical at p99
"Good enough" retrieval (well-tuned setup on clean docs) · 85–95% top-k recall
"Naive setup" retrieval (single embedder, no eval, no rerank) · 50–70% top-k recall
Rerank pass (cost / latency / recall lift) · ~$0.10–$2 per 1k queries · +30–80 ms · +5–15 points recall

Three patterns are worth pulling out of those numbers.

First, the embedding cost itself is almost free for any corpus you can realistically read. Embedding a 25-million-word corpus with text-embedding-3-small costs around a dollar, and even the whole of English Wikipedia comes in on the order of a hundred dollars, not thousands. The expensive parts of an embedding-powered system are storage at scale, query-time generation, and engineering effort, not the embeddings themselves.

Second, dimensions matter for storage but not always for quality. A 3,072-dimension embedding takes twice the storage per vector of a 1,536-dimension one. If your corpus is small (under a million documents) the difference is invisible; if you’re storing a hundred million vectors the difference is the cost of one extra server. The MTEB leaderboard (linked in Sources) is the place to check whether the bigger model is meaningfully better for your task — sometimes it is, often it isn’t.
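
Both figures are back-of-envelope arithmetic you can redo for your own corpus; the tokens-per-word ratio below is a rough rule of thumb for English text, not a universal constant.

```python
# Back-of-envelope cost and storage for an embedding index.
words = 25_000_000                 # corpus size in words
tokens = words * 1.3               # rough tokens-per-word ratio for English text
price_per_million = 0.02           # text-embedding-3-small, USD per million tokens
print(f"one-time embedding cost: ${tokens / 1_000_000 * price_per_million:.2f}")  # ~$0.65

vectors = 1_000_000                # one vector per chunk or document
dims = 1536                        # text-embedding-3-small
print(f"raw vector storage: {vectors * dims * 4 / 1e9:.1f} GB")  # 4-byte floats, ~6.1 GB
```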

Third, a reranker is the cheapest accuracy buy on the menu. If your naive setup is at 60% recall, adding a rerank pass on the top 50 results from your embedder typically pushes you to 75–85% for cents per thousand queries. It’s not a substitute for a good embedder, but it’s how most production-grade RAG systems claw back the gap.
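
In code, the rerank pass is a second scoring step over the shortlist. A sketch using an open cross-encoder from the Sentence Transformers library; the specific checkpoint is one common choice rather than a recommendation, and retrieve(query, k) stands in for whatever embedding retrieval you already have.

```python
# Two-stage retrieval: the embedder produces a cheap shortlist, the reranker
# re-reads each (query, candidate) pair carefully and reorders the shortlist.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def search_with_rerank(query, retrieve, shortlist_size=50, final_k=5):
    candidates = retrieve(query, shortlist_size)                     # fast, embedding-based
    scores = reranker.predict([(query, doc) for doc in candidates])  # slower, more careful
    reordered = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in reordered[:final_k]]
```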

Picking a model

A short decision rule

The model menu is overwhelming — the MTEB leaderboard alone has hundreds of entries. For most teams in 2026, the decision compresses to this:

  • English-only corpus, want it managed, don’t want to think about it → OpenAI text-embedding-3-small. Cheap, fast, well-supported by every framework. Move to text-embedding-3-large only if eval scores demand it.
  • Multilingual corpus, want it managed → Cohere Embed v4. Built specifically for cross-language similarity; handles dozens of languages with quality that doesn’t fall off a cliff outside English.
  • Privacy or cost is the binding constraint, can self-host → BGE-M3 (open weights, multilingual, runs on a modest GPU) or one of the top Sentence-Transformers models. Zero per-token cost; you pay for the inference server.
  • Specialised domain (legal, biomedical, financial filings) → evaluate a domain-tuned model from the Hugging Face hub against a general one on your own labelled set. Sometimes the domain model wins clearly; sometimes the general one with a reranker wins. Don’t guess — measure.

You can almost always start with the cheapest option, build the eval harness, and only upgrade when the harness tells you to.

Common questions

FAQ

How is an embedding different from a hash?

A cryptographic hash is designed so that any change to the input produces a wildly different output — that's the whole point. An embedding is designed so that meaning-preserving changes produce similar outputs. "My charger isn't working" and "the power cable seems broken" have nearly-identical embeddings but completely different hashes. Hashes answer "is this the exact same thing?"; embeddings answer "is this the same kind of thing?"

Which embedding model should I pick?

The honest default for an English corpus is OpenAI's text-embedding-3-small — it's cheap, fast, and well-supported. For multilingual content, Cohere Embed v4 or BAAI's BGE-M3 are the standard choices. If you're privacy-constrained or cost-sensitive at scale, BGE-M3 self-hosted is the right answer because there's no per-token cost. Don't pick by reputation; build a small evaluation set of "queries with expected matches" and test two or three candidates on your own data. The MTEB leaderboard (linked in Sources) is useful as a starting point, not as the final answer.

Can I embed images, audio, or PDFs with the same idea?

Yes — and increasingly with the same model. Multimodal embedders (Cohere Embed v4, OpenAI's image-and-text embedders, several open-weights models) place images and text on the same map, so you can search an image library with a text query. PDFs are extracted to text and chunked before embedding, which means the quality is gated by your extraction step — a bad OCR pass produces bad embeddings. (See Extract structured data from PDFs at scale for the extraction half.)

How often do I need to re-embed my documents?

When the document changes, or when you switch embedding models. Documents that haven't changed don't need re-embedding — embeddings are deterministic per model. The pattern most teams converge on is: re-embed on edit (a database trigger or a queue job), schedule a periodic full re-embed if you're worried about silent drift in your storage layer, and budget a full re-embed if you ever change embedders. The full re-embed of a typical corporate corpus is usually a one-coffee expense (see the numbers above).
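
A sketch of the "re-embed on edit" pattern. Every name here (save_document, vector_store, embed) is a placeholder for whatever your app and storage layer actually call these things; the point is only that the vector update lives on the same code path as the document save.

```python
# Re-embed on edit: whenever a document changes, refresh its position on the map
# in the same save path (or hand it to a queue job if embedding inline is too slow).
def save_document(doc_id: str, new_text: str, db, vector_store, embed) -> None:
    db.update(doc_id, new_text)              # the existing save
    vector = embed([new_text])[0]            # same embedder used at index time
    vector_store.upsert(doc_id, vector)      # keep the index in step with the text
```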

What's a reranker and do I need one?

A reranker is a second, slower model that takes your top retrieved results — say, the top 50 closest embeddings to a query — and re-orders them based on a more careful read of how well each one actually answers the query. It's slower per item than embedding retrieval, but it only runs on the small shortlist, so the total cost stays modest. Production RAG systems with serious quality requirements almost always use rerankers because the recall lift (typically 5–15 points) is large for the cost. If your retrieval recall is already in the 90s on your eval set, you can probably skip the reranker. If it's in the 60s or 70s, a reranker is the single highest-leverage thing to add.

Sources & references

Change history (1 entry)
  • 2026-05-11 Initial publication.