Cyberax AI Playbook
cyberax.com
Explainer · Foundations

RAG explained without acronyms

If you're an operator or founder considering a "chat with our docs" tool, this is what retrieval-augmented generation actually is, when you genuinely need it, and what it costs to set up. Written for the person making the decision — not the engineer wiring it up.

At a glance
Last verified: May 2026
Problem solved: Decide whether your team needs RAG, and understand what shipping it actually involves
Best for: Operators, founders, and ops leads evaluating whether to build a "chat with our docs" tool
Tools: Pinecone, pgvector, LlamaIndex, OpenAI embeddings
Difficulty: Beginner

A large language model doesn’t know your data. It doesn’t know your company’s pricing, your last quarter’s policy changes, your engineering wiki, your customer history. Ask it a question that depends on any of that and you’ll get a confident-sounding guess. (For why that happens, see What an LLM actually does for a business.)

Retrieval-augmented generation — usually shortened to RAG — is the standard fix. The name sounds technical; the idea is not. This piece explains it the way you’d explain it to a colleague over coffee, and helps you decide whether your team actually needs one.

The metaphor

An open-book exam, not a closed-book one

A naked LLM is a smart graduate sitting a closed-book exam. Bright, articulate, broadly knowledgeable, but limited to what they happen to remember. When the exam asks for the specific clause in your contract, they’ll either guess plausibly or say they don’t know.

A RAG system turns the same exam into an open-book one. The graduate is the same; what’s changed is that, before answering, they’re handed the few pages of your binder that are most relevant to the question. They read those pages, then write their answer based on what’s in front of them.

That’s the whole idea. Retrieval = “find the relevant pages.” Generation = “write the answer.” Augmented = “the generation is informed by what was retrieved.”

The three steps

What RAG actually does, in three steps

Step 1 — Prepare the binder, once. Your documents (whatever they are: PDFs, wiki pages, support tickets, contracts, transcripts) are split into small passages — typically a few hundred words each. Each passage is then converted into a long list of numbers that captures what the passage is about. (The numbers are called an embedding; the conversion is done by a separate small AI model.) Both the passage and its numbers are stored in a database designed to find similar number-lists fast — that’s a vector database. This step happens once, then again whenever your documents change.
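
If you want to see how small this step really is, here's a minimal sketch, assuming the OpenAI Python SDK for embeddings and a plain Python list standing in for the vector database (a real system would write to pgvector or a managed store):

```python
# Step 1 sketch: split documents into passages and embed them.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chunk(text: str, words_per_chunk: int = 200) -> list[str]:
    """Naive fixed-size chunking; real systems split on headings and paragraphs."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

documents = ["...your handbook text...", "...your pricing policy..."]
passages = [c for doc in documents for c in chunk(doc)]

# One API call turns every passage into its number-list (embedding).
resp = client.embeddings.create(model="text-embedding-3-small", input=passages)
index = [(passages[i], d.embedding) for i, d in enumerate(resp.data)]
```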

Step 2 — Find the relevant pages, per question. When someone asks a question, the question itself is converted into the same kind of number-list. The database returns the handful of passages whose number-lists are closest to the question's — that is, the passages most similar in meaning. In modern systems this step is usually combined with old-fashioned keyword search and then a re-ranking model, because that hybrid approach catches things pure semantic search misses.
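
Continuing the sketch above, retrieval is one more embedding call plus a nearest-neighbour search. Here it's a brute-force cosine similarity over the in-memory list — a vector database does exactly this, just faster (the hybrid keyword-plus-re-rank step is omitted for brevity):

```python
# Step 2 sketch: embed the question and find the closest passages.
# "client" and "index" come from the Step 1 sketch.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(question: str, index, k: int = 4) -> list[str]:
    q = client.embeddings.create(model="text-embedding-3-small",
                                 input=[question]).data[0].embedding
    scored = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [passage for passage, _ in scored[:k]]
```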

Step 3 — Have the model answer using those pages. The retrieved passages are pasted into the prompt along with the original question, with instructions like “Answer the question using only the information in the passages below. If the passages don’t contain the answer, say so.” The model then generates an answer grounded in what it just read.
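
And the final step, again as a sketch (gpt-4o-mini is an illustrative model choice, not a recommendation):

```python
# Step 3 sketch: paste the retrieved passages into the prompt and answer.
def answer(question: str, index) -> str:
    passages = retrieve(question, index)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer the question using only the information in the "
                        "passages below. If the passages don't contain the "
                        "answer, say so."},
            {"role": "user", "content": f"Passages:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content
```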

Everything else in the RAG literature — chunking strategies, embedding model choice, hybrid search, re-ranking, contextual retrieval, query rewriting — amounts to variations on these three steps that make them work better in your specific case.

When you need it

A small decision rule

You need RAG when all of the following are true:

  • The answers depend on documents the model wasn't trained on — your data, post-cutoff data, anything proprietary.
  • The document corpus is too large to paste into a single prompt every time.
  • The questions are open-ended enough that you can't anticipate them all.

You don’t need RAG when any of these is true:

  • Your corpus is small enough to fit in the model’s context window — and modern flagship models can hold up to about 1–2 million tokens, which is roughly an entire library shelf. If your knowledge base is a 30-page handbook, just paste it in. The engineering complexity of RAG is wasted on small corpora.
  • The questions are highly structured — for instance, “what’s the price of product X?” That’s a database lookup, not a retrieval problem. Use SQL.
  • The model already knows the answer — for general public knowledge, the base model is often fine without retrieval. Test before you build infrastructure.
  • The information changes by the second — financial market data, live system status, real-time inventory. Use an API call or a tool, not a retrieval index that you have to rebuild constantly.

The numbers

What it costs to set up and run

Costs split into two parts: building the index, and answering each question.

Embedding cost (index build): $0.02–$0.13 per million tokens (≈ $0.65–$4.25 to embed a 25-million-word corpus)
Vector database, managed (Pinecone, Weaviate Cloud): $20–$700/month depending on scale (Pinecone Builder $20/month entry; Enterprise $500+/month)
Vector database, self-hosted (pgvector on existing Postgres): $0 incremental if you already run Postgres
Per-question cost, retrieval: fractions of a cent (one embedding call + one database query)
Per-question cost, generation with retrieved context: $0.001–$0.05 typical (more context = higher cost)
Engineering time, first working prototype: 1–3 days using a framework like LlamaIndex
Engineering time, production-ready (eval, monitoring, hybrid search, re-ranking): 3–8 weeks
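
As a back-of-envelope check on the index-build line (the token-per-word ratio and prices are rules of thumb that vary by model):

```python
# Rough cost to embed a 25-million-word corpus.
words = 25_000_000
tokens = words * 1.3              # ~1.3 tokens per English word (rule of thumb)
for price in (0.02, 0.13):        # $/million tokens: small vs. large embedding model
    print(f"${tokens / 1_000_000 * price:,.2f}")   # → $0.65 and $4.23
```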

The gap between “first prototype” and “production-ready” is where most teams discover what RAG actually costs. The prototype is easy; making it reliable is the work.

Where it fails

The failure modes that bite teams in production

Vendor and academic write-ups in 2026 converge on the same point: when RAG fails, it almost always fails at retrieval, not generation. The model answers from the passages it was handed — if the right passage wasn't among them, the answer is wrong no matter how good the model is.

The specific failure modes worth knowing about:

  • Right topic, wrong passage. The retriever returned passages that seem topically related but don't actually contain the answer. The model then either makes something up that fits the passages or politely says it doesn't know.
  • Multi-hop questions. “Which of our policies changed between 2024 and 2025?” requires retrieving and comparing two documents — most naive RAG systems retrieve passages independently and miss the comparison.
  • Loss of context when chunking. Splitting a contract at fixed character lengths can put a definition on one side and the term it defines on the other. Anthropic’s Contextual Retrieval (linked below) is the current best-practice fix — it prepends a one-line summary to each chunk before storage so the chunk is less context-dependent.
  • Out-of-date index. Your retrieval is only as fresh as your last re-index. Plan for incremental updates from day one.
  • No evaluation harness. If you can't measure whether retrieval is improving, you'll spend months tuning blindly. Build a small evaluation set of "question, expected source passage" pairs early — even fifty examples beats vibes. (A minimal version is sketched below.)
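
A minimal version of that harness — assuming each example pairs a question with a snippet the right passage contains, and reusing the retrieve function from the sketches above:

```python
# Minimal retrieval eval: what fraction of questions surface the right
# passage in the top k? The eval-set format here is an assumption —
# adapt it to however you store your examples.
eval_set = [
    {"question": "What is our refund window?",
     "expected": "Refunds are accepted within 30 days"},
    # ... aim for at least fifty of these
]

def hit_rate(index, k: int = 4) -> float:
    hits = sum(
        any(ex["expected"] in p for p in retrieve(ex["question"], index, k))
        for ex in eval_set
    )
    return hits / len(eval_set)
```
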
What's next

If you decide to build one

Read Build a private knowledge base your team can search for the practical setup, including the framework choice, the cheap-vs-managed vector database call, and the hybrid-search-plus-rerank pattern that the field has converged on.

Common questions

FAQ

Do I need a vector database, or can I just use my existing one?

If you already run Postgres, the pgvector extension lets you do vector search inside your existing database — no new infrastructure needed. For most teams under a few million documents, that's the right starting point. Reach for a dedicated vector database (Pinecone, Weaviate, Qdrant) when you cross hundreds of millions of vectors or need very low-latency search at high concurrency.
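
For the pgvector route, the search itself is a single SQL query. A sketch using the psycopg driver, assuming a passages table that already holds your embedded chunks:

```python
# Nearest-neighbour search in Postgres with pgvector. Assumes you have run
# CREATE EXTENSION vector; and populated a table such as:
#   CREATE TABLE passages (content text, embedding vector(1536));
import psycopg

q = [0.01, -0.02, 0.03]                      # the question's embedding (illustrative)
literal = "[" + ",".join(map(str, q)) + "]"  # pgvector's text format for a vector

with psycopg.connect("dbname=app") as conn:
    rows = conn.execute(
        # <=> is pgvector's cosine-distance operator (<-> is Euclidean)
        "SELECT content FROM passages ORDER BY embedding <=> %s::vector LIMIT 4",
        (literal,),
    ).fetchall()
```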

How is RAG different from fine-tuning the model?

Fine-tuning teaches the model new style or new behaviour. RAG gives the model new information at the moment of the question. If the model doesn't know your facts, RAG is almost always cheaper, faster, and easier to update than fine-tuning. Fine-tune when you need consistent tone or domain-specific phrasing across thousands of outputs; use RAG when you need accurate facts.

Can I do RAG without writing code?

Yes — vendors now offer hosted RAG: ChatGPT "Custom GPTs" with file uploads, Claude Projects with attached context, Google NotebookLM, Microsoft Copilot's grounding on SharePoint. These work well for small corpora and single-team use. For anything that has to scale, integrate with multiple data sources, or enforce careful access control, you'll write code (or your team will).

How accurate is RAG in practice?

It depends almost entirely on retrieval quality. A well-built RAG system on clean documentation regularly answers 85–95% of in-scope questions correctly. A naive RAG system thrown together in a weekend on messy PDFs will land closer to 50–70%. The gap is the work — chunking strategy, hybrid search, re-ranking, and an evaluation harness explain most of the difference.

Will RAG let the model cite its sources?

Yes — and you should make it do so. The standard pattern is to instruct the model to include a reference to the retrieved passage that supports each claim, then surface those references to the reader as clickable citations. This converts a black-box answer into something a user can verify, which is most of what makes a RAG system trustworthy in practice.
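
One way to wire that up, reusing the retrieve sketch from earlier — the bracket-numbering convention is one common pattern, not a standard:

```python
# Citation pattern: number each retrieved passage, ask for bracketed
# references, then map [n] markers back to source links in your UI.
question = "What is our refund window?"
passages = retrieve(question, index)
context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
instructions = (
    "Answer using only the numbered passages below. After each claim, cite "
    "the passage that supports it, like [2]. If no passage supports an "
    "answer, say you don't know."
)
prompt = f"{instructions}\n\n{context}\n\nQuestion: {question}"
```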

Sources & references

Change history (1 entry)
  • 2026-05-10 Initial publication.