Cyberax AI Playbook
cyberax.com
How-to · Operations & Knowledge · Local-OK

Build a private RAG with no third-party calls

For engineers at security-sensitive or regulated companies: a retrieval-augmented setup — giving the AI access to your specific documents — in which nothing leaves your own machines. Embeddings, vector store, retrieval, and response synthesis all run locally. Covers hardware sizing, latency benchmarks, ongoing eval, and the failure modes specific to this configuration.

At a glance
Last verified · May 2026
Problem solved · Stand up a retrieval-augmented Q&A system entirely on your own infrastructure — no API calls to OpenAI, Anthropic, Cohere, or any other vendor — and know what you give up in exchange
Best for · Engineers at security-sensitive companies, regulated industries (legal, medical, defence), founders concerned about long-term vendor risk
Tools · BGE-M3, pgvector, Qdrant, Llama 3.3, Mistral Small, Ollama, vLLM
Hardware · Single high-end consumer GPU (RTX 4090 / 5090) viable for small corpora; A100 / H100 class for production
Difficulty · Advanced
Cost · $0 incremental software (open-weights + open-source); $1,500–$30,000 hardware depending on tier
Time to set up · 2–4 weeks for a functional pilot; 2–3 months to a production-quality system with eval

If you work at a legal firm with client-privileged documents, in a regulated industry like healthcare or defence, or at a company whose customers contractually forbid third-party processing, the standard RAG tutorial doesn’t apply to you.

Most RAG tutorials — RAG meaning retrieval-augmented generation, the technique of giving an AI access to your specific documents — assume you’ll call OpenAI for embeddings (the way AI represents meaning as numbers), Cohere or Voyage for reranking, and one of the frontier LLMs for generation. For many teams that’s the right answer. The quality is high, the operational burden is low, and the cost per query is small until volume gets serious.

For a meaningful minority of teams, none of that is acceptable.

What follows is the workflow for those teams: a complete RAG pipeline that never makes a third-party API call (an API, or application programming interface, is the way one piece of software calls another). Different components — smaller models, self-hosted vector store, your own evaluation loop — and real trade-offs. The architecture is well-understood in 2026 and shippable in two to four weeks with one engineer who knows what they’re doing.

When to use

Where this fits — and where it doesn't

Use this if data sensitivity is the dominant constraint — your data cannot leave your network, your customers’ contracts forbid third-party processing, your regulator requires data residency in a specific country, or your industry treats vendor risk as an existential category. The architecture also makes sense when you have predictable high-volume workloads and the unit economics of self-hosting (with engineering investment) beat the per-call API costs at your scale.

Don’t use this if you have fewer than ~10,000 documents (the maintenance cost dominates the privacy benefit at that scale; use the regular private knowledge-base pattern instead), your latency budget is sub-second at high concurrency (frontier APIs still win on p99 latency under load), your team has no ops capacity (this is real infrastructure, not a SaaS subscription), or you’re optimising for “best possible answer quality” — the open-weights models are very good in 2026 but still trail Claude Opus and GPT-5 on the hardest reasoning tasks.

Prerequisites

What you'll need before starting

  • A GPU. Single RTX 4090 or RTX 5090 is enough for small-team pilots; A100/H100 (or rented equivalent on a sovereign-cloud provider) for production load.
  • An engineer comfortable with Linux, Docker, and Python. This is not a no-code workflow.
  • A corpus that’s been cleaned and chunked — see RAG explained without acronyms and the knowledge-base setup for the upstream work. Garbage in is garbage out for any RAG, private or otherwise.
  • A small labelled eval set — 50–200 question/answer pairs that represent what users will actually ask. Without this, you cannot tell whether the system is good enough; a minimal example format follows this list.
  • A real decision on what “private” means for your team. Air-gapped (no internet at all)? Same data centre as your other workloads? Sovereign cloud in a specific jurisdiction? The answer shapes the deployment.
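
A minimal eval-set format that works with the eval loop sketched in step 8: one JSON object per line. The field names and the example pairs are illustrative, not a required schema:

    {"question": "What is the notice period for terminating an enterprise contract?", "answer": "90 days, per section 7.2 of the MSA.", "source": "msa_2025.pdf"}
    {"question": "Which regions may production data be stored in?", "answer": "EU-West and EU-Central only.", "source": "data_residency_policy.md"}
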
The solution

Eight steps to a working private RAG

  1. Size the hardware against your corpus and concurrency

    Three ballparks to anchor on. Pilot tier: RTX 4090 (24 GB VRAM), 64 GB RAM, 2 TB NVMe — enough for up to ~5M chunks, a 7B–32B generation model, and 1–5 concurrent users. Production small: single A100 80GB or H100, 128–256 GB RAM, fast storage — supports 70B-class generation, 50M+ chunks, 10–30 concurrent users with caching. Production large: multi-GPU node with vLLM or TGI, dedicated retrieval host, separate vector-store host — for 100+ concurrent users. The biggest sizing mistake is under-provisioning RAM; the model weights are obvious, the retrieval index and cache are easy to forget.
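
    A back-of-envelope sketch of why the retrieval index is easy to forget, assuming 1024-dimensional float32 dense vectors (BGE-M3's output size) and a rough 1.5× HNSW overhead factor; the overhead factor is an assumption to replace with your own measurements:

      # Rough RAM estimate for the vector index alone (sizing sketch, not a tool).
      chunks = 5_000_000                    # pilot-tier corpus ceiling from above
      dims, bytes_per_float = 1024, 4       # BGE-M3 dense vectors in float32
      raw_gb = chunks * dims * bytes_per_float / 1024**3
      index_gb = raw_gb * 1.5               # assumed HNSW graph/metadata overhead
      print(f"raw vectors ~{raw_gb:.0f} GB, with index overhead ~{index_gb:.0f} GB")
      # ~19 GB raw, ~29 GB indexed, before the model weights and OS cache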

  2. Pick the embedding model — multilingual or English-only

    For multilingual corpora, BGE-M3 (BAAI) is the strongest open-weights embedder in 2026 — handles 100+ languages, dense + sparse + multi-vector retrieval, runs comfortably on consumer GPUs. For English-only at higher throughput, BGE-small-en-v1.5 or nomic-embed-text-v1.5 are faster with a small quality trade-off. Whatever you pick, lock it in — re-embedding the corpus is the most expensive operation in this stack, and switching embedders mid-stream invalidates every prior vector.
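
    A minimal embedding sketch, assuming the sentence-transformers package and the Hugging Face model id BAAI/bge-m3. It covers dense vectors only; BGE-M3's sparse and multi-vector outputs need the FlagEmbedding package instead:

      # Batch-embed chunks into 1024-dim dense vectors on the local GPU.
      from sentence_transformers import SentenceTransformer

      embedder = SentenceTransformer("BAAI/bge-m3")
      chunks = ["First document chunk ...", "Second document chunk ..."]
      vectors = embedder.encode(chunks, batch_size=256, normalize_embeddings=True)
      print(vectors.shape)                  # (2, 1024)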

  3. Pick the vector store — pgvector if you have Postgres, Qdrant otherwise

    For most teams already running Postgres, pgvector is the right answer — backups, monitoring, access control, and operational practice all already exist for your database. For standalone deployments or teams that need filtering and metadata at scale, Qdrant is the cleanest open-source vector database and runs comfortably on a single machine for tens of millions of vectors. Both support hybrid retrieval (vector + keyword) which materially improves recall on real-world corpora. See vector databases compared for the longer trade-off discussion.
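
    A pgvector sketch via psycopg. The table layout, column names, and connection string are illustrative, and the top_k helper defined here is reused in the retrieval sketch at step 6:

      # Create the vector column + HNSW index, then query top-k by cosine distance.
      import psycopg

      conn = psycopg.connect("dbname=rag user=rag", autocommit=True)
      cur = conn.cursor()
      cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
      cur.execute("""
          CREATE TABLE IF NOT EXISTS chunks (
              id bigserial PRIMARY KEY,
              source text, section text, body text,
              embedding vector(1024)
          )""")
      cur.execute("CREATE INDEX IF NOT EXISTS chunks_hnsw "
                  "ON chunks USING hnsw (embedding vector_cosine_ops)")

      def top_k(query_vec, k=20):
          # <=> is pgvector's cosine-distance operator; smaller means closer
          literal = "[" + ",".join(f"{x:.6f}" for x in query_vec) + "]"
          cur.execute(
              "SELECT id, source, body FROM chunks "
              "ORDER BY embedding <=> %s::vector LIMIT %s",
              (literal, k),
          )
          return cur.fetchall()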

  4. Build the ingestion pipeline — chunk, embed, store, version

    Chunking strategy depends on document type: code and tables benefit from structural chunks (per function, per row); long prose works at 500–800-token paragraph-aware chunks with 50–100 tokens of overlap. Embed in batches (256–1024 docs at a time on the chosen GPU) and write to the vector store with rich metadata — source filename, section, last-modified timestamp, access-control tags. Version the ingestion run; when you re-ingest, write to a new namespace and switch reads only after eval passes. Re-ingesting in place is the most common way to lose retrieval quality without noticing.
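
    An ingestion sketch using the Qdrant client (the equivalent writes into pgvector follow the same pattern). The collection-per-run versioning, the word-based chunker, and the payload fields are illustrative choices, not the only reasonable ones:

      # Chunk, embed in batches, and upsert with metadata into a versioned collection.
      from qdrant_client import QdrantClient
      from qdrant_client.models import Distance, PointStruct, VectorParams
      from sentence_transformers import SentenceTransformer

      embedder = SentenceTransformer("BAAI/bge-m3")
      client = QdrantClient(url="http://localhost:6333")

      RUN = "corpus_2026_05_11"             # new ingestion run, new collection
      client.create_collection(
          RUN, vectors_config=VectorParams(size=1024, distance=Distance.COSINE)
      )

      def chunk(text, target=700, overlap=80):
          # crude word-level stand-in for 500-800-token paragraph-aware chunking
          words = text.split()
          step = target - overlap
          return [" ".join(words[i:i + target]) for i in range(0, max(len(words), 1), step)]

      def ingest(doc_id, text, source, section, modified):
          pieces = chunk(text)
          vectors = embedder.encode(pieces, batch_size=256, normalize_embeddings=True)
          points = [
              PointStruct(
                  id=hash((doc_id, i)) % (2 ** 63),        # illustrative id scheme
                  vector=vec.tolist(),
                  payload={"source": source, "section": section,
                           "modified": modified, "text": piece},
              )
              for i, (piece, vec) in enumerate(zip(pieces, vectors))
          ]
          client.upsert(collection_name=RUN, points=points)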

  5. Pick the generation model — quality vs latency trade-off

    In 2026, three reasonable defaults. Llama 3.3 70B for the quality-leaning default — strongest open-weights generalist, runs on a single A100 with quantisation or two consumer GPUs. Mistral Small 3 or Mistral Medium for the latency-leaning default — faster, smaller, still strong for grounded Q&A. Qwen 2.5 32B for multilingual generation where Llama’s English-centric training shows. The right choice is workload-specific; benchmark on your own eval set before locking in.
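
    A generation sketch against a locally served model via Ollama's REST API; the model tag and endpoint are assumptions, and for production throughput you would point the same call at a vLLM server instead. The generate helper is reused in steps 7 and 8:

      # One blocking chat call to the local Ollama server; nothing leaves the machine.
      import requests

      def generate(prompt, model="llama3.3:70b"):
          resp = requests.post(
              "http://localhost:11434/api/chat",
              json={"model": model,
                    "messages": [{"role": "user", "content": prompt}],
                    "stream": False},
              timeout=120,
          )
          resp.raise_for_status()
          return resp.json()["message"]["content"]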

  6. Wire retrieval with reranking — k=20 dense, rerank to k=5

    Dense retrieval alone leaves ~10–20% recall on the table for most corpora. Retrieve top 20 chunks by vector similarity, then rerank with a cross-encoder (BGE-reranker-v2 is the current open-weights default) to k=5 before sending to the generation model. The rerank step is computationally cheap (a few hundred milliseconds on the same GPU as the embedder) and improves answer quality more than upgrading the generation model in many workloads. Hybrid retrieval (BM25 + dense, fused with reciprocal rank fusion) is a further 5–10% recall improvement on real corpora.
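
    A retrieve-then-rerank sketch: dense top 20 from the vector store, cross-encoder rerank to top 5. It reuses the top_k helper from the step 3 sketch and assumes the sentence-transformers CrossEncoder wrapper with the BAAI/bge-reranker-v2-m3 weights:

      # Embed the query, pull 20 candidates by cosine similarity, keep the best 5.
      from sentence_transformers import CrossEncoder, SentenceTransformer

      embedder = SentenceTransformer("BAAI/bge-m3")
      reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

      def retrieve(query, k_dense=20, k_final=5):
          qvec = embedder.encode(query, normalize_embeddings=True)
          candidates = top_k(qvec, k=k_dense)            # [(id, source, body), ...]
          scores = reranker.predict([(query, body) for _, _, body in candidates])
          ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
          return [cand for _, cand in ranked[:k_final]]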

  7. Engineer the response-synthesis prompt — grounding instructions and refusal patterns

    Open-weights models follow instructions less reliably than the frontier closed models; the prompt has to do more work. Structure the synthesis prompt with: explicit grounding instructions (“answer only from the provided context”), citation requirements (“include the source filename for each factual claim”), refusal patterns (“if the answer is not in the context, say so explicitly — do not extrapolate”), and length/style constraints. Test the prompt against adversarial questions (questions whose answers aren’t in your corpus) — the refusal behaviour is where open-weights models most commonly fall short of expectations.
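
    An illustrative synthesis prompt with grounding, citation, and refusal instructions; the exact wording is a starting point to test against your own adversarial questions, not a proven recipe:

      # Prompt template plus a small assembler; chunk tuples match the step 6 sketch.
      SYNTHESIS_PROMPT = """You are answering questions from internal documents.

      Context:
      {context}

      Rules:
      - Answer ONLY from the context above. Do not use outside knowledge.
      - Cite the source filename after each factual claim, e.g. (policy_2024.pdf).
      - If the answer is not in the context, reply exactly: "I can't find this in the provided documents."
      - Keep the answer under 200 words.

      Question: {question}
      Answer:"""

      def build_prompt(question, chunks):
          context = "\n\n".join(f"[{source}] {body}" for _, source, body in chunks)
          return SYNTHESIS_PROMPT.format(context=context, question=question)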

  8. Run continuous eval — without it, you cannot tell when the system regresses

    Use Ragas or a hand-rolled eval against your labelled set. Track four metrics: faithfulness (does the answer match the retrieved context?), answer relevancy (is the answer on-topic?), context precision (are the retrieved chunks actually relevant?), context recall (did you retrieve everything needed?). Re-run the eval after every change — new chunks ingested, system prompt edits, model upgrades, retrieval-parameter tweaks. The eval is what lets you ship confidently; teams that don’t have one find out the system has degraded only when users complain.
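
    A hand-rolled eval-loop sketch that reuses the retrieve, build_prompt, and generate helpers from earlier steps and uses the local model as a yes/no faithfulness judge. It is a crude stand-in for Ragas-style metrics; the eval_set.jsonl path and format match the illustrative example in the prerequisites:

      # Run every labelled question through the pipeline and score faithfulness.
      import json

      def judge_faithful(answer, context):
          verdict = generate(
              "Context:\n" + context + "\n\nAnswer:\n" + answer +
              "\n\nDoes every claim in the answer appear in the context? Reply YES or NO."
          )
          return verdict.strip().upper().startswith("YES")

      def run_eval(path="eval_set.jsonl"):
          passed, total = 0, 0
          for line in open(path):
              case = json.loads(line)                    # {"question": ..., "answer": ...}
              chunks = retrieve(case["question"])
              context = "\n\n".join(body for _, _, body in chunks)
              answer = generate(build_prompt(case["question"], chunks))
              passed += judge_faithful(answer, context)
              total += 1
          print(f"faithfulness (judge-approved): {passed}/{total}")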

The numbers

What you'll actually pay (and wait)

Pilot hardware — single RTX 4090 (24 GB) build · $1,500–$2,500 GPU + ~$800–$1,500 supporting hardware
Production small — single A100 80GB or H100 · $8,000–$30,000 capex, or ~$1.50–$3/hr rented (sovereign cloud)
Latency — embedding query (BGE-M3 on RTX 4090) · 20–50 ms per query
Latency — vector retrieval (pgvector, 10M chunks) · ~50–150 ms typical with HNSW index
Latency — reranking top-20 with BGE-reranker-v2 · ~100–250 ms
Latency — generation (Llama 3.3 70B, single A100, vLLM) · p50 ~1.5–3 s for 200-token answers; p95 ~5–8 s under load
Throughput — full pipeline, single A100, batched · ~5–20 queries/sec sustained at acceptable latency
Quality gap — best open-weights (Llama 3.3 70B) vs Claude Opus 4.7 on grounded Q&A · Closing — within 5–10% on most benchmarks; larger gap on adversarial / multi-hop
Cost per query at production scale (one A100, 10 qps sustained) · ~$0.0001–$0.0005 — orders of magnitude below frontier-API equivalents
Eval set size to start trusting the metrics · 50–200 question/answer pairs; ideally curated by domain experts
Re-embedding cost when you change embedder (corpus of 10M chunks) · ~6–24 hours on a single GPU; plan re-embedding as a project, not a routine task

The cost differential vs frontier APIs is real but rarely the actual driver — teams that pick this path are picking it for data control. The cost story is a tailwind, not the headline.

Alternatives

Other ways to solve this

The “fully managed but private” middle path. Sovereign-cloud LLM offerings — AWS Bedrock with PrivateLink + customer-managed keys, Azure OpenAI in a private VNet, Vertex AI in a customer-controlled VPC, Anthropic via AWS Bedrock — give you frontier-model quality with much stronger privacy guarantees than the public consumer endpoints. Faster to deploy; less control than self-hosted. Right answer when your privacy concern is “third party shouldn’t see this” but not “data must never leave my data centre.”

The hybrid: open-weights for retrieval, frontier for generation. Some teams run embeddings locally (cheap, deterministic, no data leaves the network for the corpus) but route the final synthesis to a closed-model API. Vendor exposure shrinks to the per-query context rather than the whole corpus, without the full ops burden of self-hosted generation. Right answer when retrieval scale is large and synthesis quality matters more than synthesis privacy.

A simpler search-first approach. For some corpora, a strong full-text search with snippet extraction (Elasticsearch, Tantivy, or a tuned Postgres FTS) gets 70% of the value at 10% of the operational complexity. Right answer when users will accept “search results with highlights” rather than expecting a single synthesised answer.

Wait twelve months and re-evaluate. The open-weights frontier moves quickly. The right architecture in mid-2027 may include capabilities that don’t exist today (much better small models, cheaper long-context handling, mature on-device inference for some workloads). If your privacy constraint is severe but not urgent, a one-year delay can materially change which tools you’d pick.

Common questions

FAQ

Can I run this on Apple Silicon?

Yes for a single-user pilot — M2/M3 Ultra with 128–192 GB unified memory handles Llama 3.3 70B at usable speeds via Ollama or llama.cpp. Not yet for production multi-user load — the throughput at sustained concurrency is well below what a single A100 delivers, and the tooling for multi-user serving is less mature on macOS than on Linux/CUDA. Good for dev environments and single-engineer testing; not the deployment target.

What's the quality gap vs OpenAI / Claude as of 2026?

On grounded Q&A from a well-built RAG, the gap is within 5–10% on most benchmarks — measurable but not large. On open-ended reasoning, complex code generation, or adversarial questions, the gap is larger and frontier models still lead. The honest answer is workload-specific; benchmark on your eval set before deciding the quality gap matters.

How does this handle PDFs, tables, and images in the corpus?

PDFs need extraction first — see extract structured data from PDFs for the upstream work. Tables benefit from being chunked as structured records rather than rendered prose. Images require a multimodal model in the pipeline (Llama 3.2 Vision or Qwen 2.5 VL on the open-weights side); same architecture, different model and chunking strategy. The orchestration is the same; the per-modality preparation isn't.

What about Whisper or vision models running locally?

Same architectural pattern; different models. Whisper-large-v3 runs comfortably on the same hardware as the embedder for transcription; Qwen 2.5 VL or Llama 3.2 Vision for image understanding. The right way to think about this is: the RAG pipeline is one specific data flow; the underlying "run inference locally" infrastructure supports many flows. Build the platform once; reuse it.

When does "private" become "we should have used the cloud after all"?

Three honest triggers. (1) Latency under load you can't meet — if p95 is degrading on your concurrency target, frontier APIs scale more elastically. (2) Engineering burden you can't sustain — one engineer is enough for a pilot; production at scale needs two or three. (3) A specific capability the open-weights world doesn't have yet — long-context handling beyond ~128k tokens, very fast tool-use, multimodal reasoning at frontier quality. If any of those binds, revisit the sovereign-cloud middle path before scaling the self-hosted setup further.

Change history (1 entry)
  • 2026-05-11 Initial publication.