Cyberax AI Playbook
cyberax.com
How-to · Operations & Knowledge · Local-OK

Build a private knowledge base your team can search

A practical setup for "chat with our docs" — a system that lets your team ask plain-language questions across your internal documents and get grounded answers with citations. It covers the framework choice, the cheap-vs-managed vector database call (a database that stores meaning-as-numbers so you can search by similarity), and the hybrid-search-plus-rerank pattern the field has converged on, with cost ranges and the parts that bite teams in production.

At a glance
Last verified · May 2026
Problem solved · Give your team natural-language search across internal documents the model wasn't trained on
Best for · Teams with a substantial knowledge corpus (wiki pages, PDFs, transcripts, contracts) where "ask a question, get an answer with sources" beats keyword search
Tools · LlamaIndex, pgvector, Postgres, OpenAI embeddings, Cohere Rerank
Hardware · None — runs on existing infrastructure
Difficulty · Intermediate
Cost · $0–$3 one-time for embeddings on small corpora; $0–$700/mo for the vector DB depending on choice
Time to set up · 1–3 days for a working prototype; 3–8 weeks to production

A new employee asks: “What’s our refund policy for enterprise contracts signed before 2024?” The answer lives in a Notion page somewhere, possibly also in a PDF on a shared drive, and three Slack threads from last year. Finding it takes 20 minutes. With a private knowledge base — sometimes called RAG (retrieval-augmented generation, which means letting the AI search your documents before answering) — that same question gets a grounded answer in seconds, with links to the source pages.

If you’ve read RAG explained without acronyms and decided your team genuinely needs one, this is the practical follow-up. We’ll set up a “chat with our docs” system that handles a real corpus, returns grounded answers with citations, and is honest enough to say “I don’t know” when the answer isn’t in your data.

This is opinionated — the goal is the shortest path from zero to a working system you’d let a colleague use. Variations are noted along the way.

When to use

Where this solution fits — and where it doesn't

Use this if you have a substantial corpus (think hundreds of pages and up — wiki pages, PDFs, contracts, support tickets, transcripts), the questions people ask are open-ended, and the answers depend on the specific contents of those documents. Examples that fit naturally: an internal handbook search, a “find precedents in our past projects” tool, a customer-facing support assistant grounded in your help docs.

Don’t use this if your corpus is small enough to fit in a single prompt (under ~1 million tokens — modern flagship models can hold an entire library shelf), the questions are highly structured (“what’s the price of product X?” — that’s SQL), or the information changes by the second (live system status, market data — use an API call, not a retrieval index).

Prerequisites

What you'll need before starting

  • A Python environment, or comfort installing one (3.11+ recommended).
  • An API key from an LLM vendor (OpenAI, Anthropic, or Google) — you may need two, because the embedding model and the generation model can come from different vendors.
  • An existing Postgres database, or willingness to run one — pgvector turns it into a vector database for free.
  • Your documents in a form you can read programmatically — files in a folder, an S3 bucket, a Notion export, a Confluence dump. Ugly is fine; messy is fine; gigabytes are fine. Locked behind a login you don’t have is not fine — solve access first.

The solution

Six steps to a working knowledge base

  1. Pick your stack — and default to the boring one

    Use LlamaIndex as the framework, pgvector on Postgres for storage, OpenAI text-embedding-3-small for embeddings, and Claude Sonnet or GPT-5 for generation. Reasons: LlamaIndex requires meaningfully less code than LangChain for a basic RAG pipeline, pgvector lives inside infrastructure you already run, and text-embedding-3-small is cheap and good enough for nine corpora out of ten. Resist the urge to mix vendors at this stage — pick one, ship, then optimise.
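    Wired into LlamaIndex, that stack is a few lines of global configuration. A minimal sketch, assuming the packages in the comment are installed and your API keys are in the environment (the Claude model string is a placeholder; use whichever Sonnet or GPT model your vendor currently lists):

      # Recommended stack wired into LlamaIndex's global settings.
      # Assumes: pip install llama-index llama-index-embeddings-openai \
      #          llama-index-llms-anthropic llama-index-vector-stores-postgres
      from llama_index.core import Settings
      from llama_index.embeddings.openai import OpenAIEmbedding
      from llama_index.llms.anthropic import Anthropic

      Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
      Settings.llm = Anthropic(model="claude-sonnet-4-5")  # placeholder model id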

  2. Get a working prototype on a small slice — same day

    Take 50–100 representative documents (not your whole corpus), run them through LlamaIndex’s VectorStoreIndex.from_documents, point it at pgvector, and ask three real questions you’d want answered. Look at what comes back. If the answers are usable, you’ve validated the corpus is RAG-friendly. If they’re nonsense, the problem is almost always the documents themselves — and no amount of tuning will fix garbage input.
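    In code, the prototype is roughly the sketch below. It assumes the Settings from step 1, a Postgres instance with the pgvector extension enabled, and placeholder connection details and folder path:

      # Minimal prototype: embed a small slice of documents into pgvector, ask a question.
      from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
      from llama_index.vector_stores.postgres import PGVectorStore

      documents = SimpleDirectoryReader("./docs_sample").load_data()  # 50-100 representative files

      vector_store = PGVectorStore.from_params(
          host="localhost", port="5432", database="knowledge",
          user="rag", password="...",              # placeholder credentials
          table_name="kb_chunks", embed_dim=1536,  # 1536 = text-embedding-3-small dimension
      )
      storage_context = StorageContext.from_defaults(vector_store=vector_store)
      index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

      query_engine = index.as_query_engine()
      print(query_engine.query("What's our refund policy for enterprise contracts signed before 2024?"))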

  3. Choose a chunking strategy that respects document structure

    The default fixed-size chunker (e.g. 512 tokens with 50-token overlap) is a fine starting point but throws away document structure. Switch to a structure-aware splitter as soon as it matters: split on headings for Markdown / wiki content, split on sentences for prose, keep tables intact, keep code blocks intact. LlamaIndex ships several. The biggest reliability bump beyond defaults: contextual retrieval (Anthropic’s technique linked below) — prepend a one-line LLM-generated summary of the parent document to each chunk before embedding. It reduces retrieval failures by ~35% in their published benchmarks; the cost is one cheap LLM call per chunk at index time.
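    A sketch of both ideas using LlamaIndex's built-in parsers; the context prompt and the 8,000-character truncation are illustrative assumptions, not Anthropic's exact recipe:

      # Structure-aware splitting plus a rough version of contextual retrieval:
      # prepend a one-line, LLM-written summary of the parent document to each chunk.
      from llama_index.core import Settings
      from llama_index.core.node_parser import MarkdownNodeParser, SentenceSplitter

      md_parser = MarkdownNodeParser()                                   # splits wiki/Markdown on headings
      prose_parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)  # sentence-aware for plain prose

      def add_context(nodes, doc_text):
          """Prepend a document-level summary to every chunk before it gets embedded."""
          summary = Settings.llm.complete(
              "In one sentence, say what this document is and what it covers:\n\n" + doc_text[:8000]
          ).text.strip()
          for node in nodes:
              node.text = f"[Document context: {summary}]\n{node.text}"
          return nodes

      # e.g. for one wiki page:  nodes = add_context(md_parser.get_nodes_from_documents([doc]), doc.text)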

  4. Add hybrid search and a reranker — this is the big quality jump

    Naive RAG uses pure vector similarity. Vendor write-ups through 2026 converge on the same point: hybrid search (vector + classical keyword via BM25) plus a reranker (a small specialised model that reorders the top-N candidates) is the single biggest quality improvement you can make. Concretely: retrieve the top 50–100 candidates via hybrid search, pass them to Cohere Rerank or a similar model, keep the top 5–10 for the LLM. This step is the difference between “the demo works” and “users actually use it.”
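    In LlamaIndex terms that looks roughly like the sketch below. It assumes the PGVectorStore from step 2 was created with hybrid_search=True (so Postgres keeps a keyword index alongside the vectors), a COHERE_API_KEY in the environment, and the llama-index-postprocessor-cohere-rerank package installed:

      # Hybrid retrieval (vector + keyword) with a rerank pass before generation.
      from llama_index.postprocessor.cohere_rerank import CohereRerank

      reranker = CohereRerank(top_n=8)          # keep the best 8 chunks for the LLM

      query_engine = index.as_query_engine(     # index built in step 2
          vector_store_query_mode="hybrid",     # vector similarity plus keyword search
          similarity_top_k=50,                  # cast a wide net on the vector side
          sparse_top_k=50,                      # and on the keyword side
          node_postprocessors=[reranker],       # rerank the combined candidates
      )
      response = query_engine.query("Which enterprise contracts include a data-residency clause?")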

  5. Build a small evaluation set — even fifty examples beats vibes

    Write down 30–100 question-and-expected-answer pairs covering the kinds of questions your users will actually ask. Note which document(s) contain each answer. Now you can measure: did retrieval find the right document? Did generation use it correctly? Without this, you’ll spend months tuning blindly. Re-run the eval after every change. The eval set is the most valuable artefact you’ll build — back it up.
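    A minimal retrieval check can be as crude as the sketch below (the questions are examples, and "file_name" is the metadata key SimpleDirectoryReader attaches by default; adjust to whatever your loader stores):

      # Tiny retrieval eval: did the right source document show up in the top-k?
      eval_set = [
          {"question": "What's our refund policy for pre-2024 enterprise contracts?",
           "expected_source": "refund-policy.md"},
          # ... 30-100 of these, written by people who know the corpus
      ]

      retriever = index.as_retriever(similarity_top_k=10)

      hits = 0
      for case in eval_set:
          retrieved = {n.node.metadata.get("file_name", "") for n in retriever.retrieve(case["question"])}
          if case["expected_source"] in retrieved:
              hits += 1
      print(f"retrieval hit rate: {hits}/{len(eval_set)}")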

  6. Force citations and a “don’t know” path

    Instruct the model: “Answer using only the passages below. Cite the passage you used for each claim. If the passages don’t contain the answer, say so — do not guess.” Then surface those citations to the user as clickable links to the source document. Two effects, both critical: users can verify the answer (so they trust the system enough to use it), and the model is much less likely to fabricate when it’s instructed to cite. The “don’t know” path is the difference between a knowledge base and a confident-bullshit machine.
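    One way to wire that in, and to get at the citations programmatically, is sketched below; the prompt wording is an example rather than a magic string, and response.source_nodes is what your UI would turn into clickable links:

      # Grounding prompt plus source surfacing.
      from llama_index.core.prompts import PromptTemplate

      qa_prompt = PromptTemplate(
          "Answer using only the passages below. Cite the passage you used for each claim.\n"
          "If the passages don't contain the answer, say so. Do not guess.\n"
          "---------------------\n{context_str}\n---------------------\n"
          "Question: {query_str}\nAnswer: "
      )

      query_engine = index.as_query_engine(
          text_qa_template=qa_prompt,
          node_postprocessors=[reranker],       # reranker from step 4
      )
      response = query_engine.query("How long is the enterprise refund window?")

      print(response.response)
      for source in response.source_nodes:      # render these as links to the source documents
          print(source.node.metadata.get("file_path"), source.score)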

Cost & constraints

What this actually costs to run

Embedding cost — one-time index build · ~$0.02 per million tokens (≈ $0.50–$3 to embed a 25-million-word corpus)
Vector DB — pgvector on existing Postgres · $0 incremental
Vector DB — managed (Pinecone / Qdrant / Weaviate cloud) · $20–$700/month depending on scale and vendor (Pinecone Builder $20/month, Standard from $50/month, Enterprise from $500/month)
Per-question retrieval cost · Fractions of a cent (one embedding + one DB query + one rerank call)
Per-question generation cost · $0.001–$0.05 typical (depends on context size and model)
Time to first working prototype · 1–3 days using LlamaIndex defaults
Time to production-ready (eval + hybrid + rerank + monitoring) · 3–8 weeks of focused work
Practical scale on pgvector · Up to ~5–10 million vectors before reaching for a dedicated vector DB

The gap between “first prototype” and “production-ready” is where the real cost lives — and it’s almost entirely engineering time, not infrastructure spend. If you’re tempted to skip the eval harness and the rerank step, don’t. They’re the difference between a demo and a tool people use twice.

Alternatives

Other ways to solve this

ChatGPT Custom GPTs / Claude Projects / Microsoft Copilot grounding. Hosted RAG with no code. Right answer for small corpora, single-team use, low compliance burden. Hits a wall at corpus size, multi-source data, or any need for fine-grained access control or audit logging.

Google NotebookLM. Excellent for individual or small-team research workflows on a curated set of sources (50 sources per notebook on the free tier; 300 on Plus, 600 on Ultra). Not built for organisation-wide deployment, but underrated for personal use.

Search-with-AI services (Glean, Mendable, Vectara, Algolia AI). Hosted enterprise search products that abstract away the infrastructure choices. Worth evaluating when the engineering time saved exceeds vendor lock-in cost and the per-seat pricing — typically $20–$50/seat/month.

Long-context-window approach. For corpora under ~1 million tokens, paste the whole thing into the prompt and skip RAG entirely. Modern flagship models handle this fine. The infrastructure cost is zero; per-query cost is higher because you’re paying for the full corpus on every question. Break-even depends on query volume.

Common questions

FAQ

Do we really need a reranker, or is hybrid search enough?

Hybrid search alone is a meaningful improvement over pure vector. Adding a reranker on top is another big jump — published benchmarks consistently show the rerank step catching cases hybrid misses, especially for ambiguous queries. The marginal cost is one extra API call per query (Cohere Rerank is ~$0.002 per search at $2 per 1,000). For any system real users depend on, ship the reranker.

Should we self-host the embedding model to save cost?

Almost never, at the start. text-embedding-3-small is $0.02 per million tokens — embedding a 25-million-word corpus costs about $3. The operational cost of self-hosting an embedding model on GPU infrastructure dwarfs that for most teams. Reach for self-hosted embeddings (e.g. bge-large-en-v1.5 or nomic-embed-text) only when you have a privacy requirement that prohibits cloud calls, or when you're embedding hundreds of millions of documents.

How do we handle access control — different documents for different users?

Store each document's permission scope as metadata in the vector database, then filter the search by the requesting user's permissions before retrieval. Both pgvector and dedicated vector DBs support metadata filtering. The mistake to avoid: filtering after retrieval. Filter at the query level, or you'll leak the existence of restricted documents through ranking artifacts.
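A sketch of query-level filtering in LlamaIndex, assuming each chunk was indexed with an "allowed_group" metadata field (the key name and group values are placeholders for whatever your permission model uses, and "index" is the VectorStoreIndex built in the solution steps):

    # Filter at query time by permission metadata attached to each chunk at index time.
    from llama_index.core.vector_stores import FilterOperator, MetadataFilter, MetadataFilters

    def engine_for_user(user_groups: list[str]):
        allowed = MetadataFilters(filters=[
            MetadataFilter(key="allowed_group", value=user_groups, operator=FilterOperator.IN),
        ])
        return index.as_query_engine(filters=allowed)

    engine = engine_for_user(["sales", "all-staff"])   # this user only ever sees sales + all-staff docs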

What about answers from PDFs with tables, images, or scanned text?

Tables and images are where naive pipelines lose the most accuracy. Use a structure-aware parser like LlamaParse, Unstructured, or pdfplumber to keep tables intact. For scanned PDFs, run OCR first (Tesseract for free; Azure Document Intelligence or AWS Textract for accuracy on messy scans). Treat the parsing step as a real piece of work, not a one-line library call.
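For example, a rough pdfplumber pass that pulls tables out as their own chunks so they survive chunking intact (the file path is a placeholder):

    # Extract tables from a PDF separately, tagged with the page they came from.
    import pdfplumber

    with pdfplumber.open("contracts/example-msa.pdf") as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            for table in page.extract_tables():
                rows = ["| " + " | ".join(cell or "" for cell in row) + " |" for row in table]
                print(f"[page {page_number}]\n" + "\n".join(rows))   # index these as standalone chunks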

How often do we need to re-index?

Depends on document churn. Most teams settle on a nightly incremental update — only changed documents get re-embedded. Full rebuilds happen quarterly or when you change embedding model, chunking strategy, or contextual-retrieval prompts. Build incremental indexing in from day one; retrofitting it later is painful.
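With LlamaIndex, incremental updates can lean on stable document ids. A sketch of the nightly job, assuming documents are loaded with filename_as_id=True and "index" is the VectorStoreIndex from the solution steps:

    # Nightly incremental update: re-embed only documents whose content has changed.
    from llama_index.core import SimpleDirectoryReader

    fresh_docs = SimpleDirectoryReader("./docs", filename_as_id=True).load_data()
    refreshed = index.refresh_ref_docs(fresh_docs)   # one boolean per document: True = re-indexed
    print(f"re-indexed {sum(refreshed)} of {len(fresh_docs)} documents")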

Can we do this without writing Python?

Yes for small scale: ChatGPT Custom GPTs, Claude Projects, NotebookLM, or Microsoft Copilot all let you upload documents and chat with them. None scale comfortably past a few hundred documents or support real access control. Once the corpus or the access requirements grow, you'll write code — or hire someone who will.

Sources & references

Change history (1 entry)
  • 2026-05-10 Initial publication.