Cyberax AI Playbook
cyberax.com
How-to · Operations & Knowledge · Local-OK

Build a private knowledge base your team can search

A practical setup for "chat with our docs" — a system that lets your team ask plain-language questions across your internal documents and get grounded answers with citations. It covers the framework choice, the cheap-vs-managed vector database call (a database that stores meaning-as-numbers so you can search by similarity), and the hybrid-search-plus-rerank pattern the field has converged on, with cost ranges and the parts that bite teams in production.

At a glance
Last verified · May 2026
Problem solved · Give your team natural-language search across internal documents the model wasn't trained on
Best for · Teams with a substantial knowledge corpus (wiki pages, PDFs, transcripts, contracts) where "ask a question, get an answer with sources" beats keyword search
Tools · LlamaIndex, pgvector, Postgres, OpenAI embeddings, Cohere Rerank
Hardware · None — runs on existing infrastructure
Difficulty · Intermediate
Cost · $0–$3 one-time for embeddings on small corpora; $0–$700/mo for the vector DB depending on choice
Time to set up · 1–3 days for a working prototype; 3–8 weeks to production

A new employee asks: “What’s our refund policy for enterprise contracts signed before 2024?” The answer lives in a Notion page somewhere, possibly also in a PDF on a shared drive, and three Slack threads from last year. Finding it takes 20 minutes. With a private knowledge base — sometimes called RAG (retrieval-augmented generation, which means letting the AI search your documents before answering) — that same question gets a grounded answer in seconds, with links to the source pages.

If you’ve read RAG explained without acronyms and decided your team genuinely needs one, this is the practical follow-up. We’ll set up a “chat with our docs” system that handles a real corpus, returns grounded answers with citations, and is honest enough to say “I don’t know” when the answer isn’t in your data.

This is opinionated — the goal is the shortest path from zero to a working system you’d let a colleague use. Variations are noted along the way.

When to use

Where this solution fits — and where it doesn't

Use this if you have a substantial corpus (think hundreds of pages and up — wiki pages, PDFs, contracts, support tickets, transcripts), the questions people ask are open-ended, and the answers depend on the specific contents of those documents. Examples that fit naturally: an internal handbook search, a “find precedents in our past projects” tool, a customer-facing support assistant grounded in your help docs.

Don’t use this if your corpus is small enough to fit in a single prompt (under ~1 million tokens — modern flagship models can hold an entire library shelf), the questions are highly structured (“what’s the price of product X?” — that’s SQL), or the information changes by the second (live system status, market data — use an API call, not a retrieval index).

Prerequisites

What you'll need before starting

  • A Python environment, or comfort installing one (3.11+ recommended).
  • An API key from an LLM vendor (OpenAI, Anthropic, or Google) — you may need two, because the embedding model and the generation model can come from different vendors.
  • An existing Postgres database, or willingness to run one — pgvector turns it into a vector database for free.
  • Your documents in a form you can read programmatically — files in a folder, an S3 bucket, a Notion export, a Confluence dump. Ugly is fine; messy is fine; gigabytes are fine. Locked behind a login you don’t have is not fine — solve access first.

The solution

Six steps to a working knowledge base

  1. Pick your stack — and default to the boring one

    Use LlamaIndex as the framework, pgvector on Postgres for storage, OpenAI text-embedding-3-small for embeddings, and Claude Sonnet or GPT-5 for generation. Reasons: LlamaIndex requires meaningfully less code than LangChain for a basic RAG pipeline, pgvector lives inside infrastructure you already run, and text-embedding-3-small is cheap and good enough for nine corpora out of ten. Resist the urge to mix vendors at this stage — pick one, ship, then optimise.
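    Wired into LlamaIndex, that stack is a few lines of global configuration. A minimal sketch, assuming the packages in the comment are installed and your API keys are in the environment (the Claude model string is a placeholder; use whichever Sonnet or GPT model your vendor currently lists):

      # Recommended stack wired into LlamaIndex's global settings.
      # Assumes: pip install llama-index llama-index-embeddings-openai \
      #          llama-index-llms-anthropic llama-index-vector-stores-postgres
      from llama_index.core import Settings
      from llama_index.embeddings.openai import OpenAIEmbedding
      from llama_index.llms.anthropic import Anthropic

      Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
      Settings.llm = Anthropic(model="claude-sonnet-4-5")  # placeholder model id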

  2. Get a working prototype on a small slice — same day

    Take 50–100 representative documents (not your whole corpus), run them through LlamaIndex’s VectorStoreIndex.from_documents, point it at pgvector, and ask three real questions you’d want answered. Look at what comes back. If the answers are usable, you’ve validated the corpus is RAG-friendly. If they’re nonsense, the problem is almost always the documents themselves — and no amount of tuning will fix garbage input.
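    In code, the prototype is roughly the sketch below. It assumes the Settings from step 1, a Postgres instance with the pgvector extension enabled, and placeholder connection details and folder path:

      # Minimal prototype: embed a small slice of documents into pgvector, ask a question.
      from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
      from llama_index.vector_stores.postgres import PGVectorStore

      documents = SimpleDirectoryReader("./docs_sample").load_data()  # 50-100 representative files

      vector_store = PGVectorStore.from_params(
          host="localhost", port="5432", database="knowledge",
          user="rag", password="...",              # placeholder credentials
          table_name="kb_chunks", embed_dim=1536,  # 1536 = text-embedding-3-small dimension
      )
      storage_context = StorageContext.from_defaults(vector_store=vector_store)
      index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

      query_engine = index.as_query_engine()
      print(query_engine.query("What's our refund policy for enterprise contracts signed before 2024?"))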

  3. Choose a chunking strategy that respects document structure

    The default fixed-size chunker (e.g. 512 tokens with 50-token overlap) is a fine starting point but throws away document structure. Switch to a structure-aware splitter as soon as it matters: split on headings for Markdown / wiki content, split on sentences for prose, keep tables intact, keep code blocks intact. LlamaIndex ships several. The biggest reliability bump beyond defaults: contextual retrieval (Anthropic’s technique linked below) — prepend a one-line LLM-generated summary of the parent document to each chunk before embedding. It reduces retrieval failures by ~35% in their published benchmarks; the cost is one cheap LLM call per chunk at index time.
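    A sketch of both ideas using LlamaIndex's built-in parsers; the context prompt and the 8,000-character truncation are illustrative assumptions, not Anthropic's exact recipe:

      # Structure-aware splitting plus a rough version of contextual retrieval:
      # prepend a one-line, LLM-written summary of the parent document to each chunk.
      from llama_index.core import Settings
      from llama_index.core.node_parser import MarkdownNodeParser, SentenceSplitter

      md_parser = MarkdownNodeParser()                                   # splits wiki/Markdown on headings
      prose_parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)  # sentence-aware for plain prose

      def add_context(nodes, doc_text):
          """Prepend a document-level summary to every chunk before it gets embedded."""
          summary = Settings.llm.complete(
              "In one sentence, say what this document is and what it covers:\n\n" + doc_text[:8000]
          ).text.strip()
          for node in nodes:
              node.text = f"[Document context: {summary}]\n{node.text}"
          return nodes

      # e.g. for one wiki page:  nodes = add_context(md_parser.get_nodes_from_documents([doc]), doc.text)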

  4. Add hybrid search and a reranker — this is the big quality jump

    Naive RAG uses pure vector similarity. Vendor write-ups through 2026 converge on the same point: hybrid search (vector + classical keyword via BM25) plus a reranker (a small specialised model that reorders the top-N candidates) is the single biggest quality improvement you can make. Concretely: retrieve the top 50–100 candidates via hybrid search, pass them to Cohere Rerank or a similar model, keep the top 5–10 for the LLM. This step is the difference between “the demo works” and “users actually use it.”
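    In LlamaIndex terms that looks roughly like the sketch below. It assumes the PGVectorStore from step 2 was created with hybrid_search=True (so Postgres keeps a keyword index alongside the vectors), a COHERE_API_KEY in the environment, and the llama-index-postprocessor-cohere-rerank package installed:

      # Hybrid retrieval (vector + keyword) with a rerank pass before generation.
      from llama_index.postprocessor.cohere_rerank import CohereRerank

      reranker = CohereRerank(top_n=8)          # keep the best 8 chunks for the LLM

      query_engine = index.as_query_engine(     # index built in step 2
          vector_store_query_mode="hybrid",     # vector similarity plus keyword search
          similarity_top_k=50,                  # cast a wide net on the vector side
          sparse_top_k=50,                      # and on the keyword side
          node_postprocessors=[reranker],       # rerank the combined candidates
      )
      response = query_engine.query("Which enterprise contracts include a data-residency clause?")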

  5. Build a small evaluation set — even fifty examples beats vibes

    Write down 30–100 question-and-expected-answer pairs covering the kinds of questions your users will actually ask. Note which document(s) contain each answer. Now you can measure: did retrieval find the right document? Did generation use it correctly? Without this, you’ll spend months tuning blindly. Re-run the eval after every change. The eval set is the most valuable artefact you’ll build — back it up.
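    A minimal retrieval check can be as crude as the sketch below (the questions are examples, and "file_name" is the metadata key SimpleDirectoryReader attaches by default; adjust to whatever your loader stores):

      # Tiny retrieval eval: did the right source document show up in the top-k?
      eval_set = [
          {"question": "What's our refund policy for pre-2024 enterprise contracts?",
           "expected_source": "refund-policy.md"},
          # ... 30-100 of these, written by people who know the corpus
      ]

      retriever = index.as_retriever(similarity_top_k=10)

      hits = 0
      for case in eval_set:
          retrieved = {n.node.metadata.get("file_name", "") for n in retriever.retrieve(case["question"])}
          if case["expected_source"] in retrieved:
              hits += 1
      print(f"retrieval hit rate: {hits}/{len(eval_set)}")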

  6. Force citations and a “don’t know” path

    Instruct the model: “Answer using only the passages below. Cite the passage you used for each claim. If the passages don’t contain the answer, say so — do not guess.” Then surface those citations to the user as clickable links to the source document. Two effects, both critical: users can verify the answer (so they trust the system enough to use it), and the model is much less likely to fabricate when it’s instructed to cite. The “don’t know” path is the difference between a knowledge base and a confident-bullshit machine.
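    One way to wire that in, and to get at the citations programmatically, is sketched below; the prompt wording is an example rather than a magic string, and response.source_nodes is what your UI would turn into clickable links:

      # Grounding prompt plus source surfacing.
      from llama_index.core.prompts import PromptTemplate

      qa_prompt = PromptTemplate(
          "Answer using only the passages below. Cite the passage you used for each claim.\n"
          "If the passages don't contain the answer, say so. Do not guess.\n"
          "---------------------\n{context_str}\n---------------------\n"
          "Question: {query_str}\nAnswer: "
      )

      query_engine = index.as_query_engine(
          text_qa_template=qa_prompt,
          node_postprocessors=[reranker],       # reranker from step 4
      )
      response = query_engine.query("How long is the enterprise refund window?")

      print(response.response)
      for source in response.source_nodes:      # render these as links to the source documents
          print(source.node.metadata.get("file_path"), source.score)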

Cost & constraints

What this actually costs to run

Embedding cost — one-time index build · ~$0.02 per million tokens (≈ $0.50–$3 to embed a 25-million-word corpus)
Vector DB — pgvector on existing Postgres · $0 incremental
Vector DB — managed (Pinecone / Qdrant / Weaviate cloud) · $20–$700/month depending on scale and vendor (Pinecone Builder $20/month, Standard from $50/month, Enterprise from $500/month)
Per-question retrieval cost · Fractions of a cent (one embedding + one DB query + one rerank call)
Per-question generation cost · $0.001–$0.05 typical (depends on context size and model)
Time to first working prototype · 1–3 days using LlamaIndex defaults
Time to production-ready (eval + hybrid + rerank + monitoring) · 3–8 weeks of focused work
Practical scale on pgvector · Up to ~5–10 million vectors before reaching for a dedicated vector DB

The gap between “first prototype” and “production-ready” is where the real cost lives — and it’s almost entirely engineering time, not infrastructure spend. If you’re tempted to skip the eval harness and the rerank step, don’t. They’re the difference between a demo and a tool people use twice.

Alternatives

Other ways to solve this

ChatGPT Custom GPTs / Claude Projects / Microsoft Copilot grounding. Hosted RAG with no code. Right answer for small corpora, single-team use, low compliance burden. Hits a wall at corpus size, multi-source data, or any need for fine-grained access control or audit logging.

Google NotebookLM. Excellent for individual or small-team research workflows on a curated set of sources (50 sources per notebook on the free tier; 300 on Plus, 600 on Ultra). Not built for organisation-wide deployment, but underrated for personal use.

Search-with-AI services (Glean, Mendable, Vectara, Algolia AI). Hosted enterprise search products that abstract away the infrastructure choices. Worth evaluating when the engineering time saved exceeds vendor lock-in cost and the per-seat pricing — typically $20–$50/seat/month.

Long-context-window approach. For corpora under ~1 million tokens, paste the whole thing into the prompt and skip RAG entirely. Modern flagship models handle this fine. The infrastructure cost is zero; per-query cost is higher because you’re paying for the full corpus on every question. Break-even depends on query volume.

Common questions

FAQ

Do we really need a reranker, or is hybrid search enough?

Hybrid search alone is a meaningful improvement over pure vector. Adding a reranker on top is another big jump — published benchmarks consistently show the rerank step catching cases hybrid misses, especially for ambiguous queries. The marginal cost is one extra API call per query (Cohere Rerank is ~$0.002 per search at $2 per 1,000). For any system real users depend on, ship the reranker.

Should we self-host the embedding model to save cost?

Almost never, at the start. text-embedding-3-small is $0.02 per million tokens — embedding a 25-million-word corpus costs about $3. The operational cost of self-hosting an embedding model on GPU infrastructure dwarfs that for most teams. Reach for self-hosted embeddings (e.g. bge-large-en-v1.5 or nomic-embed-text) only when you have a privacy requirement that prohibits cloud calls, or when you're embedding hundreds of millions of documents.

How do we handle access control — different documents for different users?

Store each document's permission scope as metadata in the vector database, then filter the search by the requesting user's permissions before retrieval. Both pgvector and dedicated vector DBs support metadata filtering. The mistake to avoid: filtering after retrieval. Filter at the query level, or you'll leak the existence of restricted documents through ranking artifacts.
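A sketch of query-level filtering in LlamaIndex, assuming each chunk was indexed with an "allowed_group" metadata field (the key name and group values are placeholders for whatever your permission model uses, and "index" is the VectorStoreIndex built in the solution steps):

    # Filter at query time by permission metadata attached to each chunk at index time.
    from llama_index.core.vector_stores import FilterOperator, MetadataFilter, MetadataFilters

    def engine_for_user(user_groups: list[str]):
        allowed = MetadataFilters(filters=[
            MetadataFilter(key="allowed_group", value=user_groups, operator=FilterOperator.IN),
        ])
        return index.as_query_engine(filters=allowed)

    engine = engine_for_user(["sales", "all-staff"])   # this user only ever sees sales + all-staff docs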

What about answers from PDFs with tables, images, or scanned text?

Tables and images are where naive pipelines lose the most accuracy. Use a structure-aware parser like LlamaParse, Unstructured, or pdfplumber to keep tables intact. For scanned PDFs, run OCR first (Tesseract for free; Azure Document Intelligence or AWS Textract for accuracy on messy scans). Treat the parsing step as a real piece of work, not a one-line library call.
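For example, a rough pdfplumber pass that pulls tables out as their own chunks so they survive chunking intact (the file path is a placeholder):

    # Extract tables from a PDF separately, tagged with the page they came from.
    import pdfplumber

    with pdfplumber.open("contracts/example-msa.pdf") as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            for table in page.extract_tables():
                rows = ["| " + " | ".join(cell or "" for cell in row) + " |" for row in table]
                print(f"[page {page_number}]\n" + "\n".join(rows))   # index these as standalone chunks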

How often do we need to re-index?

Depends on document churn. Most teams settle on a nightly incremental update — only changed documents get re-embedded. Full rebuilds happen quarterly or when you change embedding model, chunking strategy, or contextual-retrieval prompts. Build incremental indexing in from day one; retrofitting it later is painful.
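With LlamaIndex, incremental updates can lean on stable document ids. A sketch of the nightly job, assuming documents are loaded with filename_as_id=True and "index" is the VectorStoreIndex from the solution steps:

    # Nightly incremental update: re-embed only documents whose content has changed.
    from llama_index.core import SimpleDirectoryReader

    fresh_docs = SimpleDirectoryReader("./docs", filename_as_id=True).load_data()
    refreshed = index.refresh_ref_docs(fresh_docs)   # one boolean per document: True = re-indexed
    print(f"re-indexed {sum(refreshed)} of {len(fresh_docs)} documents")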

Can we do this without writing Python?

Yes for small scale: ChatGPT Custom GPTs, Claude Projects, NotebookLM, or Microsoft Copilot all let you upload documents and chat with them. None scale comfortably past a few hundred documents or support real access control. Once the corpus or the access requirements grow, you'll write code — or hire someone who will.

Sources & references

Change history (1 entry)
  • 2026-05-10 Initial publication.