Three techniques can change what an AI model does for you. In ascending order of cost and complexity:
- Prompt engineering means rewriting your instructions until the model produces what you want. You change the question; the model stays the same.
- RAG — short for retrieval-augmented generation — means giving the AI access to your specific documents so it can answer from your information rather than guessing from its training data. You change what the model can see; the model still stays the same.
- Fine-tuning means training a model further on your own examples so its behaviour shifts permanently. You change the model itself.
“Should we fine-tune the model?” is the most expensive question an operator can ask without realising it. The honest answer in 2026 is almost always no — not because fine-tuning is bad, but because the question usually arrives before the cheaper alternatives have been tried.
The three techniques form a ladder. You climb only when the rung below has demonstrably failed. What follows is the decision tree for the operator deciding which one (if any) their team should reach for.
What each technique actually changes
Prompt engineering changes what you ask. The model stays the same; the instructions, examples, and structure of your request improve. This is what most people are doing when they say they’re “using ChatGPT.” Good prompts have a clear role for the model, explicit constraints, worked examples (few-shot prompting), and a defined output format. Setup cost: zero. Iteration cost: minutes per try. Reversible: completely.
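To make those four ingredients concrete, here is a minimal sketch in Python. The ticket-triage task, labels, and example tickets are invented for illustration, not a recommended prompt:

```python
# A minimal sketch of the four ingredients named above: role, constraints,
# few-shot examples, and a defined output format. The task is invented.

ROLE = "You are a support-ticket triager for a SaaS company."

CONSTRAINTS = (
    "Classify each ticket as exactly one of: billing, bug, how-to, other.\n"
    "If unsure, answer 'other'. Never invent a category."
)

FEW_SHOT = (
    'Ticket: "I was charged twice this month."\nLabel: billing\n\n'
    'Ticket: "The export button does nothing in Firefox."\nLabel: bug'
)

def build_prompt(ticket_text: str) -> str:
    # Defined output format: the model completes after "Label:".
    return f'{ROLE}\n\n{CONSTRAINTS}\n\n{FEW_SHOT}\n\nTicket: "{ticket_text}"\nLabel:'

print(build_prompt("How do I add a teammate to our workspace?"))
```

Every iteration on a prompt like this costs minutes; that is the whole appeal of the first rung.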
RAG changes what the model sees. The model stays the same; before each answer, a retrieval step pulls relevant passages from your own data and pastes them into the prompt. The model is now answering an open-book exam on your binder rather than a closed-book exam on its memory. Setup cost: days to weeks. Per-question cost: fractions of a cent for retrieval plus standard generation. Reversible: you can swap retrievers or turn it off. (For the mechanics, see RAG explained without acronyms.)
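Here is a minimal sketch of that retrieve-then-paste step, with a toy word-overlap retriever standing in for the embeddings and vector index a real system would use. The corpus and scoring are invented for illustration:

```python
# A minimal sketch of retrieval-augmented generation: fetch relevant
# passages, paste them into the prompt, then generate. Toy retrieval only;
# production systems use embeddings and a vector index instead.

CORPUS = {
    "refund-policy.md": "Refunds are issued within 14 days of purchase.",
    "sla.md": "Uptime is guaranteed at 99.9% per calendar month.",
}

def retrieve(question: str, k: int = 2) -> list[str]:
    # Rank documents by word overlap with the question (illustrative only).
    def score(text: str) -> int:
        return len(set(question.lower().split()) & set(text.lower().split()))
    return sorted(CORPUS.values(), key=score, reverse=True)[:k]

def build_rag_prompt(question: str) -> str:
    passages = "\n---\n".join(retrieve(question))
    # The retrieved passages become the "open book" the model answers from.
    return f"Answer using ONLY these passages:\n{passages}\n\nQuestion: {question}"

print(build_rag_prompt("How long do refunds take?"))
```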
Fine-tuning changes the model itself. You take an existing model and continue its training on a curated dataset of input/output pairs — typically thousands to tens of thousands of examples. The model’s weights shift toward the patterns in your data. Setup cost: weeks to months. Maintenance cost: every base-model upgrade requires re-training. Reversible: only by reverting to the un-fine-tuned model.
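For a sense of what those input/output pairs look like on disk, here is one common convention (OpenAI's chat-format JSONL for fine-tuning); the content is invented:

```python
# A sketch of one training example in OpenAI's chat-format JSONL.
# A real training file is thousands of these, one JSON object per line.

import json

example = {
    "messages": [
        {"role": "system", "content": "You write release notes in house style."},
        {"role": "user", "content": "Summarise: fixed login timeout bug"},
        {"role": "assistant", "content": "Fixed: sessions no longer expire mid-login."},
    ]
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")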
Two flavours of fine-tuning worth knowing about:
- Full fine-tuning of a frontier proprietary model — OpenAI’s reinforcement fine-tuning, custom enterprise programs at OpenAI and Anthropic. Heavy, expensive, rare. Five figures and up; only worth it when the workload is large and the model genuinely needs to internalise patterns that prompts and retrieval can’t teach.
- LoRA / QLoRA adapters on top of open-weights models (Llama, Mistral, Qwen). A small “adapter” layer is trained while the base model’s weights stay frozen. Two to four figures of cost, runs on a single GPU, swappable per use case. This is what most “we fine-tuned a model” projects in 2026 actually mean.
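For the open-weights route, a minimal sketch of attaching a LoRA adapter with the Hugging Face PEFT library; the model name and hyperparameters are illustrative, not recommendations:

```python
# A minimal sketch of LoRA with Hugging Face PEFT: the base model's weights
# stay frozen while a small adapter layer is trained on top.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # illustrative

config = LoraConfig(
    r=8,                                  # adapter rank: small = cheap
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # which layers get adapters
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)      # base weights stay frozen
model.print_trainable_parameters()        # typically well under 1% of the base
```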
Climb only when the rung below has failed
The ladder, in order:
- Try prompt engineering first. Always. Reframe the task, add few-shot examples, tighten output constraints, give the model an explicit role. Most things that feel like “the model is dumb” turn out to be “the prompt was underspecified.” A team that hasn’t spent a week on prompt iteration cannot tell whether fine-tuning is needed — they haven’t built a baseline.
- Climb to RAG when the problem is missing information. If the model is wrong because it doesn’t know your data, your post-cutoff facts, your proprietary context — that’s a retrieval problem, not a training problem. Fine-tuning is the wrong tool for “make the model know this fact.” Use RAG.
- Climb to fine-tuning when prompt + RAG demonstrably fall short and you have data. Specifically: you need consistent tone or format across thousands of outputs, prompt instructions aren’t holding the line, RAG has nothing to retrieve (because the task is generative, not retrieval), and you have ≥1,000 high-quality input/output examples sitting in a usable format.
If you’ve skipped a rung, you’ve almost certainly mis-spent the budget. The pattern in the post-mortems is consistent: teams that fine-tuned without first exhausting prompt engineering ended up with a model that was different but not better, plus a maintenance burden that compounds with every base-model release.
What each rung actually costs
| Rung | Setup cost | Ongoing cost | Cost to reverse |
| --- | --- | --- | --- |
| Prompt engineering | effectively zero | minutes per iteration; standard per-call generation | a minute |
| RAG | days to weeks; hundreds to low-thousands of dollars one-time | fractions of a cent per query, plus generation | a sprint |
| Fine-tuning | weeks to months; thousands to tens of thousands of dollars | re-training on every base-model upgrade | writing off training, dataset curation, and evals |

Two patterns are worth pulling out of those numbers.
First, the cost gap between rungs is roughly an order of magnitude each step. Prompt engineering is effectively free. RAG is hundreds to low-thousands of dollars one-time plus per-call costs. Fine-tuning is thousands to tens of thousands plus ongoing maintenance. Climbing one rung when you didn’t need to is rarely catastrophic; climbing two is.
Second, the reversibility cost compounds the dollar cost. A bad prompt is replaced in a minute. A bad RAG architecture is replaced in a sprint. A bad fine-tune means writing off the training spend, the dataset curation work, and the eval harness — and often re-doing them on the base model you should have started with.
Tool use / function calling — solves a different problem
There’s a fourth lever you’ll hear about in the same conversations: tool use or function calling — where the model decides when to call an external function (a database query, a calculator, a web search, an API) and uses the result in its answer. Agent frameworks (LangGraph, AutoGen, Anthropic’s tool-use API, OpenAI’s Responses API) sit on top of this.
It’s not on this ladder because it solves a different problem. The three techniques above are about getting better answers. Tool use is about taking actions or fetching live data. You can — and often should — combine them: a prompt-engineered system that uses tools for calculations and RAG for grounding. But picking tool use instead of the three rungs above is a category error; it doesn’t replace any of them.
If your problem is “the model needs to take an action, check a live system, or do exact math” — that’s tool use. Treat it as orthogonal to this decision, not a fourth option on the ladder.
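For concreteness, a minimal sketch of that loop using OpenAI's chat-completions tool-calling shape. The get_order_status function, its stubbed result, and the model name are placeholders, and other vendors' APIs differ in detail:

```python
# A minimal tool-use loop: the model asks for a function, we run it,
# and the model answers using the result. The tool itself is a stub.

import json
from openai import OpenAI

client = OpenAI()

def get_order_status(order_id: str) -> str:
    return json.dumps({"order_id": order_id, "status": "shipped"})  # stub

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the live status of an order by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

messages = [{"role": "user", "content": "Where is order 1234?"}]
resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
msg = resp.choices[0].message

if msg.tool_calls:  # the model chose to call the function
    call = msg.tool_calls[0]
    result = get_order_status(**json.loads(call.function.arguments))
    messages += [msg, {"role": "tool", "tool_call_id": call.id, "content": result}]
    # Second call: the model now answers using the tool's result.
    resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)

print(resp.choices[0].message.content)
```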
The failure modes that bite teams in production
Prompt engineering fails when the prompt has to encode patterns that don’t fit into a block of few-shot examples. If your output spec is 8,000 tokens of edge cases and you keep adding more, that’s a signal the prompt has hit its ceiling. Long, brittle prompts also carry their own bill of materials: every token is re-sent on every call, and it adds up. (See Tokens, context windows, and what they cost for the arithmetic.)
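Back-of-envelope, assuming a hypothetical per-token price and call volume (plug in your own):

```python
# Rough arithmetic for the "every token re-sent" bill.
# The price and call volume are assumed figures, not quotes.

PROMPT_TOKENS = 8_000    # the bloated output spec from above
PRICE_PER_MTOK = 3.00    # $ per million input tokens (assumed)
CALLS_PER_DAY = 50_000   # assumed workload

daily = PROMPT_TOKENS * CALLS_PER_DAY * PRICE_PER_MTOK / 1_000_000
print(f"${daily:,.0f}/day just to re-send the prompt")  # $1,200/day
```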
RAG fails when retrieval misses the right passage. The 2026 consensus from vendor write-ups and academic post-mortems is consistent — when RAG goes wrong in production, it’s almost always a retrieval failure (right answer / wrong passage, multi-hop questions, lost context from poor chunking), not a generation failure. RAG also fails when the task is generative and there’s nothing to retrieve — “write me a marketing email in our brand voice” is not a RAG problem.
Fine-tuning fails when the dataset is small, noisy, or unrepresentative of production traffic; the base model upgrades and the fine-tune drifts; or the team can’t measure improvement (no eval harness). Fine-tuning produces a model that’s different; whether it’s better is a question only an eval harness can answer, and most teams don’t build one before training. The most common production outcome of a rushed fine-tune is a model that’s slightly worse on most tasks and slightly better on the narrow training distribution — a net regression.
If you've decided which rung you're on
- If the answer is prompt engineering, your next read isn’t this playbook — it’s the vendor’s own prompt engineering guide. Anthropic and OpenAI both maintain good ones (linked in Sources below). Build a small set of test cases first; iterate against them.
- If the answer is RAG, start at RAG explained without acronyms for the mechanics, then Build a private knowledge base your team can search for the practical setup.
- If the answer is fine-tuning, the first homework isn’t training — it’s building an evaluation harness on your existing prompt-engineered baseline. If you can’t measure whether the fine-tune improved things, the training run is a guess. The OpenAI fine-tuning guide and the Hugging Face PEFT library (both linked) are the right starting points for proprietary and open-weights routes respectively.
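A harness can start very small. A minimal sketch, where TEST_CASES and the ask_model() stub are placeholders for real traffic samples and whichever pipeline (prompt, RAG, or fine-tune) is under test:

```python
# A minimal eval harness: a fixed set of test cases scored the same way
# before and after any change. Everything here is a placeholder to swap out.

TEST_CASES = [
    {"input": "I was charged twice this month.", "expected": "billing"},
    {"input": "The export button does nothing.", "expected": "bug"},
    # ...in practice, dozens to hundreds of cases drawn from real traffic
]

def ask_model(text: str) -> str:
    # Placeholder: wire in the pipeline under test.
    return "billing" if "charged" in text else "bug"

def run_eval() -> float:
    correct = sum(
        ask_model(case["input"]).strip().lower() == case["expected"]
        for case in TEST_CASES
    )
    return correct / len(TEST_CASES)

print(f"accuracy: {run_eval():.0%}")  # run on the baseline, again after training
```

If the score doesn’t move between the prompt-engineered baseline and the fine-tune, the training run bought nothing.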
The cost framing in Tokens, context windows, and what they cost is the companion to this piece — once you’ve picked the technique, that’s how to budget the workload.
FAQ
Why is prompt engineering always the first rung if it sounds the most amateur?
Because most teams skip it and pay for the consequence. Prompt engineering is the baseline — without a strong prompt-engineered version of the task, you cannot tell whether RAG or fine-tuning improved anything. It also has the highest cost-adjusted leverage: a good prompt iteration costs minutes and frequently moves the outcome from "this is failing" to "this is working." The teams that look most sophisticated about AI are usually the ones who got a lot out of prompts first.
Can RAG and fine-tuning be combined?
Yes, and at scale they often are. A fine-tuned model trained to produce a specific output format, drawing on retrieved passages for the actual facts, is a common production pattern. The order matters: build RAG first (you'll learn what your data actually looks like), then fine-tune on top once you've established that prompts alone can't carry the format or style. Combining them before you've proven RAG-alone is insufficient just multiplies the maintenance surface.
Is LoRA / QLoRA 'real' fine-tuning?
Yes — and for most teams who do fine-tune in 2026, it's the only flavour worth considering. LoRA freezes the base model's weights and trains a small adapter layer; QLoRA does the same with the base model held in a quantised (lower-precision) form so it fits on cheaper hardware. The result is a fine-tune you can run for tens to low-thousands of dollars instead of five figures, swap per use case, and abandon cheaply if it doesn't work. See the Hugging Face PEFT library (linked in Sources) for the standard toolchain.
We have a brand-voice problem. Should we fine-tune?
Probably not yet. Brand-voice problems frequently dissolve with a sharper prompt — give the model five strong examples of the voice you want, a list of phrases to avoid, and an explicit description of register and reading level. If that doesn't get you 80% of the way, then a small fine-tune on a few hundred curated examples can push the last 20%. Teams that fine-tune for brand voice before exhausting prompts usually end up with a model that has a different voice but not the right one.
How long until the answer here changes?
This is a low-volatility piece. The relative ordering of the three rungs — prompt first, RAG when information is missing, fine-tune when both fall short — has been stable since 2023 and the underlying economics still support it in 2026. What changes faster is the costs (model prices drop, context windows grow) and the tooling (PEFT methods improve, RAG frameworks mature). The decision rule will outlast the specific numbers.