Cyberax AI Playbook
cyberax.com
Explainer · Foundations

AI hallucinations explained

Why LLMs — the technology behind ChatGPT, Claude, and Gemini — confidently produce wrong answers, how often it happens in 2026 (the honest answer is "more than vendors imply"), and the four mitigations that actually move the needle in production systems.

At a glance
Last verified · May 2026
Problem solved · Understand why LLMs hallucinate, how often it actually happens, and which mitigations are worth the engineering effort
Best for · Anyone deploying LLMs in production — founders, engineers, product managers, ops leads, compliance teams
Tools · ChatGPT, Claude, Gemini
Difficulty · Beginner

A lawyer asks ChatGPT for case law to support a brief. The AI returns three real-sounding citations to support the argument — author, court, year, and a quote. Two of them don’t exist. The lawyer files the brief; the judge checks; the lawyer gets sanctioned. This pattern has a name: hallucination. It’s the most expensive AI failure mode to discover in production, and the one vendors quietly downplay on their marketing pages.

This explainer covers three things. Why hallucinations happen (which determines what fixes work). How often they actually occur in May 2026 (the honest range, not the vendor-friendly cherry-picked number). And the four mitigations that meaningfully reduce them in production systems.

What it actually is

Why a model that 'predicts text' produces wrong answers

To understand hallucination you need one fact about how LLMs work: they generate text by predicting which token comes next, given everything that came before. They have no separate “is this true?” check. The same mechanism that produces “Paris is the capital of France” produces “Paris is the capital of Germany” — both are statistically plausible continuations of “Paris is the capital of.” One happens to be true; the other happens to be false; the model has no way to distinguish between them at generation time.
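
A toy sketch makes the point concrete. The probabilities below are invented for illustration; a real model scores its entire vocabulary the same way, and nowhere in the loop is there a step that asks whether the chosen continuation is true.

# Toy illustration of next-token prediction. The probabilities are invented;
# a real model computes them over its whole vocabulary, the same way for true
# and false continuations alike.
next_token_probs = {
    "France": 0.91,   # true continuation
    "Germany": 0.04,  # false but grammatically plausible continuation
    "Europe": 0.03,
    "fashion": 0.02,
}

prompt = "Paris is the capital of"
best = max(next_token_probs, key=next_token_probs.get)  # pick the likeliest token; no truth check anywhere
print(prompt, best)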

Hallucination, then, isn’t a bug or a malfunction. It’s the system working as designed, applied to a question where the most-probable continuation isn’t the most-true continuation. (For the underlying mechanics, see What an LLM actually does for a business.)

Three patterns this produces in practice:

  • Plausible-but-wrong facts. Made-up dates, citations, statistics, quotes. The model has seen enough examples of “Author published Title in Year” structure that it confidently generates one even when no real author published that real title in that real year.
  • Confidently-wrong reasoning. The model walks through a multi-step argument that looks valid and reaches a wrong conclusion — often by accepting a false premise as true and then reasoning correctly from that false premise.
  • Confabulation under pressure. When asked something the model doesn’t know, it produces an answer-shaped output rather than admitting uncertainty. This is the most expensive pattern in production because the answer is rarely flagged as low-confidence.

The numbers

How often this actually happens in May 2026

The honest answer requires a caveat: hallucination rate depends entirely on the benchmark. Different benchmarks measure different things, and the published numbers from vendors are usually picked from the most favourable test. The truthful picture combines several:

  • Vectara summarisation hallucination, best non-specialised models (GPT-5.4-nano, Gemini 2.5 Flash Lite): ~3%
  • Vectara summarisation hallucination, Claude family (Haiku 4.5 to Opus 4): ~10–12%
  • Vectara summarisation hallucination, typical flagship: 5–10%
  • AA-Omniscience knowledge benchmark, Claude Opus 4.7: 36% hallucination rate (with the strongest calibration of any frontier model)
  • AA-Omniscience knowledge benchmark, Gemini 3.1 Pro: top score; still hallucinates on a meaningful fraction of complex factual queries
  • Real-world interactions, 2025 peer-reviewed survey: ~31% of standard queries; rises to ~60% in complex / specialised domains
  • Improvement from RAG (retrieval grounding): ~71% reduction in hallucinations on average
  • Improvement from structured-output prompting: ~22 percentage points reduction (2025 Nature study)
  • Improvement from "are you hallucinating?" self-check prompt: ~17% reduction in subsequent responses
  • Improvement from multi-model ensemble (small model fact-checks larger model): catches 30–50% of hallucinations at minimal extra cost

The headline numbers vendors quote — “under 5% hallucination” — are usually from summarisation benchmarks where the model has the source material in its prompt. That’s the easy case. On knowledge benchmarks where the model has to recall facts from its training, the rates are several times higher across all flagship models. Plan for the harder case in production.

What works

The four mitigations that actually move the needle

In rough order of effectiveness:

1. Retrieval-augmented generation (RAG). By far the highest-leverage mitigation for any factual workload. Instead of asking the model to recall facts from training, give it the relevant facts at the time of the question. The model is far better at using facts in the prompt than recalling them from weights. RAG cuts hallucination rates by ~71% on average across published benchmarks. Build cost is real (see Build a private knowledge base and RAG explained without acronyms) but pays back fast for any production system.
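
A minimal sketch of the pattern, in Python. Real systems use embedding search over a vector store rather than the keyword overlap shown here, and the final call goes through whichever provider SDK you use (the commented-out call_model is a placeholder, not a real function).

# Minimal RAG sketch: retrieve relevant passages, then answer *from those passages*.
DOCS = [
    "Refund policy: customers may return items within 30 days of delivery.",
    "Shipping: orders over £50 ship free within the UK.",
    "Warranty: electronics carry a 12-month manufacturer warranty.",
]

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the question and keep the top k."""
    q_words = set(question.lower().split())
    return sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)[:k]

def grounded_prompt(question: str) -> str:
    """Put the retrieved facts in the prompt instead of relying on model memory."""
    context = "\n".join(retrieve(question, DOCS))
    return (
        "Answer using ONLY the sources below. If they don't contain the answer, "
        "say you don't know.\n\nSources:\n" + context + "\n\nQuestion: " + question
    )

print(grounded_prompt("How long do customers have to return an item?"))
# answer = call_model(grounded_prompt(...))  # call_model: placeholder for your provider SDK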

2. Forced citation + “don’t know” path. Instruct the model to cite the specific source for each claim and to say “I don’t know” when sources don’t support an answer. Surface the citations to users as clickable links. Two effects: the model is much less likely to fabricate when it must show its work, and users can verify before acting. This is the difference between a knowledge base and a confident-bullshit machine. Costs almost nothing.
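
A sketch of what that looks like in practice. The instruction wording and the [doc-N] tag format are illustrative, not a canonical prompt; the point is pairing a citation requirement with a cheap guard that refuses to pass along answers that arrive without sources.

import re

# Illustrative instruction: cite every claim, or decline explicitly.
SYSTEM_PROMPT = (
    "Answer only from the provided sources. After every claim, add its source "
    "tag, for example [doc-3]. If the sources do not support an answer, reply "
    "exactly: \"I don't know based on the available sources.\""
)

def has_citations(answer: str) -> bool:
    """True if the answer cites at least one source tag or explicitly declines."""
    return bool(re.search(r"\[doc-\d+\]", answer)) or "I don't know" in answer

print(has_citations("Returns are accepted within 30 days [doc-1]."))  # True
print(has_citations("Returns are accepted within 45 days."))          # False: route to review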

3. Structured outputs and tool use for anything deterministic. Don’t let the model do arithmetic, date math, code execution, or database lookups in its head. Use a tool — actual code that the model calls. Modern flagship models support structured outputs (JSON schemas) and tool use natively; both reduce specific classes of hallucination to zero. (Math errors disappear when the model uses a calculator instead of generating digits.)
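
A sketch of the tool-use pattern, with the deterministic work done in code. The tool-call format below is illustrative; each provider SDK has its own shape for tool definitions and calls, but the division of labour is the same.

# The model emits a structured request; your code does the deterministic work.
import json
from datetime import date

def days_between(start: str, end: str) -> int:
    """Deterministic date math the model should never do 'in its head'."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days

TOOLS = {"days_between": days_between}

# Pretend the model returned this instead of guessing the number itself.
tool_call = json.loads(
    '{"tool": "days_between", "arguments": {"start": "2026-01-01", "end": "2026-05-10"}}'
)

result = TOOLS[tool_call["tool"]](**tool_call["arguments"])
print(result)  # 129, computed rather than generated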

4. Validation layer — schema and business rules. Even with RAG and structured output, validate every result before it has consequences. Schema validation catches type-level wrongness (a date that isn’t a date); business-rule validation catches semantic wrongness (a date in the future that should be in the past, an invoice total that doesn’t sum from line items). Anything that fails validation goes to a human-review queue. This is the safety net that catches the cases the other three mitigations miss.
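
A sketch of that layer for an invoice-extraction workload. The field names and rules are illustrative; the structure (schema checks, then business rules, then a review queue for anything that fails) is the part that carries over.

from datetime import date

review_queue: list[dict] = []

def validate(extracted: dict) -> list[str]:
    """Return a list of problems; an empty list means the output passed both layers."""
    errors = []
    # Layer 1, schema: is the value even the right shape?
    try:
        issue_date = date.fromisoformat(extracted.get("issue_date", ""))
    except ValueError:
        errors.append("issue_date is not a valid date")
        issue_date = None
    # Layer 2, business rules: does the value make sense?
    if issue_date and issue_date > date.today():
        errors.append("issue_date is in the future")
    line_total = sum(item["amount"] for item in extracted.get("line_items", []))
    if abs(line_total - extracted.get("total", 0)) > 0.01:
        errors.append("total does not equal the sum of line items")
    return errors

output = {"issue_date": "2026-04-31", "total": 120.0, "line_items": [{"amount": 100.0}]}
problems = validate(output)
if problems:
    review_queue.append({"output": output, "problems": problems})  # human review, never auto-use
print(problems)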

What doesn’t work as well as people hope: telling the model “be accurate” or “don’t hallucinate” in the system prompt. The instruction lands as a vibes adjustment, not a structural change. It helps a little; the four mitigations above help much more.

Where it matters most

Knowing when to invest in mitigation

Not every workload needs the full mitigation stack. The investment scales with the cost of being wrong:

  • Low-stakes drafting (marketing copy, brainstorm, casual chat) → minimal mitigation needed. Editorial review catches most errors. The hallucination rate is a quality issue, not a safety issue.
  • Medium-stakes user-facing answers (help-desk assistant, FAQ bot, customer-facing chatbot) → RAG + forced citations are mandatory. Without them, users lose trust the first time the bot makes something up — and they will.
  • High-stakes factual workloads (legal extraction, medical information, financial calculations, compliance queries) → all four mitigations, plus human review on every output. Treat the model as a draft assistant that produces structured output; never as the deciding authority. (See When AI is the wrong tool for the categories where this gets dangerous.)
  • Anything with irreversible consequences (sending email, executing trades, deleting files, posting publicly) → never let the model take the action without an explicit human approval gate. Treat this as a separate, much higher-risk category.
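
To make the approval gate concrete, here is a minimal sketch. The send_email action and the proposed-action shape are placeholders for your own action layer; the point is that the model proposes and a human disposes.

def send_email(to: str, body: str) -> None:
    """Placeholder for a real, irreversible action."""
    print(f"sent to {to}: {body[:40]}...")

# The model may only *propose* the action, never execute it.
proposed_action = {"type": "send_email", "to": "customer@example.com", "body": "Hi, your refund has been processed..."}

decision = input(f"Approve {proposed_action['type']} to {proposed_action['to']}? [y/N] ")
if decision.strip().lower() == "y":
    send_email(proposed_action["to"], proposed_action["body"])
else:
    print("Action held; nothing was sent.")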

Common questions

FAQ

Why don't vendors just fix this?

They're trying — and have made real progress over the past two years. But hallucination is a fundamental property of how LLMs generate text, not a discrete bug. Each new model generation reduces the rate; none has eliminated it, and the research community broadly does not expect any model to eliminate it in the near term. Plan around it, not for its disappearance.

Will reasoning models (o-series, Claude extended thinking, Gemini Deep Think) hallucinate less?

Sometimes yes — they can think through whether their answer is consistent before committing. But reasoning models also confabulate at length: when they get a fact wrong, they can produce paragraphs of fluent justification for the wrong fact. The presence of a reasoning trace makes the wrong answer harder to spot, not easier. Net effect: somewhat lower hallucination rates with somewhat higher cost-per-detection.

Is one model meaningfully more honest than the others?

On most public benchmarks the gap between the top vendors shifts with each release. As of May 2026, lightweight OpenAI and Google models (GPT-5.4-nano, Gemini 2.5 Flash Lite) lead the Vectara summarisation leaderboard at ~3%; Claude models cluster around 10–12% on that benchmark. Gemini 3.1 Pro tops the AA-Omniscience knowledge benchmark; Claude Opus 4.7 has the strongest calibration (knowing what it doesn't know). For a specific use case, run your own evaluation against your own questions — vendor leaderboards don't predict your workload.
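
A minimal sketch of what "run your own evaluation" means: a small set of questions with answers you already know, scored per model. ask_model is a placeholder for your provider calls, and the scoring here is a crude substring match you would replace with something stricter.

def ask_model(model: str, question: str) -> str:
    raise NotImplementedError("wire this to your provider SDK")

EVAL_SET = [
    {"question": "What is our standard refund window?", "expected": "30 days"},
    {"question": "Which warranty applies to electronics?", "expected": "12-month"},
]

def score(model: str) -> float:
    """Fraction of your own questions the model answers acceptably."""
    correct = 0
    for case in EVAL_SET:
        answer = ask_model(model, case["question"])
        correct += case["expected"].lower() in answer.lower()  # crude match; tighten for real use
    return correct / len(EVAL_SET)

# for m in ("model-a", "model-b"):
#     print(m, score(m))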

How do I tell when a specific output is hallucinated?

Without retrieval grounding, you can't reliably — that's the whole problem. Three weak signals: (1) check whether named entities (people, places, citations) are real, (2) check whether numbers are internally consistent, (3) ask the model the same question in two different ways and see if the answers match. None are foolproof. The structural fix is to make the model show its sources via RAG, not to verify each output post-hoc.
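
A sketch of signal (3), the consistency check. ask_model is a placeholder for your provider SDK, and the word-overlap heuristic is deliberately crude; disagreement between phrasings is a prompt to look closer, not proof of hallucination.

def ask_model(question: str) -> str:
    raise NotImplementedError("wire this to your provider SDK")

def disagreement_flag(phrasing_a: str, phrasing_b: str) -> bool:
    """True when two phrasings of the same question get answers that barely overlap."""
    a = set(ask_model(phrasing_a).lower().split())
    b = set(ask_model(phrasing_b).lower().split())
    return len(a & b) / max(len(a | b), 1) < 0.3  # crude overlap heuristic

# disagreement_flag("When was the company founded?",
#                   "What year did the company start operating?")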

Can I just use a smaller model to check the bigger model?

Yes, and it works better than you'd expect. The Uncertainty-Aware Fusion technique published in 2025 shows that ensembling a small fact-checker with a large generator catches 30–50% of hallucinations at minimal additional cost. The pattern: large model generates the answer, small model rates the confidence, low-confidence outputs are flagged for review or regeneration. Increasingly common in production.
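
A sketch of that pipeline shape. Both model calls are placeholders for your provider SDKs, and the threshold is a tuning choice for your workload, not a published constant.

def generate(question: str) -> str:
    raise NotImplementedError("large model: drafts the answer")

def confidence(question: str, answer: str) -> float:
    raise NotImplementedError("small model: rates the draft, 0.0 to 1.0")

def answer_with_check(question: str, threshold: float = 0.7) -> dict:
    """Large model drafts; small model rates; low-confidence drafts get flagged."""
    draft = generate(question)
    if confidence(question, draft) < threshold:
        return {"answer": draft, "status": "flagged_for_review"}  # or regenerate
    return {"answer": draft, "status": "ok"}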

What's the worst case I should plan for?

Plausibly-formatted citations to non-existent papers, books, court cases, or research. This pattern is well-documented (lawyers have been sanctioned for filing AI-generated briefs with fabricated case citations). Any workflow that generates references must verify them — never let citations to specific sources reach a production document without a real check that those sources exist.
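
A sketch of a citation gate. The catalogue lookup is a placeholder; the important property is that it checks against a real index (court records, Crossref, your own document store) rather than asking another model, and that unverified citations never ship without human review.

def exists_in_catalogue(citation: str) -> bool:
    raise NotImplementedError("wire this to a real index, not another model call")

def gate_citations(citations: list[str]) -> tuple[list[str], list[str]]:
    """Split generated references into verified and unverified; only verified ones ship."""
    verified, unverified = [], []
    for c in citations:
        (verified if exists_in_catalogue(c) else unverified).append(c)
    return verified, unverified  # unverified items go to human review, not into the document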

Sources & references

Change history (1 entry)
  • 2026-05-10 Initial publication.