If you’ve tried to budget an AI feature and felt like the vendor pricing pages were written for a different audience, you’re not wrong — they were. This piece is the translation layer. By the end, you’ll be able to look at any model’s pricing page and produce a realistic estimate of what a workload will cost, without a calculator and without help from your engineer.
Two concepts explain almost everything: the token (the chunks of text an AI reads and writes — roughly 3/4 of a word per token) and the context window (how much text the model can consider at once). Once those are clear, the pricing math is arithmetic.
The unit you're billed in
A token is the smallest piece of text a language model processes. It’s not exactly a word, and it’s not exactly a character — it’s something in between, determined by the model’s tokenizer.
Useful rules of thumb for English:
- 1 token ≈ ¾ of a word, or about 4 characters
- 100 tokens ≈ 75 words ≈ a short paragraph
- 1,000 tokens ≈ 750 words ≈ a long page
- 100,000 tokens ≈ a 250-page book
For most languages other than English, the ratio is worse — non-English text often takes 1.5–3× more tokens to express the same content. A consequence: serving a multilingual user base costs more per question than the English-only equivalent.
You can see this for yourself by pasting text into OpenAI’s tokenizer — useful for sanity-checking estimates and spotting where token counts surprise you.
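If you'd rather check from code, OpenAI's open-source tiktoken library exposes the same tokenizers. A minimal sketch (install with `pip install tiktoken`; the encoding name is OpenAI-specific, so treat counts as approximate for other vendors):

```python
import tiktoken

# o200k_base is the encoding used by recent OpenAI models;
# other vendors' tokenizers will produce different counts.
enc = tiktoken.get_encoding("o200k_base")

text = "The quick brown fox jumps over the lazy dog."
tokens = enc.encode(text)
print(len(tokens))  # ~10 tokens for 9 words: close to the 3/4-word rule of thumb
```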
The model's working memory
A context window is the maximum number of tokens a model can hold in mind at once — your prompt, the model’s output, any retrieved documents, conversation history. Everything that happens in one model call must fit in the window.
In May 2026, flagship models from all three major vendors offer context windows of 1 million tokens or more. Claude Sonnet 4.6, Claude Opus 4.7, Gemini 3.1 Pro, and Gemini 3 Flash Preview all sit at 1M; OpenAI’s GPT-5.5 reaches ~1.05M, with input-token prices doubling beyond a 272k-token threshold.
A 1M-token window is roughly the entire text of War and Peace plus a few hundred pages of footnotes. For most business uses, the window is no longer the binding constraint — you’ll hit cost or coherence issues before you fill it.
Two important caveats on long context windows:
- Attention degrades over very long contexts. The model’s ability to recall details from early in the prompt drops as the window fills. Vendor benchmarks show the degradation shrinking with each model generation, but it is not zero. Don’t paste a million tokens and expect the model to recall a single sentence near the start with perfect fidelity.
- You’re paying for the whole prompt every time. Input cost is per-token: if your prompt includes 500k tokens of context, you pay for 500k input tokens on every call. That adds up fast (a worked number follows this list).
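As a worked number, at an assumed $3 per million input tokens (the mid-tier baseline used throughout this piece):

```python
# Assumed rate: $3 per 1M input tokens (mid-tier flagship baseline).
INPUT_PRICE_PER_TOKEN = 3.00 / 1_000_000

context_tokens = 500_000                        # a half-window prompt
cost_per_call = context_tokens * INPUT_PRICE_PER_TOKEN
print(f"${cost_per_call:.2f} per call")         # $1.50
print(f"${cost_per_call * 1_000:.0f} per day")  # $1500 at 1,000 calls/day
```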
The split that matters more than people realise
Almost every LLM API has two prices: input (the tokens you send) and output (the tokens the model generates). Output tokens cost more than input tokens — typically 3–5× more.
Why? Output is where the model’s expensive computation happens. Output tokens are generated one at a time, and each one requires the model to attend to everything that came before it. Input tokens are closer to a one-time cost: the model reads them in a single pass, builds an internal representation, and can then produce many output tokens against that representation.
Practical implications:
- Long-input, short-output workloads are cheap. Asking a model to summarise a 100-page document into 200 words: most of the cost is the input pass; the output is small. This is the favourable shape.
- Short-input, long-output workloads are expensive. Asking a model to write a 10,000-word essay from a one-sentence brief: the input is tiny, but every output token is at the higher rate. Cost compounds with length.
- Conversation-style workloads compound. Each turn of a conversation includes the entire previous history as input. A long chat re-pays the input cost on every turn; by message 30, your input cost per question has multiplied substantially (the simulation below makes this concrete).
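A small simulation of the compounding, under illustrative assumptions: a 200-token system prompt, 50-token questions, 150-token answers, and the $3/$15-per-million baseline rates.

```python
# Illustrative simulation: every turn resends the full history as input.
IN_PRICE, OUT_PRICE = 3.00 / 1e6, 15.00 / 1e6  # assumed $/token rates

history = 200          # system prompt tokens (assumption)
total_cost = 0.0
for turn in range(1, 31):
    history += 50                       # user question joins the history
    input_cost = history * IN_PRICE     # whole history re-sent as input
    output_cost = 150 * OUT_PRICE       # the answer
    total_cost += input_cost + output_cost
    history += 150                      # answer joins the history too
    if turn in (1, 10, 30):
        print(f"turn {turn}: {history - 150} input tokens, "
              f"cumulative cost ${total_cost:.3f}")
# Turn 1 sends 250 input tokens; turn 30 sends 6,050 — a ~24x multiplier.
```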
What this costs in practice
The cache discount is worth knowing about: if you’re sending the same large context repeatedly (e.g. a system prompt with voice samples, a long document being asked many questions), most major vendors now offer prompt caching that reduces the per-token cost of the cached portion by 50–90%. For high-volume workloads, this is the difference between break-even and not.
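To see why, here is a sketch assuming a 50k-token shared document, 1,000 questions against it, and an illustrative 90% discount on cached input (actual discounts, rates, and mechanics vary by vendor):

```python
# Illustrative cache maths: same 50k-token document, 1,000 questions.
IN_PRICE = 3.00 / 1e6        # assumed $/input token
CACHE_DISCOUNT = 0.90        # assumed 90% off cached input tokens

doc_tokens, question_tokens, calls = 50_000, 200, 1_000

uncached = calls * (doc_tokens + question_tokens) * IN_PRICE
cached = calls * (doc_tokens * IN_PRICE * (1 - CACHE_DISCOUNT)
                  + question_tokens * IN_PRICE)
print(f"uncached: ${uncached:.0f}, cached: ${cached:.0f}")
# uncached: ~$151, cached: ~$16 (input side only)
```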
When pricing actually constrains your project
Four illustrative back-of-envelopes for typical workloads, using mid-tier flagship pricing (roughly $3 per million input tokens, $15 per million output tokens) as a baseline:
A small support assistant — 1,000 questions per day. Average input 1,500 tokens (system prompt + retrieved context), average output 250 tokens. Per question: ~$0.008. Daily: $8. Monthly: ~$240. Verdict: trivial.
An internal knowledge-base assistant for a 100-person company — 5,000 questions per day. Input 4,000 tokens (longer retrieved context), output 400 tokens. Per question: ~$0.018. Daily: $90. Monthly: ~$2,700. Verdict: still small in absolute terms, but worth a line in the plan; at higher volumes it becomes a real budget item.
A document-processing pipeline — 100,000 PDFs per month at 5 pages each. Input ~2,500 tokens per page, output ~500 tokens. Per page: ~$0.015. Per document: ~$0.075. Monthly: $7,500. Verdict: a real line item; worth optimising or moving to specialised OCR (see Extract structured data from PDFs at scale).
An agent doing multi-step reasoning — 10,000 sessions per day, 8 turns each. This is where it gets expensive. Each turn re-pays input cost; output tokens accumulate. Conservative estimate: $0.50–$2 per session. Daily: $5,000–$20,000. Monthly: $150k–$600k. Verdict: agents at scale are a different cost regime; budget accordingly or pick smaller models for sub-tasks.
Pattern: for most “AI as a tool” use cases, the bill is small. For “AI in the hot path of a high-volume product,” the bill becomes a real engineering constraint and shapes what’s worth building.
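All four verdicts come from the same two-line formula. Here it is as reusable code, with the baseline rates stated as assumptions:

```python
# Back-of-envelope model behind the examples above.
IN_PRICE, OUT_PRICE = 3.00 / 1e6, 15.00 / 1e6  # assumed baseline $/token

def cost_per_request(input_tokens: int, output_tokens: int) -> float:
    """Cost of one call at the assumed input/output rates."""
    return input_tokens * IN_PRICE + output_tokens * OUT_PRICE

# Support assistant: ~$0.008/question, ~$8/day at 1,000 questions.
print(cost_per_request(1_500, 250) * 1_000)   # ~8.25

# Knowledge-base assistant: ~$0.018/question, ~$90/day at 5,000 questions.
print(cost_per_request(4_000, 400) * 5_000)   # ~90.0
```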
Where pricing assumptions break
- Vendor pricing pages quote per-million-token rates, but your invoice tracks what you actually use. A 30% increase in average output length on a high-volume workload shows up as a 30% increase in the monthly bill. Monitor your average output length, not just your call count.
- Reasoning modes (o-series, Claude extended thinking, Gemini Deep Think) bill the model’s hidden “thinking” tokens as output. These can run to 10–50× more output tokens than the user-visible response. Worth it for hard problems; expensive for routine ones.
- Image and audio inputs are tokenised differently. Each image is roughly 1,000–2,000 tokens depending on resolution; audio input is also priced per-token, with conversion ratios that vary by vendor. Check the specific vendor docs for multimodal pricing.
- Caching only helps if your context is genuinely repeated. A new context per call doesn’t benefit. Architect your prompts so the long, unchanging part comes first (see the sketch after this list).
- Free tiers have hidden costs. “Free” usually means rate-limited and slow; the per-call cost is being subsidised. If you build a product on a free tier, your scaling story breaks at the moment it works.
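On the caching point above, here is what “long, unchanging part first” can look like. The file names and layout are hypothetical, and the exact caching mechanics (explicit cache markers vs automatic prefix matching) differ by vendor:

```python
# Hypothetical request layout for cache-friendly prompts.
# Stable content goes first so the cached prefix matches across calls;
# anything per-user or per-call goes last.
STYLE_GUIDE = open("style_guide.md").read()      # large, never changes
POLICY_DOC = open("support_policies.md").read()  # large, rarely changes

def build_messages(user_question: str) -> list[dict]:
    return [
        # Cacheable prefix: byte-identical on every call.
        {"role": "system", "content": STYLE_GUIDE + "\n\n" + POLICY_DOC},
        # Variable suffix: changes per call, breaks the cache from here on.
        {"role": "user", "content": user_question},
    ]
```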
FAQ
Why does the same prompt sometimes use a different number of tokens?
Output is the variable part — the model decides how long to respond. Input is deterministic for a given tokenizer and prompt, but two different vendors will tokenize the same text into slightly different counts (their tokenizers differ). When you compare costs across vendors, compare the cost of the work, not the token count.
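You can see the drift without leaving one vendor's ecosystem: tiktoken ships two generations of OpenAI encodings, and they already disagree on the same text.

```python
import tiktoken

text = "Internationalisation considérée comme coûteuse."  # non-English skews counts
for name in ("cl100k_base", "o200k_base"):  # two OpenAI-era encodings
    enc = tiktoken.get_encoding(name)
    print(name, len(enc.encode(text)))
# The counts typically differ; a third vendor's tokenizer differs again.
```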
How do I keep costs down without giving up quality?
Three reliable moves: (1) use the smallest model that's good enough for the task — most workloads don't need flagship; (2) shorten the system prompt and only include retrieval results that are actually relevant; (3) cap output length explicitly with max_tokens and length instructions in the prompt. The combination usually halves cost without users noticing.
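Move (3) looks like this with the OpenAI Python client (the model name is illustrative; other vendors' SDKs expose an equivalent cap):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative: smallest model that's good enough
    max_tokens=300,       # hard cap on billable output tokens
    messages=[
        {"role": "system", "content": "Answer in at most two short paragraphs."},
        {"role": "user", "content": "Summarise our refund policy."},
    ],
)
print(response.choices[0].message.content)
```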
Why is the consumer chat product flat-priced if the API is per-token?
Consumer plans are subsidised — vendors are buying market share at the consumer tier and making it back at the API and enterprise tiers. The same model accessed via API costs roughly what its tokens cost to produce; via consumer interface, you're getting it at a planned loss. This is one reason consumer plans have rate limits and the API does not.
If context windows are now huge, do I still need RAG?
Often yes, for cost reasons. A 1M-token window means you can paste a million tokens of context every call — but at ~$3 per million input tokens (mid-tier flagship), every call costs $3 just in input. RAG retrieves the relevant slice (typically 5–20k tokens) and pays cents instead of dollars. Use long-context for one-off analyses; use RAG for high-volume Q&A. See RAG explained without acronyms.
How do I forecast cost before I build?
Take your expected daily request count, estimate average input and output tokens per request (run a few real examples through the OpenAI tokenizer), and multiply by per-million-token prices. Multiply by 1.3–1.5 for safety margin (real workloads tend to surprise on the high side). Re-check after a week of real usage and recalibrate.
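As a one-screen function, with the safety margin built in (the default rates and margin are assumptions to override, not recommendations):

```python
def monthly_forecast(requests_per_day: int,
                     in_tokens: int, out_tokens: int,
                     in_price_per_m: float = 3.00,   # assumed baseline rates
                     out_price_per_m: float = 15.00,
                     margin: float = 1.4) -> float:
    """Rough monthly cost: per-request cost x volume x 30 days x margin."""
    per_request = (in_tokens * in_price_per_m
                   + out_tokens * out_price_per_m) / 1e6
    return per_request * requests_per_day * 30 * margin

# The knowledge-base assistant from earlier, with headroom:
print(f"${monthly_forecast(5_000, 4_000, 400):,.0f}")  # ~$3,780 vs the raw $2,700
```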