A document AI service reads PDFs, scans, and images — pulling fields out of invoices, forms, contracts, and receipts — so the data can be written into a database without anyone typing. The pieces that do this break into two families.
Specialist OCR services (AWS Textract, Google Document AI, Azure Document Intelligence) are purpose-built for documents. They include features like key-value detection, table extraction, and signature detection. General-purpose vision LLMs (OpenAI Vision, Claude Vision, Gemini Vision) are the same large language models you use for chat, given the ability to see images. They handle novel layouts well; they’re less specialised for high-volume back-office work. Underneath both sits an open-weights DIY tier (Donut, LayoutLMv3, the Qwen 2.5 VL family) — software you run yourself on your own hardware.
The extract-structured-data-from-PDFs piece covers the model-level question. This piece is one layer up: when you need a service — queued processing, batch APIs, schema-aware extraction, audit logs, an SLA — how do the offerings compare, and at what volume does it become cheaper to build the DIY pipeline yourself? Each is the right answer for a specific workload. The decision rule depends on volume, document type, and how much engineering capacity you have to spend.
The comparison matrix
| | AWS Textract | Google Document AI | OpenAI Vision | Azure AI Document Intelligence | Donut (DIY) |
|---|---|---|---|---|---|
| Architecture | Specialist OCR + structured extraction | Specialist OCR + processor library + Custom Extractor (auto-ML) | General-purpose vision LLM with prompt-driven extraction | Specialist OCR + prebuilt + custom models | Open-weights end-to-end document understanding (no separate OCR) |
| Best at — printed text on standard forms | Excellent — the canonical strength | Excellent | Strong; occasionally over-interprets | Excellent | Strong on documents similar to training data |
| Best at — handwriting | Strong (English); weaker on other scripts | Strong; multi-language handwriting improving | Good for clear handwriting; fails on messy | Strong | Variable — depends entirely on training data |
| Best at — tables | Strong — table extraction is a featured product | Strongest — Document AI table extraction is the market leader | Variable — prompt engineering matters; can hallucinate rows | Strong — prebuilt table model is reliable | Limited unless specifically trained on tabular documents |
| Best at — signatures, checkboxes, key-value pairs | Strong — built-in detection | Strong — built-in detection | Variable — depends on prompt; less reliable than specialist | Strong — built-in | Requires custom training |
| Multilingual coverage | 80+ languages (basic OCR); fewer for full structured extraction | 200+ languages OCR; structured-extraction support narrower | Strong across all major languages | ~25 languages with full feature support; OCR broader | Depends on the model variant; multilingual open-weights variants exist |
| Custom-model training | Yes — Textract Adapters | Yes — Document AI Custom Extractor (Vertex AutoML) | Yes — vision fine-tuning available (rolled out 2024); ecosystem and tooling less mature than text fine-tuning | Yes — Custom Neural / Template models | Yes — full fine-tuning on your own data (significant engineering work) |
| Output format | JSON (text, tables, key-value, signatures) | JSON (entities, key-value, tables, normalized values) | Whatever you prompt for (markdown, JSON, structured) | JSON with confidence scores | Whatever your pipeline produces; full control |
| Batch / async processing | Yes — StartDocumentAnalysis for large batches | Yes — Document AI processors handle batches | No native async; build your own queue | Yes — async analysis API | Yes — runs in your own batch system |
| Pricing — typical document | $1.50 / 1,000 pages (text); $50 / 1,000 (forms + tables) | $10 / 1,000 pages (basic); $25–$65 / 1,000 (enterprise processors) | ~$5 / 1,000 images at typical resolution | $1.50 / 1,000 pages (read); $10 / 1,000 (prebuilt / general); $50 / 1,000 (custom) | ~$0 incremental at scale (hardware amortised) |
| Data residency / regional support | Multi-region AWS; explicit residency | Multi-region Google; explicit residency | US-based; EU compliance via Enterprise | Multi-region Azure; explicit residency | Wherever you host it |
| Where it lives in your stack | AWS-native; tight integration with S3 / Lambda | Google-Cloud-native; integrates with BigQuery / Vertex | API-only; integrates anywhere with HTTP | Azure-native; integrates with Logic Apps / Power Automate | Your servers, your Kubernetes, your responsibility |
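The "JSON (text, tables, key-value, signatures)" output the specialist services return is a graph of blocks rather than a flat dictionary, so a little post-processing is always needed. A minimal sketch of pairing keys with values in a Textract-style response; the field names follow Textract's documented block model, but the sample response here is a hand-made stand-in, not real service output:

```python
def extract_key_values(blocks):
    """Pair KEY_VALUE_SET blocks from a Textract-style response.

    Textract links a KEY block to its VALUE block, and both to their
    child WORD blocks, via Relationships; this walks that graph.
    """
    by_id = {b["Id"]: b for b in blocks}

    def text_of(block):
        words = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                words += [by_id[i]["Text"] for i in rel["Ids"]]
        return " ".join(words)

    pairs = {}
    for block in blocks:
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            for rel in block.get("Relationships", []):
                if rel["Type"] == "VALUE":
                    for vid in rel["Ids"]:
                        pairs[text_of(block)] = text_of(by_id[vid])
    return pairs


# Hand-made stand-in for a fragment of a real response.
sample = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Invoice #"},
    {"Id": "w2", "BlockType": "WORD", "Text": "INV-0042"},
]

print(extract_key_values(sample))  # {'Invoice #': 'INV-0042'}
```

The same walk, with different block-type names, applies to Document AI entities and Azure's key-value pairs.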
What to actually use
For under ~10,000 pages/month with mostly standard layouts — OpenAI Vision or Google Document AI. OpenAI Vision wins on flexibility (prompt-driven extraction handles novel layouts without setup); Document AI wins on accuracy for tabular and form-heavy documents. The choice often comes down to which cloud your stack already lives in.
For high-volume invoice or form processing at scale — AWS Textract or Azure Document Intelligence. The per-page pricing on specialist OCR is meaningfully cheaper than vision LLMs at volume, and the structured outputs (key-value pairs, confidence scores, signature detection) are more reliable for back-office automation. Custom adapters / templates earn their keep on your specific document types.
Over 100,000 pages/month, or for sensitive documents — DIY with Donut (or LayoutLMv3, or a fine-tuned Qwen 2.5 VL). The hardware + engineering investment pays back at this volume, the data never leaves your network, and the accuracy on your specific document types can exceed any general-purpose service after fine-tuning. Not the right answer at low volume — the ops burden dominates the savings.
For documents with novel layouts that don’t recur — OpenAI Vision or Claude Vision. The general-purpose vision LLMs handle “I have one weird PDF” better than specialist services because they don’t need a model trained for the specific layout. Cost per page is higher; setup time is zero.
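"Prompt-driven extraction" concretely means building one request per document: image plus instructions, no template. A sketch of assembling such a request for the chat-completions image-input format; the data-URL shape follows the documented API, but the field list and prompt wording are illustrative, and the model name you pass alongside these messages (e.g. "gpt-4o") is an assumption to check against current docs:

```python
import base64
import json


def build_extraction_messages(image_bytes: bytes, fields: list[str]) -> list[dict]:
    """Build a chat-completions messages payload asking a vision LLM
    to return only the named fields as JSON. No template, no training:
    this is the whole setup for a one-off, novel layout."""
    prompt = (
        "Extract exactly these fields from the document image and "
        f"return them as a single JSON object: {json.dumps(fields)}. "
        "Use null for any field not present. Return JSON only."
    )
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }]


messages = build_extraction_messages(b"\x89PNG...", ["invoice_number", "total"])
```

You would pass `messages` to the client's chat-completions call and parse the JSON reply, keeping the verification step the failure-modes section below insists on for numbers.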
What you'll actually pay
The pricing comparison is asymmetric: specialist OCR services are cheaper per page at high volume; vision LLMs are cheaper to set up and more flexible at low volume. The decision often hinges on whether your documents recur (specialist wins) or vary (LLMs win).
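The matrix's list prices make the arithmetic concrete. A quick sketch using the per-1,000-page figures from the table above; the vision-LLM figure is approximate, and real LLM cost also scales with image resolution and output tokens, so treat these as order-of-magnitude numbers:

```python
# Monthly bill (USD) at the matrix's list prices, per 1,000 pages.
PRICE_PER_1K = {
    "textract_forms_tables": 50.0,   # Textract forms + tables tier
    "document_ai_enterprise": 25.0,  # low end of the $25-$65 range
    "openai_vision": 5.0,            # ~$5 / 1,000 images (approximate)
}


def monthly_cost(pages_per_month: int) -> dict[str, float]:
    return {name: pages_per_month / 1000 * price
            for name, price in PRICE_PER_1K.items()}


# 100,000 pages/month: $5,000 (Textract forms), $2,500 (Doc AI), $500 (OpenAI)
for volume in (10_000, 100_000):
    print(volume, monthly_cost(volume))
```

Per-page price is only half the decision, per the recur-vs-vary rule above: a recurring layout amortises the specialist setup, a varying one wastes it.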
The volume + sensitivity threshold
The DIY path with Donut, LayoutLMv3, or a fine-tuned vision model is the right answer when one or more of these triggers fire:
- Very high volume. Above ~100,000 pages/month sustained, the hosted-service bill is a real line item ($1,500–$5,000/month at $15–$50/1,000 pages), and a single engineer with one A100 can match or beat that cost.
- Data sensitivity. Health records, legal documents, classified materials — the same constraints that justify a private RAG setup apply here. When no vendor's compliance posture is acceptable, processing has to stay in-house.
- Very specific document types. Invoices from your top three vendors that always have the same layout; forms your specific industry uses that aren’t in any general-purpose model’s training data. Fine-tuning on your own data produces accuracy ceilings the general services can’t match.
- Regulatory requirements that exclude vendor processing. Some jurisdictions and contracts simply forbid sending documents to a third-party cloud. The DIY path is the only path.
If none of these triggers fire, the hosted services are the right answer. The engineering investment of DIY is real — two engineers spending two quarters on it is not unusual — and the maintenance burden is permanent.
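The volume trigger can be sanity-checked with back-of-envelope numbers. A sketch comparing the hosted bill against a rough DIY run cost; every input here is an illustrative assumption (GPU rental rate, fully loaded engineer cost, fraction of their time), not a quoted price, so substitute your own figures:

```python
def diy_monthly_cost(gpu_monthly: float, engineer_monthly: float,
                     engineer_fraction: float) -> float:
    """Rough DIY run cost: amortised GPU plus the slice of an
    engineer's time spent keeping the pipeline alive.
    All three inputs are assumptions to replace with your own."""
    return gpu_monthly + engineer_fraction * engineer_monthly


def hosted_monthly_cost(pages: int, price_per_1k: float) -> float:
    return pages / 1000 * price_per_1k


# Illustrative: one A100 at ~$1,200/month rented, a $15k/month fully
# loaded engineer spending 20% of their time on ops.
diy = diy_monthly_cost(1200, 15_000, 0.20)    # 4200.0
hosted = hosted_monthly_cost(100_000, 50.0)   # 5000.0 at $50 / 1,000 pages
print(diy < hosted)
```

Under these assumptions DIY edges ahead right around the ~100,000 pages/month threshold named above; at 10,000 pages/month the same hosted tier costs $500 and DIY is plainly not worth it.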
The honest failure modes
Textract. US-centric in language quality and document conventions. Tables in non-Latin scripts are weaker. Handwriting outside English drops accuracy noticeably.
Google Document AI. Enterprise-priced relative to AWS for similar capability; the Custom Extractor is powerful but requires more setup than the comparable Textract Adapter or Azure custom model.
OpenAI Vision. Can hallucinate. It is a general-purpose LLM with image input — when asked to extract a table with seven rows, it will sometimes return six rows or eight rows that look plausible. The fact-verification step is essential for any production use; this is not a service to trust blindly on numbers.
Azure AI Document Intelligence. Narrower set of fully-supported languages than the others (currently ~25 with full feature support, more with OCR only); breadth is improving but lags Google.
DIY pipeline. All the ops burden you’d expect — model serving, GPU monitoring, retraining cadence, eval infrastructure. Not the right tool for a small team without dedicated engineering support; perfectly fine for a larger team that has those capabilities anyway.
Related work
For the model-level question (which model handles which document type), see Extract structured data from PDFs. For the DIY pipeline architecture beyond document AI specifically, see Build a private RAG with no third-party calls. For the vector-store side of the post-extraction pipeline, see Vector databases compared.
FAQ
What about LayoutLMv3 and other open-source models?
LayoutLMv3 (Microsoft) is the canonical open-source baseline for document understanding alongside Donut. It's a strong foundation for fine-tuning on specific document types but doesn't ship as a turnkey service — you build the inference pipeline, the post-processing, and the eval loop yourself. For teams committed to the DIY path, LayoutLMv3 and Donut are the two most-used starting points; the choice between them is workload-specific (LayoutLM excels on structured forms, Donut on end-to-end extraction).
Can I use Claude or Gemini Vision for this?
Yes — and increasingly people do. Claude Vision (Sonnet 4.6, Opus 4.7) and Gemini Vision (3.1 Pro) handle most document-AI workloads OpenAI Vision can. The matrix shows OpenAI Vision because it's the most-deployed vision LLM in this category, but the architectural and pricing story applies to all three frontier vision LLMs. Pick the one whose pricing and integration match your stack.
How do these handle forms with checkboxes and signatures?
Specialist services (Textract, Document AI, Azure) have built-in detection for checkboxes and signatures and return structured fields like "signature_present: true" alongside the text content. Vision LLMs (OpenAI, Claude, Gemini) can detect these but rely on prompt engineering to surface them reliably. For high-volume signature verification or compliance-grade checkbox extraction, specialist services are the right tool; for one-off or low-volume needs, vision LLMs work.
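In Textract-style output, checkboxes arrive as dedicated selection blocks rather than text, which is what makes them compliance-grade: there is no prompt to get wrong. A minimal sketch; the block-type and status names follow Textract's documented response shape, but the sample is hand-made, not real service output:

```python
def checkbox_states(blocks: list[dict]) -> list[bool]:
    """Collect checkbox states from Textract-style SELECTION_ELEMENT
    blocks, which carry a SelectionStatus of SELECTED / NOT_SELECTED."""
    return [b["SelectionStatus"] == "SELECTED"
            for b in blocks if b.get("BlockType") == "SELECTION_ELEMENT"]


sample = [
    {"BlockType": "SELECTION_ELEMENT", "SelectionStatus": "SELECTED"},
    {"BlockType": "WORD", "Text": "I agree"},
    {"BlockType": "SELECTION_ELEMENT", "SelectionStatus": "NOT_SELECTED"},
]

print(checkbox_states(sample))  # [True, False]
```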
What's the multilingual quality drop?
All hosted services support OCR in 50+ languages, but full structured extraction (tables, forms, key-value pairs) typically covers a narrower subset — usually 10–25 languages. Vision LLMs handle multilingual content more uniformly because the multilinguality lives in the underlying LLM. For non-English structured extraction, benchmark on your specific language and document type before committing.
Does fine-tuning help, and on which of these?
Textract Adapters, Google's Document AI Custom Extractor, and Azure's Custom Models are all fine-tuning paths on hosted services — each adds 10–20% accuracy on your specific document types with ~50–500 labelled examples. OpenAI Vision supports fine-tuning (introduced in 2024), but the tooling around it is less mature than for text fine-tuning, so most teams still adapt vision workflows via prompt engineering and post-processing rather than full fine-tunes. DIY pipelines support full fine-tuning at the highest accuracy ceiling but require the most engineering work. For most teams, hosted-service adapters / custom models are the right balance of effort and accuracy gain.