Cyberax AI Playbook

Document AI services compared

A **document AI service** reads invoices, contracts, forms, and other documents — pulling out the fields you need so the data can flow into a database or accounting system. This piece compares AWS Textract, Google Document AI, OpenAI Vision, Azure Document Intelligence, and a do-it-yourself pipeline with open-weights models — what each is good for at scale, what each costs per page, and where each breaks on real documents.

At a glance · Last verified May 2026
Problem solved · Pick a document AI service for invoice processing, form extraction, contract review, or similar workloads — and recognise when an open-weights DIY pipeline saves money over hosted services
Best for · Ops leads, engineers building document pipelines, and finance/AP teams evaluating automation
Tools · AWS Textract, Google Document AI, OpenAI Vision, Azure AI Document Intelligence, Donut
Difficulty · Intermediate
Cost · $1.50–$50 per thousand pages (hosted) → ~$0 incremental on a DIY pipeline (after hardware + engineering investment)

A document AI service reads PDFs, scans, and images — pulling fields out of invoices, forms, contracts, and receipts — so the data can be written into a database without anyone typing. The tools that do this break into two families.

Specialist OCR services (AWS Textract, Google Document AI, Azure Document Intelligence) are purpose-built for documents. They include features like key-value detection, table extraction, and signature detection. General-purpose vision LLMs (OpenAI Vision, Claude Vision, Gemini Vision) are the same large language models you use for chat, given the ability to see images. They handle novel layouts well; they’re less specialised for high-volume back-office work. Underneath both sits an open-weights DIY tier (Donut, LayoutLMv3, the Qwen 2.5 VL family) — software you run yourself on your own hardware.

The extract-structured-data-from-PDFs piece covers the model-level question. This piece is one layer up: when you need a service — queued processing, batch APIs, schema-aware extraction, audit logs, an SLA — how do the offerings compare, and at what volume does it become cheaper to build the DIY pipeline yourself? Each is the right answer for a specific workload. The decision rule depends on volume, document type, and how much engineering capacity you have to spend.

Side by side

The comparison matrix

| | AWS Textract | Google Document AI | OpenAI Vision | Azure AI Document Intelligence | Donut (DIY) |
|---|---|---|---|---|---|
| Architecture | Specialist OCR + structured extraction | Specialist OCR + processor library + Custom Extractor (auto-ML) | General-purpose vision LLM with prompt-driven extraction | Specialist OCR + prebuilt + custom models | Open-weights end-to-end document understanding (no separate OCR) |
| Best at — printed text on standard forms | Excellent — the canonical strength | Excellent | Strong; occasionally over-interprets | Excellent | Strong on documents similar to training data |
| Best at — handwriting | Strong (English); weaker on other scripts | Strong; multi-language handwriting improving | Good for clear handwriting; fails on messy | Strong | Variable — depends entirely on training data |
| Best at — tables | Strong — table extraction is a featured product | Strongest — the market leader | Variable — prompt engineering matters; can hallucinate rows | Strong — prebuilt table model is reliable | Limited unless specifically trained on tabular documents |
| Best at — signatures, checkboxes, key-value pairs | Strong — built-in detection | Strong — built-in detection | Variable — depends on prompt; less reliable than specialist | Strong — built-in | Requires custom training |
| Multilingual coverage | 80+ languages (basic OCR); fewer for full structured extraction | 200+ languages OCR; structured-extraction support narrower | Strong across all major languages | ~25 languages with full feature support; OCR broader | Depends on the model variant; multilingual open-weights variants exist |
| Custom-model training | Yes — Textract Adapters | Yes — Custom Extractor (Vertex AutoML) | Yes — vision fine-tuning (rolled out 2024); ecosystem and tooling less mature than text fine-tuning | Yes — Custom Neural / Template models | Yes — full fine-tuning on your own data (significant engineering work) |
| Output format | JSON (text, tables, key-value, signatures) | JSON (entities, key-value, tables, normalized values) | Whatever you prompt for (markdown, JSON, structured) | JSON with confidence scores | Whatever your pipeline produces; full control |
| Batch / async processing | Yes — StartDocumentAnalysis for large batches | Yes — processors handle batches | No native async; build your own queue | Yes — async analysis API | Yes — runs in your own batch system |
| Pricing — typical document | $1.50 / 1,000 pages (text); $50 / 1,000 (forms + tables) | $10 / 1,000 pages (basic); $25–$65 / 1,000 (enterprise processors) | ~$5 / 1,000 images at typical resolution | $1.50 / 1,000 (read); $10 / 1,000 (prebuilt / general); $50 / 1,000 (custom) | ~$0 incremental at scale (hardware amortised) |
| Data residency / regional support | Multi-region AWS; explicit residency | Multi-region Google; explicit residency | US-based; EU compliance via Enterprise | Multi-region Azure; explicit residency | Wherever you host it |
| Where it lives in your stack | AWS-native; tight integration with S3 / Lambda | Google-Cloud-native; integrates with BigQuery / Vertex | API-only; integrates anywhere with HTTP | Azure-native; integrates with Logic Apps / Power Automate | Your servers, your Kubernetes, your responsibility |
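The batch row is worth making concrete. A minimal sketch of Textract's async pattern with boto3; the bucket and file names are placeholders, and production code would subscribe to the SNS completion topic rather than poll:

```python
import time

import boto3

textract = boto3.client("textract")

# Kick off an async job; Textract reads the document directly from S3.
# "invoices-inbox" and the object key are hypothetical placeholders.
job = textract.start_document_analysis(
    DocumentLocation={"S3Object": {"Bucket": "invoices-inbox",
                                   "Name": "acme-2026-04.pdf"}},
    FeatureTypes=["FORMS", "TABLES"],  # key-value pairs + table extraction
)
job_id = job["JobId"]

# Poll until the job finishes. Production code uses the SNS notification
# channel instead, and pages through large results with NextToken.
while True:
    result = textract.get_document_analysis(JobId=job_id)
    if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(5)

# Blocks carry the structured output: LINE, KEY_VALUE_SET, TABLE, CELL...
for block in result["Blocks"]:
    if block["BlockType"] == "KEY_VALUE_SET":
        print(block.get("EntityTypes"), block.get("Confidence"))
```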
The honest one-liner

What to actually use

For under ~10,000 pages/month with mostly standard layouts — OpenAI Vision or Google Document AI. OpenAI Vision wins on flexibility (prompt-driven extraction handles novel layouts without setup); Document AI wins on accuracy for tabular and form-heavy documents. The choice often comes down to which cloud your stack already lives in.
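With OpenAI Vision, the entire setup is a prompt. A sketch with the openai Python SDK; the model name and the field list are illustrative, not a recommendation:

```python
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("invoice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # any current vision-capable model works
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract vendor_name, invoice_number, invoice_date, and "
                     "total as a JSON object. Use null for missing fields."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    response_format={"type": "json_object"},  # force parseable JSON output
)
print(response.choices[0].message.content)
```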

For high-volume invoice or form processing at scale — AWS Textract or Azure Document Intelligence. The per-page pricing on specialist OCR is meaningfully cheaper than vision LLMs at volume, and the structured outputs (key-value pairs, confidence scores, signature detection) are more reliable for back-office automation. Custom adapters / templates earn their keep on your specific document types.
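On the Azure path, the prebuilt invoice model is the usual entry point. A sketch using the azure-ai-formrecognizer SDK (the service's earlier package name); endpoint, key, and file name are placeholders:

```python
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

with open("invoice.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-invoice", document=f)
result = poller.result()

# Every field comes back with a confidence score, which is the basis
# for auto-accept vs human-review routing in back-office automation.
for name, field in result.documents[0].fields.items():
    print(name, field.value, round(field.confidence, 2))
```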

Over 100,000 pages/month, or for sensitive documents — DIY with Donut (or LayoutLMv3, or a fine-tuned Qwen 2.5 VL). The hardware + engineering investment pays back at this volume, the data never leaves your network, and the accuracy on your specific document types can exceed any general-purpose service after fine-tuning. Not the right answer at low volume — the ops burden dominates the savings.
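For a feel of the DIY tier, here is a minimal Donut inference sketch with Hugging Face transformers, using the public CORD receipt checkpoint. On your own documents you would swap in a fine-tuned checkpoint and its task prompt:

```python
import re

from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

ckpt = "naver-clova-ix/donut-base-finetuned-cord-v2"  # public receipt model
processor = DonutProcessor.from_pretrained(ckpt)
model = VisionEncoderDecoderModel.from_pretrained(ckpt)

image = Image.open("receipt.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Donut is end-to-end: a task prompt steers decoding; no separate OCR pass.
decoder_input_ids = processor.tokenizer(
    "<s_cord-v2>", add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
)

sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "")
sequence = sequence.replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop task token
print(processor.token2json(sequence))  # nested dict of extracted fields
```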

For documents with novel layouts that don’t recur — OpenAI Vision or Claude Vision. The general-purpose vision LLMs handle “I have one weird PDF” better than specialist services because they don’t need a model trained for the specific layout. Cost per page is higher; setup time is zero.

The numbers

What you'll actually pay

| Item | Figure |
|---|---|
| AWS Textract — DetectDocumentText | $1.50 / 1,000 pages |
| AWS Textract — AnalyzeDocument (forms + tables) | $50 / 1,000 pages (forms); $15 / 1,000 (queries) |
| Google Document AI — Document OCR processor | ~$10 / 1,000 pages (first 1M); $5 / 1,000 thereafter |
| Google Document AI — Custom Extractor | ~$30 / 1,000 pages prediction (first 1M); training extra |
| OpenAI Vision — typical document image | ~$0.005 per image at ~1,000–1,500 tokens — roughly $5 / 1,000 pages |
| Azure AI Document Intelligence — Read | $1.50 / 1,000 pages |
| Azure AI Document Intelligence — Layout / General / Prebuilt | $10–$50 / 1,000 pages depending on model |
| Break-even volume — DIY pipeline (one engineer, single A100) | ~50,000–100,000 pages/month sustained |
| Accuracy — standard printed text, specialist services | ~98–99% character-level on clear scans; lower on photos |
| Accuracy — tables, Document AI vs OpenAI Vision | Document AI ~95% structured-extraction accuracy on common templates; OpenAI Vision ~80–90%, varies by prompt |
| Latency — async batch (typical 1,000-page job) | 5–30 minutes across hosted services; depends on queue load |
| Latency — sync single-page extraction | ~1–3 seconds across services |

The pricing comparison is asymmetric: specialist OCR services are cheaper per page at high volume; vision LLMs are cheaper to set up and more flexible at low volume. The decision often hinges on whether your documents recur (specialist wins) or vary (LLMs win).
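To sanity-check the break-even row against your own numbers, the arithmetic fits in a few lines. Both inputs are assumptions (the $50 / 1,000 forms-and-tables tier, and roughly $2,500/month of amortised DIY cost); swap in your own quote and loaded cost:

```python
HOSTED_PER_1K_PAGES = 50.0     # $/1,000 pages, forms + tables tier (assumed)
DIY_FIXED_PER_MONTH = 2_500.0  # amortised GPU + hosting + slice of an engineer

def hosted_monthly_cost(pages_per_month: int) -> float:
    """What the hosted service bills per month at this volume."""
    return pages_per_month / 1_000 * HOSTED_PER_1K_PAGES

def breakeven_pages_per_month() -> int:
    """Volume at which the DIY fixed cost equals the hosted bill."""
    return int(DIY_FIXED_PER_MONTH / HOSTED_PER_1K_PAGES * 1_000)

print(breakeven_pages_per_month())   # 50,000 at these assumptions
print(hosted_monthly_cost(100_000))  # 5000.0 — the top of the range above
```

At these assumptions the crossover lands at the bottom of the 50,000–100,000 range; a cheaper hosted tier pushes the break-even higher, which is why the range is wide.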

When DIY wins

The volume + sensitivity threshold

The DIY path with Donut, LayoutLMv3, or a fine-tuned vision model is the right answer when one or more of these triggers fire:

  • Very high volume. Above ~100,000 pages/month sustained, the hosted-service bill is a real line item ($1,500–$5,000/month at $15–$50/1,000 pages), and a single engineer with one A100 can match or beat that cost.
  • Data sensitivity. Health records, legal documents, classified materials — the same constraints that justify a private RAG setup apply here. No vendor’s compliance posture is acceptable.
  • Very specific document types. Invoices from your top three vendors that always have the same layout; forms your specific industry uses that aren’t in any general-purpose model’s training data. Fine-tuning on your own data produces accuracy ceilings the general services can’t match.
  • Regulatory requirements that exclude vendor processing. Some jurisdictions and contracts simply forbid sending documents to a third-party cloud. The DIY path is the only path.

If none of these triggers fire, the hosted services are the right answer. The engineering investment of DIY is real — two engineers spending two quarters on it is not unusual — and the maintenance burden is permanent.
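Those triggers condense into a rule of thumb. A sketch in which the thresholds are the ones above, while the function name and flags are illustrative:

```python
def diy_is_justified(
    pages_per_month: int,
    sensitive_data: bool,
    fixed_recurring_layouts: bool,
    vendor_processing_forbidden: bool,
) -> bool:
    """Return True if any trigger from the list above fires."""
    if vendor_processing_forbidden:
        return True                    # regulatory exclusion: DIY is the only path
    return (
        pages_per_month >= 100_000     # sustained very high volume
        or sensitive_data              # private-RAG-style constraints
        or fixed_recurring_layouts     # fine-tuning beats general services here
    )

# Example: 120k pages/month of vendor invoices with stable layouts
assert diy_is_justified(120_000, False, True, False)
```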

Where each fails

The honest failure modes

Textract. US-centric in language quality and document conventions. Tables in non-Latin scripts are weaker. Handwriting outside English drops accuracy noticeably.

Google Document AI. Enterprise-priced relative to AWS for similar capability; the Custom Extractor is powerful but requires more setup than the comparable Textract Adapter or Azure custom model.

OpenAI Vision. Can hallucinate. It is a general-purpose LLM with image input — when asked to extract a table with seven rows, it will sometimes return six rows or eight rows that look plausible. The fact-verification step is essential for any production use; this is not a service to trust blindly on numbers.
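The cheapest verification step is arithmetic: reject any extraction whose line items fail to sum to the stated total. A sketch; the field names assume a hypothetical extraction schema, not any particular OpenAI output format:

```python
from decimal import Decimal

def totals_are_consistent(extracted: dict, tolerance: str = "0.01") -> bool:
    """Reject extractions whose line items don't sum to the stated total."""
    line_sum = sum(Decimal(str(item["amount"]))
                   for item in extracted["line_items"])
    stated_total = Decimal(str(extracted["total"]))
    return abs(line_sum - stated_total) <= Decimal(tolerance)

# Extractions that fail this check go to a human-review queue,
# never straight into the accounting system.
```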

Azure AI Document Intelligence. Narrower set of fully-supported languages than the others (currently ~25 with full feature support, more with OCR only); breadth is improving but lags Google.

DIY pipeline. All the ops burden you’d expect — model serving, GPU monitoring, retraining cadence, eval infrastructure. Not the right tool for a small team without dedicated engineering support; perfectly fine for a larger team that has those capabilities anyway.

What's next

Related work

For the model-level question (which model handles which document type), see Extract structured data from PDFs. For the DIY pipeline architecture beyond document AI specifically, see Build a private RAG with no third-party calls. For the vector-store side of the post-extraction pipeline, see Vector databases compared.

Common questions

FAQ

What about LayoutLMv3 and other open-source models?

LayoutLMv3 (Microsoft) is the canonical open-source baseline for document understanding alongside Donut. It's a strong foundation for fine-tuning on specific document types but doesn't ship as a turnkey service — you build the inference pipeline, the post-processing, and the eval loop yourself. For teams committed to the DIY path, LayoutLMv3 and Donut are the two most-used starting points; the choice between them is workload-specific (LayoutLM excels on structured forms, Donut on end-to-end extraction).

Can I use Claude or Gemini Vision for this?

Yes — and increasingly people do. Claude Vision (Sonnet 4.6, Opus 4.7) and Gemini Vision (3.1 Pro) handle most document-AI workloads OpenAI Vision can. The matrix shows OpenAI Vision because it's the most-deployed vision LLM in this category, but the architectural and pricing story applies to all three frontier vision LLMs. Pick the one whose pricing and integration match your stack.

How do these handle forms with checkboxes and signatures?

Specialist services (Textract, Document AI, Azure) have built-in detection for checkboxes and signatures, returning structured fields like "signature_present: true" alongside the text content. Vision LLMs (OpenAI, Claude, Gemini) can detect these but rely on prompt engineering to surface them reliably. For high-volume signature verification or compliance-grade checkbox extraction, specialist services are the right tool; for one-off or low-volume needs, vision LLMs work.
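On the specialist path, signature detection is one flag on the analyze call. A Textract sketch using the synchronous API on a single-page image:

```python
import boto3

textract = boto3.client("textract")

with open("signed-form.png", "rb") as f:
    resp = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["SIGNATURES"],  # built-in signature detection
    )

# Each detected signature comes back as its own block with a confidence.
signatures = [b for b in resp["Blocks"] if b["BlockType"] == "SIGNATURE"]
print({
    "signature_present": bool(signatures),
    "count": len(signatures),
    "confidences": [round(s["Confidence"], 1) for s in signatures],
})
```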

What's the multilingual quality drop?

All hosted services support OCR in 50+ languages, but full structured extraction (tables, forms, key-value pairs) typically covers a narrower subset — usually 10–25 languages. Vision LLMs handle multilingual content more uniformly because the multilinguality lives in the underlying LLM. For non-English structured extraction, benchmark on your specific language and document type before committing.

Does fine-tuning help, and on which of these?

Textract Adapters, Google's Document AI Custom Extractor, and Azure's Custom Models are all fine-tuning paths on hosted services — each adds 10–20% accuracy on your specific document types with ~50–500 labelled examples. OpenAI Vision supports fine-tuning (introduced in 2024), but the tooling around it is less mature than for text fine-tuning, so most teams still adapt vision workflows via prompt engineering and post-processing rather than full fine-tunes. DIY pipelines support full fine-tuning at the highest accuracy ceiling but require the most engineering work. For most teams, hosted-service adapters / custom models are the right balance of effort and accuracy gain.


Change history (1 entry)
  • 2026-05-11 Initial publication.