An engineering team adds three AI features over 18 months. They pick OpenAI for the chat assistant, Pinecone for the vector database (storage tuned for AI search), and Anthropic for the agent workflow. Each choice is reasonable in isolation. Then OpenAI raises prices, Anthropic deprecates the model the agent depends on, and a Pinecone outage takes search down for half a day. Migrating off any one of them is now a multi-week project. The cumulative dependency built up quietly, choice by choice, and became strategic exposure.
This piece is the honest analysis: where lock-in actually sits in AI stacks (four levels), which mitigation patterns realistically work (most are simpler than they look), and which lock-in is acceptable versus unacceptable for which workloads. The goal is not “avoid all lock-in,” but selective mitigation against the highest-impact risks.
Where lock-in actually lives
AI-stack lock-in lives at four levels:
- Model API lock-in. Your code calls a specific vendor’s API. The cost of switching is “rewrite the API calls” — usually modest engineering work, especially with abstraction libraries (LiteLLM, LangChain). Low to moderate severity.
- Capability lock-in. Some vendors offer features that don’t exist elsewhere — long-context, specific tool integrations, particular reasoning patterns. Migrating to a different vendor means losing the feature or building it yourself. Moderate severity, vendor-specific.
- Embedding-space lock-in. Embeddings stored in a vector store with one model are incompatible with another model’s embeddings. Re-embedding 100 million documents to switch is a real cost — engineering work plus the embedding cost itself (a back-of-envelope sketch follows this list). High severity at scale.
- Customisation lock-in. Fine-tuned models on proprietary infrastructure can’t transfer to another vendor. The fine-tuning work and compute are sunk costs; your training data remains portable, but the resulting weights do not. High severity if you’ve invested heavily.
Each level has different mitigation patterns; the cost of mitigation differs widely too.
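To make the embedding-space level concrete, here is a back-of-envelope sketch of a re-embedding migration. Every figure is an illustrative assumption (corpus shape, per-token price, sustainable throughput), not a quote; substitute your own numbers.

```python
# Back-of-envelope: what switching embedding models costs at scale.
# All figures are illustrative assumptions -- substitute your own.

DOCS = 100_000_000            # corpus size, documents
TOKENS_PER_DOC = 500          # average tokens per document (assumption)
PRICE_PER_M_TOKENS = 0.02     # USD per 1M tokens (assumed small-model rate)
THROUGHPUT_TPS = 50_000       # tokens/second sustainable under rate limits (assumption)

total_tokens = DOCS * TOKENS_PER_DOC
api_cost_usd = total_tokens / 1_000_000 * PRICE_PER_M_TOKENS
wall_clock_days = total_tokens / THROUGHPUT_TPS / 86_400

print(f"Tokens to re-embed:  {total_tokens:,}")
print(f"Embedding API cost:  ${api_cost_usd:,.0f}")
print(f"Wall-clock estimate: {wall_clock_days:.1f} days")
```

Under these assumptions the API line item (around $1,000) is the small part; the dominant costs are the multi-day pipeline run, rate-limit management, and re-validating retrieval quality against the new embedding space.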
What actually reduces lock-in risk
Model API abstraction layer. Use a library like LiteLLM, Vercel AI SDK, or LangChain that abstracts the vendor-specific API. The cost is small (a wrapper layer) and the benefit is large: switching the underlying model becomes a config change, not a rewrite. Apply this for any production AI workload.
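A minimal sketch of the pattern with LiteLLM (the model names are placeholders; any provider LiteLLM supports slots in the same way):

```python
import os
from litellm import completion  # pip install litellm

# The model is configuration, not code: switching vendors is an env-var change.
MODEL = os.environ.get("LLM_MODEL", "gpt-4o-mini")  # e.g. "claude-3-5-sonnet-20240620"

def summarise(text: str) -> str:
    """Vendor-agnostic call; LiteLLM translates to each provider's native API."""
    response = completion(
        model=MODEL,
        messages=[{"role": "user", "content": f"Summarise in two sentences:\n\n{text}"}],
    )
    return response.choices[0].message.content
```

The wrapper costs a few lines now; the payoff is that a vendor switch later touches configuration and evaluation, not every call site.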
Multi-vendor capability evaluation. Before committing to a vendor-specific feature, evaluate the equivalent on alternative vendors. Often the differentiating capability is matched (or close enough) elsewhere, and the architectural decision can accommodate either. If the feature is genuinely unique and central to your product, the lock-in is acceptable; if it’s marginal, abstract around it.
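One way to operationalise this: run the same probe task against each candidate before committing. A sketch (the models and pass criterion are placeholders; a real evaluation needs many cases, not one):

```python
import json
from litellm import completion

CANDIDATES = ["gpt-4o-mini", "claude-3-5-haiku-20241022"]  # illustrative shortlist

# Probe: does each vendor reliably return valid JSON for our extraction schema?
PROMPT = (
    'Return JSON only, matching {"company": string, "amount_usd": number}:\n'
    "Acme Corp raised $12 million in Series A funding."
)

def passes(model: str) -> bool:
    out = completion(model=model, messages=[{"role": "user", "content": PROMPT}])
    try:
        parsed = json.loads(out.choices[0].message.content)
        return "company" in parsed and "amount_usd" in parsed
    except (json.JSONDecodeError, TypeError):
        return False

for model in CANDIDATES:
    print(f"{model}: {'pass' if passes(model) else 'fail'}")
```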
Embedding-store discipline. Pick an embedding model with explicit migration considerations. OpenAI text-embedding-3 with Matryoshka dimensions, Cohere with stable versioning, or self-hosted open-source models with reproducible weights. Avoid pinning to embedding models with unclear backward-compatibility policies.
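A cheap companion discipline: record which model produced every stored vector, so a future migration is a filtered re-embed rather than a blind rebuild. A sketch (field names are illustrative; most vector stores accept per-record metadata):

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class EmbeddingRecord:
    """A vector plus the provenance needed to migrate it later."""
    doc_id: str
    vector: list[float]
    # Provenance: without these fields you cannot tell which vectors
    # were produced by which model when it is time to switch.
    model: str = "text-embedding-3-small"   # illustrative default
    dimensions: int = 1536
    embedded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

record = EmbeddingRecord(doc_id="doc-42", vector=[0.0] * 1536)
metadata = asdict(record)  # upsert alongside the vector in your store
```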
Fine-tuning portability. Where possible, use open-source base models for fine-tuning so the trained weights remain yours. Vendor fine-tuning APIs (OpenAI, Anthropic) produce models tied to their infrastructure; the work doesn’t transfer. Decide which lock-in is acceptable per workload.
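A minimal sketch of the portable path using `transformers` and `peft` (the base model and hyperparameters are placeholders): the trained adapter lands on your own disk and moves to any host that can run the base model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model  # pip install transformers peft

BASE = "meta-llama/Llama-3.1-8B"  # placeholder: any open-weights base model

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# LoRA freezes the base model and trains small adapter matrices;
# the artefact you own is the adapter, typically well under a gigabyte.
peft_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

# ... training loop elided ...
model.save_pretrained("adapters/my-task")  # yours to version, move, and re-host
```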
Periodic vendor re-evaluation. Set a calendar reminder every 6 months to re-evaluate your AI vendor decisions. Pricing, capability, and policy shifts are frequent; the right answer changes. Without that cadence, you ossify on early decisions.
Calibrating the trade-offs
Not all lock-in is bad. The mitigation has engineering and cost overhead; over-mitigating produces brittle abstractions that fight the vendor’s actual capabilities. The framework:
- For workloads where you genuinely depend on one vendor’s flagship capability (very long context, specific reasoning depth, unique tool integrations), the lock-in is the cost of access to the capability. Accept it; the alternative is shipping a worse product.
- For routine workloads (classification, summarisation, structured extraction), the capability is broadly equivalent across vendors. Mitigate lock-in here aggressively; nothing valuable is being protected by single-vendor dependency.
- For workloads with large embedded data (RAG over millions of documents, vector-search products), embedding-store lock-in is the highest-impact risk. Invest in mitigation early; retrofitting is expensive.
- For workloads with regulatory or compliance requirements, the vendor’s compliance posture is the binding constraint. Lock-in is acceptable if the vendor’s posture meets your needs; verify the alternative vendors’ posture before relying on a switch option.
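Condensed into a decision rule, a sketch (the categories and their ordering are this article’s framework, not an industry standard):

```python
def mitigation_posture(compliance_bound: bool, unique_capability: bool,
                       large_embedded_corpus: bool) -> str:
    """Map workload traits to a lock-in posture, per the framework above."""
    if compliance_bound:
        return "accept if the vendor's compliance posture holds; verify alternatives first"
    if unique_capability:
        return "accept: lock-in is the price of access to the capability"
    if large_embedded_corpus:
        return "mitigate early: embedding lock-in is expensive to retrofit"
    return "mitigate aggressively: routine workloads gain nothing from single-vendor dependency"

print(mitigation_posture(False, False, True))
```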
What lock-in actually costs when it bites
When lock-in bites (a price increase, a deprecation, an outage), the migration cost is real but manageable if the mitigations above are already in place. The larger payoff of optionality is strategic: a credible ability to switch strengthens pricing negotiations and shortens recovery when a vendor falters.
How AI vendor decisions typically go wrong
Optimising for current-best capability without considering switching cost. Picking the leader on a 2024 benchmark commits you to that vendor’s roadmap. If their priorities diverge from yours, you’re stuck.
Skipping abstraction because “we’re moving fast.” The abstraction layer is small overhead at the start and large savings later. Teams that defer abstraction repeatedly are not actually moving fast on the things that matter.
Treating all vendors as equivalent. Not every vendor’s lock-in is the same. Pricing volatility, deprecation cadence, geographic availability, and compliance posture all differ. Evaluate vendors on these dimensions, not just on demo quality.
Ignoring capacity / availability risk. AI vendors have had outages, capacity-constraint periods, and policy-driven access changes. Single-vendor dependency means your product is exposed to all of these.
Over-engineering for switch-ability. The inverse failure mode — building elaborate abstractions to protect against unlikely switches. Calibrate mitigation to realistic risk; protect against the high-impact and high-probability lock-in, accept lock-in elsewhere.
Related work
For the broader open-source-vs-proprietary framework, see Open-source vs proprietary AI — practical tradeoffs. For the local-vs-cloud architecture decisions, see When to run AI locally vs in the cloud. For the economics behind the lock-in trade-offs, see Tokens, context windows, and what they cost. For the cost-vector framework, see The hidden costs of “free” AI tools.
FAQ
Should we use Azure OpenAI or AWS Bedrock instead of direct vendor APIs to reduce lock-in?
Partially. Azure OpenAI serves OpenAI's models on Microsoft infrastructure; Bedrock aggregates multiple model vendors inside AWS. Both reduce some lock-in (unified cloud-vendor integration and billing) but introduce a different kind (the cloud vendor's specific deployments, regions, and pricing). In practice the decision usually comes down to which cloud you're already on, not to lock-in alone.
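Worth noting: an API abstraction layer can front all three routes, which keeps even the cloud-vendor question reversible. A sketch with LiteLLM (the deployment names and model IDs below are placeholders; each route needs its own credentials configured):

```python
from litellm import completion

# The same call shape reaches a direct vendor API, Azure OpenAI, or Bedrock.
ROUTES = {
    "direct": "gpt-4o-mini",
    "azure": "azure/my-gpt4o-deployment",  # your Azure OpenAI deployment name
    "bedrock": "bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0",
}

response = completion(
    model=ROUTES["azure"],
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)
```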
Is LiteLLM / LangChain abstraction enough?
For API-level abstraction, yes — switching vendors is a config change once the abstraction is in place. For deeper lock-in (embeddings, fine-tunes, capability-specific features), the abstraction layer doesn't help; those require deeper architectural choices. Use the abstraction at the API level; address the other lock-in types separately.
How often should we re-evaluate vendor choices?
Every 6 months for active production AI workloads; annually for stable ones. The category moves fast enough that quarterly re-evaluation is overkill (the cost of re-evaluation exceeds the value of the changes) but annual is too slow (changes accumulate that could have been acted on).
Can we negotiate against the lock-in?
Yes — enterprise contracts increasingly include provisions for model-version stability, advance notice of deprecations, and pricing predictability. Use the lock-in awareness as negotiation leverage; vendors that won't commit to stability are signalling future lock-in pain.