Cyberax AI Playbook
Explainer · Foundations

Data leakage in AI tools — what to watch for

**Data leakage** occurs when information you send to an AI tool ends up somewhere you didn't intend — used to train the vendor's next model, retained on their servers, exposed to other customers, or surfaced in a public search index. This page walks through the specific paths by which data can leak, what the vendors' marketing pages downplay, and which deployments are safe versus which need redesign.

At a glance · Last verified May 2026

  • **Problem solved:** Identify the specific data-leakage paths in AI tools — training, retention, cross-tenant exposure, search indexing, session-level — and apply the controls that keep confidential information from leaving your environment in ways the marketing pages obscure.
  • **Best for:** Privacy officers, IT directors, compliance leads, founders managing data sensitivity, and anyone deploying AI on regulated or confidential data.
  • **Tools:** OpenAI, Anthropic, Google Cloud, Microsoft, AWS.
  • **Difficulty:** Intermediate.
  • **Cost:** Controls range from $0 (configuration choices) to $50,000+/year (enterprise-tier subscriptions and DLP).

Information sent to an AI tool can end up somewhere you didn't intend: used to train the vendor's next model, sitting on their servers longer than expected, leaking to other customers in rare failure modes, surfacing in a public search index, or exposed inside a user session that crossed a permission boundary. Each path has its own controls.

The “what happens to our data” question gets a different answer depending on which vendor representative answers it. “We never use customer data for training” is the marketing posture; the contract carries the nuance: plan-tier differences, default settings, and exceptions. The honest picture: data sent to AI tools has multiple potential leakage paths, each governed by different controls, and safe deployment means understanding all of them rather than trusting a single reassurance.

What comes next is the leakage map: the specific paths data takes through AI vendors, what the controls actually do, and the deployment patterns that match each business’s risk profile.

The leakage paths

Where data actually goes

Training corpus path. Some vendor tiers (typically consumer/free) use customer data to train their models. The data may be incorporated into model weights in ways that can be extracted via clever queries. Mitigation: use enterprise/API tiers that contractually exclude data from training, verify the exclusion in writing, and understand the default settings (some tiers train by default and require an explicit opt-out).

Retention path. Data is stored by the vendor for some retention period — for debugging, abuse prevention, or compliance with subpoenas. Retention windows range from 0 days (some enterprise tiers) to 30 days (most defaults) to indefinite (some consumer settings). The data is accessible to the vendor’s staff and to law enforcement under valid process. Mitigation: enterprise zero-retention agreements and explicit data-deletion guarantees in the contract.

Cross-tenant exposure (rare). Vendor infrastructure shares resources across customers; in extreme failure modes, data from one customer can leak to another. Documented incidents are rare but have occurred. Mitigation: enterprise-tier isolation features (dedicated capacity, VPC deployment), vendor compliance certifications (SOC 2 Type II), and incident-response policies.

Search-indexing path. Some AI tools (Custom GPTs with public sharing, AI-enabled site search, certain RAG products) may index data in ways that surface it to other users or in public search results. Most commonly seen with insufficiently restricted file uploads or shared assistant configurations. Mitigation: explicit access-control review of every shared AI artifact; periodic audits.

Session-level leakage. Within a single user’s session, the AI may have access to data the user shouldn’t see (e.g., their colleague’s chat history if poorly configured), or may reveal context from prior conversations in ways that surprise users. Mitigation: clear session boundaries, no auto-import of context, explicit user awareness of what’s shared.

Employee-mediated path. The most common path in practice. Employees paste confidential information into AI tools under their own personal accounts rather than the company’s sanctioned subscription. The data goes to the AI vendor under consumer-tier terms (which may include training use). Mitigation: enterprise-tier subscriptions that employees prefer to use, an acceptable-use policy, and DLP tools.

The controls

What actually prevents leakage

Vendor-tier selection. Enterprise tiers consistently offer stronger data handling: no training on customer data, configurable retention (often zero), regional data residency, dedicated capacity. Pay for the enterprise tier when data sensitivity warrants it; the cost is small relative to the leakage exposure.

Contractual review. Read the data-handling clauses. The marketing assertion (“we don’t use data for training”) often has caveats in the contract (default settings, plan-tier differences, opt-out requirements). The contract is what actually binds; the marketing page is a suggestion.

Architectural choice. For very sensitive data, the strongest control is keeping it away from the AI vendor entirely. Self-hosted open-source models running on infrastructure you control eliminate vendor-side leakage paths. See Open-source vs proprietary AI for the framework.

Redaction at the input boundary. Strip identifying information before sending data to AI tools. Names, emails, account IDs, and payment details can be replaced with placeholders that preserve the analytical value without exposing identity. This reduces leakage severity even when vendor controls are imperfect.
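
A minimal sketch of what placeholder redaction can look like, assuming a small Python gateway sits between users and the AI API. The `redact` helper, the patterns, and the `ACCT-` ID format are illustrative assumptions, not a complete PII detector; production systems typically combine patterns like these with named-entity recognition and human review.

```python
import re

# Illustrative patterns only; real deployments need patterns tuned to your
# own data (names, internal record IDs, customer numbers, and so on).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "ACCOUNT_ID": re.compile(r"\bACCT-\d{6,}\b"),  # hypothetical internal ID format
}

def redact(text: str) -> tuple[str, dict[str, str]]:
    """Swap sensitive values for placeholders before text leaves your
    environment; the mapping stays local so responses can be re-identified."""
    mapping: dict[str, str] = {}
    for label, pattern in PATTERNS.items():
        def substitute(match: re.Match, label: str = label) -> str:
            placeholder = f"<{label}_{len(mapping) + 1}>"
            mapping[placeholder] = match.group(0)
            return placeholder
        text = pattern.sub(substitute, text)
    return text, mapping

redacted, mapping = redact("Refund jane.doe@example.com on card 4111 1111 1111 1111")
print(redacted)  # Refund <EMAIL_1> on card <CARD_2>
# `mapping` holds the originals and never needs to reach the vendor.
```

The design choice worth noting is reversibility: because the mapping never leaves your environment, the AI's answers can be re-identified locally without the vendor ever seeing the original values.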

Acceptable-use policy. Explicit guidance on what data can go where. It prohibits employees from pasting confidential information into consumer AI tools and specifies which sanctioned tiers handle which data classifications. The policy is enforceable only if backed by reasonable alternatives — pure prohibition without sanctioned options produces shadow use.

DLP and monitoring. Data-loss-prevention tools (browser extensions, endpoint agents, network monitoring) can flag attempts to upload sensitive content to AI tools. Imperfect, but they materially reduce the easiest leakage vector.
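
For illustration, a rough sketch of the kind of pre-send check a lightweight DLP hook might run, assuming outbound prompts pass through a proxy or extension you control. The `check_outbound` helper, the classification markers, and the SSN pattern are hypothetical examples; commercial DLP products do far richer detection than this.

```python
import re

# Hypothetical signals of sensitive content; tune the markers and patterns
# to your own data-classification scheme before relying on them.
CLASSIFICATION_MARKERS = ("CONFIDENTIAL", "INTERNAL ONLY", "ATTORNEY-CLIENT")
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def check_outbound(text: str) -> tuple[bool, list[str]]:
    """Return (allowed, reasons) for a prompt bound for an external AI tool."""
    reasons: list[str] = []
    upper = text.upper()
    for marker in CLASSIFICATION_MARKERS:
        if marker in upper:
            reasons.append(f"classification marker found: {marker}")
    if SSN_PATTERN.search(text):
        reasons.append("possible SSN detected")
    return (not reasons, reasons)

allowed, reasons = check_outbound("CONFIDENTIAL: board deck notes, applicant SSN 123-45-6789")
print(allowed, reasons)
# False ['classification marker found: CONFIDENTIAL', 'possible SSN detected']
```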

The numbers

What leakage actually costs and how often it happens

  • **Frequency of employee data leakage to AI consumer tools:** Common at growing companies without an explicit policy and enterprise alternatives.
  • **Cost of data leakage — minor (internal documents, non-regulated data):** Mostly reputational and competitive; rarely material in dollars.
  • **Cost of data leakage — major (PHI, PII, regulated data):** Significant — GDPR fines (up to 4% of global revenue or €20M, whichever is higher), HIPAA fines (per-violation tiers up to ~$2.13M annual cap for the willful-neglect-uncorrected tier under current OCR inflation adjustments), class-action exposure.
  • **Cost of enterprise-tier AI subscription per seat (vs consumer):** Typically 1.5–3x more; small relative to the leakage exposure.
  • **Time to deploy enterprise-tier AI controls:** 1–3 months for policy, contracts, and rollout.
  • **Effectiveness of layered controls (policy + tools + alternatives):** High — leakage drops substantially vs single-control approaches.

The math favors investment in proper controls; the leakage exposure for material data significantly exceeds the cost of controls.
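
To make that comparison concrete, here is a back-of-the-envelope sketch. Every figure in it is a hypothetical placeholder (seat count, tier pricing, incident probability, incident cost, risk reduction), not an estimate from this page; the point is the structure of the calculation, to be rerun with your own numbers.

```python
# Every figure below is a hypothetical placeholder; substitute your own
# seat counts, tier pricing, and exposure estimates.
seats = 100
consumer_annual = 25 * 12 * seats        # sanctioned consumer-tier spend ($25/seat/month)
enterprise_annual = 60 * 12 * seats      # enterprise-tier spend ($60/seat/month)
other_controls = 20_000                  # policy work, contract review, DLP tooling

extra_control_cost = (enterprise_annual - consumer_annual) + other_controls

incident_probability = 0.10              # assumed chance of a material leak per year
incident_cost = 1_500_000                # assumed fines, breach response, legal, lost deals
risk_reduction = 0.80                    # assumed effect of layered controls

expected_loss_avoided = incident_probability * incident_cost * risk_reduction

print(f"Extra annual cost of controls: ${extra_control_cost:,}")        # $62,000
print(f"Expected annual loss avoided:  ${expected_loss_avoided:,.0f}")  # $120,000
```

With these placeholder figures the controls cost roughly half of the expected loss they avoid; the value of the exercise is running it with your own numbers before concluding a tier upgrade is too expensive.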

What's next

Related work

For the broader privacy framework, see AI privacy — what to watch for. For the security risks that complement leakage, see AI security risks for businesses. For the open-source-vs-proprietary architectural choice, see Open-source vs proprietary AI — practical tradeoffs. For the procurement framework, see AI procurement checklist for non-technical buyers.

Common questions

FAQ

If our data is encrypted, doesn't that prevent leakage?

Encryption in transit and at rest is necessary but not sufficient. The vendor still processes the data in plaintext at the inference layer (otherwise the model couldn't read it). The encryption claim covers storage and movement; the model's access to plaintext during inference is where most leakage paths actually originate.

How is enterprise-tier different from consumer for data handling?

Three main differences. (1) Contractual no-training commitment for enterprise (some consumer tiers train by default). (2) Lower or zero data retention (consumer is typically 30 days; enterprise can be zero). (3) Tenant isolation features that may not be available on consumer. The specifics vary by vendor; read each enterprise tier's actual terms.

What about HIPAA — can we use ChatGPT or Claude for PHI?

Only with a Business Associate Agreement (BAA). OpenAI, Anthropic, Microsoft Azure OpenAI, and Google Cloud all offer BAAs on enterprise tiers. Consumer tiers don't support BAAs and shouldn't process PHI. The HIPAA path is straightforward, but it requires the right tier and contract.

Should we just self-host to eliminate vendor data risk?

For very sensitive data, yes. The cost (engineering time, infrastructure) is substantial, but self-hosting eliminates vendor-side leakage paths entirely. For most workloads, an enterprise-tier API with appropriate controls is sufficient and more practical. Reserve self-hosting for the workloads where vendor exposure is genuinely unacceptable.

Sources & references

Change history (1 entry)
  • 2026-05-13 Initial publication.