Summary
Personally Identifiable Information (PII) no longer stays confined to databases. In Gen AI systems, it flows across prompts, embeddings, retrieval pipelines, and generated outputs — often beyond the boundaries organizations think they control.
This article explores how Gen AI reshapes privacy risks, why traditional controls fall short, and what it takes to design systems that manage exposure, not just data access.
Target Audience
- Data engineers / data architects building AI or RAG pipelines
- Machine learning engineers / AI practitioners working with LLMs
- Security, privacy, and governance professionals responsible for data protection
- Product managers designing AI-powered features
- Developers using tools like Copilot / ChatGPT with access to sensitive data
Outline
- What Does “PII in Freefall” Mean?
- Where the Boundary Breaks
- Why Traditional Controls Fall Short
- So What Should We Do Instead?
- The Shift: From Data Protection to Exposure Control
Most privacy frameworks assume one thing:
Data is stored, queried, and controlled within defined boundaries.
Gen AI breaks that assumption.
Data is no longer just:
- Stored in tables
- Queried through SQL
- Governed by access control
Instead, it is:
- Embedded into prompts
- Transformed into embeddings
- Reconstructed through generation
Research has shown that large language models can memorize and reproduce training data, including sensitive information, turning privacy into a system-wide concern rather than a storage problem (Carlini et al., 2021; Bommasani et al., 2021). And in this process, PII is quietly entering freefall.
What Does “PII in Freefall” Mean?
By “freefall,” I mean that PII is no longer anchored to a single system boundary — it moves across layers where traditional controls no longer apply.
In traditional systems, PII has structure:
- columns such as email, nric_id, or phone_number
- tables
- access policies
You know:
- where it is
- who can access it
- how it is used
With Gen AI, PII becomes fluid.
It can:
- appear in prompts
- be encoded in embeddings
- surface in generated outputs
- be recombined in ways you did not explicitly design
PII is no longer confined to a location — it is distributed across the system lifecycle.
Where the Boundary Breaks
Let’s walk through a typical Gen AI pipeline.
1. Prompt Layer
Users or systems provide input directly to the model.
Example:
A developer pastes internal source code into Copilot or ChatGPT to debug an issue.
That code may contain:
- proprietary algorithms
- API keys
- internal endpoints
Even if the model does not explicitly “store” it in a database, the data has already crossed a boundary.
Similarly:
A user chats with ChatGPT and shares personal details — name, address, health concerns — while asking for advice.
From a privacy perspective:
- the prompt itself is already sensitive data
- logging, monitoring, or misuse of prompts becomes a risk surface (one mitigation is sketched after the questions below)
Questions to ask:
- Are prompts logged?
- Who can access those logs?
- Are users aware of what they are sharing?
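If prompts must be logged at all, say for debugging or audit, one option is to log only a redacted form. Here is a minimal sketch; the regex patterns are illustrative stand-ins, and a production system would normally use a dedicated PII and secrets scanner rather than hand-rolled expressions.

```python
import re

# Illustrative patterns only; real deployments should use a dedicated
# PII/secrets scanner rather than hand-rolled regexes.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "API_KEY": re.compile(r"(?i)(?:api[_-]?key|token)\s*[:=]\s*\S+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def redact_for_logging(prompt: str) -> str:
    """Replace matched spans with typed placeholders before logging."""
    for label, pattern in PATTERNS.items():
        prompt = pattern.sub(f"[{label}]", prompt)
    return prompt

# Log the redacted form, never the raw prompt.
raw = "Debug this: api_key=sk-12345 sent from jane@example.com"
print(redact_for_logging(raw))  # Debug this: [API_KEY] sent from [EMAIL]
```

The same check can run client side, before a prompt ever leaves a developer's machine.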
2. Embedding Layer
Text is converted into vectors for retrieval.
At this point, PII is no longer readable — but it still exists in encoded form.
A common misconception:
“If it’s not readable, it’s safe.”
In reality:
- embeddings can still be queried
- similar inputs can retrieve related sensitive content
- encoded data can still influence outputs
This aligns with broader research showing that transformed representations can still leak information under certain conditions (Fredrikson et al., 2015; Carlini et al., 2021).
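To make this concrete, here is a toy illustration of why "not readable" does not mean "not retrievable". The embed function is a deliberately crude stand-in for a real embedding model; the point is only that vectors built from PII-bearing text remain queryable by meaning.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Crude hashed bag-of-words vector; a stand-in for a real embedding model."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# PII-bearing documents indexed as vectors: no longer human-readable,
# but still queryable.
docs = [
    "Jane Tan, NRIC S1234567D, diagnosed with hypertension",
    "Quarterly sales figures for the APAC region",
]
index = np.stack([embed(d) for d in docs])

# A query about the *topic* pulls back the sensitive record.
query = embed("patient diagnosed with hypertension")
scores = index @ query  # cosine similarity, since all vectors are unit-norm
print(docs[int(np.argmax(scores))])  # the record containing Jane's NRIC
```

Nobody ever reads the stored vectors, yet a topical query surfaces the sensitive record behind them.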
3. Retrieval (RAG)
Systems retrieve relevant documents and inject them into the model context.
Real-world risk:
An internal knowledge base contains customer records. A poorly scoped retrieval query pulls in sensitive customer details and feeds them into the LLM.
Now:
- the model sees data the user may not be authorized to access
- the output may expose it
This is especially dangerous because:
- access control is often enforced at storage, not retrieval
- retrieval systems often optimize for relevance, not authorization
- RAG pipelines are sometimes treated as “just search,” not as a sensitive boundary (a minimal illustration follows)
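Here is that failure mode in miniature, with a crude keyword-overlap scorer standing in for a real vector store; the corpus and query are invented for the example.

```python
# Toy corpus standing in for an internal knowledge base.
CORPUS = [
    "Customer 4417: billing dispute, phone +65 9123 4567, owes $1,200",
    "FAQ: how to reset a password",
]

def retrieve(query: str, top_k: int = 1) -> list[str]:
    # Ranked purely by relevance; authorization never enters the picture.
    q = set(query.lower().split())
    return sorted(CORPUS, key=lambda d: -len(q & set(d.lower().split())))[:top_k]

def build_context(user_query: str) -> str:
    # Whatever the retriever ranks highest goes straight into the prompt,
    # regardless of whether the *caller* is allowed to see it.
    return "Context:\n" + "\n".join(retrieve(user_query)) + f"\n\nQuestion: {user_query}"

print(build_context("what is the phone number of customer 4417"))
```

Any user who can phrase the right question effectively reads the record, whatever the storage-layer ACLs say.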
4. Generation Layer
This is where exposure becomes visible.
The model may:
- reproduce memorized PII content
- infer missing personal details
- combine fragments into identifiable information
This is not hypothetical.
- Models have been shown to reproduce training data verbatim under certain prompts (Carlini et al., 2021)
- Systems can reveal whether specific records were part of training data (Shokri et al., 2017)
No database query.
No direct access.
The system didn’t “store” PII here — but it still leaked it.
Why Traditional Controls Fall Short
Traditional privacy controls assume:
- structured data
- role-based access
- controlled queries
Gen AI systems operate differently:
- unstructured inputs
- prompt-driven access
- cross-layer data flow
- probabilistic outputs
This mismatch creates blind spots.
The system may be “secure” at rest — but still leak information in motion.
So What Should We Do Instead?
The answer is not one technique — it’s a system-level approach.
1. Treat Prompts as Sensitive Data
- Apply PII detection before sending prompts
- Avoid logging raw prompts where possible
- Mask or redact inputs dynamically (see the sketch after this list)
- Educate users (e.g., developers, customers) about what not to share
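One way to mask dynamically is reversible placeholder substitution: PII is swapped out before the prompt leaves your boundary and restored locally in the model's response. A minimal sketch, assuming a single email detector for brevity:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def mask(prompt: str) -> tuple[str, dict[str, str]]:
    """Swap PII for placeholders before the prompt leaves the trust boundary."""
    mapping: dict[str, str] = {}

    def _swap(match: re.Match) -> str:
        token = f"<PII_{len(mapping)}>"
        mapping[token] = match.group(0)
        return token

    return EMAIL.sub(_swap, prompt), mapping

def unmask(text: str, mapping: dict[str, str]) -> str:
    """Restore the original values in the model's response, locally."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text

masked, mapping = mask("Draft a reply to jane@example.com about her refund")
print(masked)  # Draft a reply to <PII_0> about her refund
# ...send `masked` to the model, then run unmask(response, mapping) locally.
```

The mapping never leaves your infrastructure, so the provider only ever sees placeholders.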
2. Enforce Access Control in Retrieval
- Apply row-level and column-level filters before retrieval
- Align RAG pipelines with governance policies (an authorization-aware retriever is sketched below)
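In the same toy style as the retrieval example earlier, the fix is to filter on authorization before ranking, so unauthorized documents are never candidates for the model context. The schema and group names here are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Record:
    text: str
    allowed_groups: set[str]  # hypothetical governance metadata

RECORDS = [
    Record("Customer 4417: billing dispute, phone +65 9123 4567",
           allowed_groups={"support_leads"}),
    Record("FAQ: how to reset a password", allowed_groups={"everyone"}),
]

def retrieve_authorized(query: str, caller_groups: set[str]) -> list[str]:
    # The authorization filter runs *before* relevance ranking: documents
    # the caller may not see are never candidates, so they cannot leak
    # through the model's output.
    visible = [r for r in RECORDS
               if r.allowed_groups & (caller_groups | {"everyone"})]
    q = set(query.lower().split())
    visible.sort(key=lambda r: -len(q & set(r.text.lower().split())))
    return [r.text for r in visible]

# An ordinary agent never sees the customer record, however the query is phrased.
print(retrieve_authorized("phone number of customer 4417", caller_groups={"agents"}))
```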
3. Apply Output Filtering
- Scan generated responses for PII
- Block or redact sensitive outputs (sketched below)
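A minimal output-guard sketch. The two regex detectors are illustrative only; a production filter would typically combine patterns with a trained PII recognizer:

```python
import re

# Illustrative detectors; combine with NER-based PII detection in practice.
DETECTORS = [
    re.compile(r"\b[STFG]\d{7}[A-Z]\b"),     # Singapore NRIC-style ID
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # email address
]

def guard_output(generated: str, policy: str = "redact") -> str:
    """Apply the output policy: redact matched spans, or block the whole reply."""
    if policy == "block" and any(d.search(generated) for d in DETECTORS):
        return "Response withheld: the generated text contained PII."
    for d in DETECTORS:
        generated = d.sub("[REDACTED]", generated)
    return generated

print(guard_output("Her NRIC is S1234567D, email jane@example.com"))
# -> Her NRIC is [REDACTED], email [REDACTED]
```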
4. Use PETs at the Right Layer
- Pseudonymisation → internal linking (sketched after this list)
- Aggregation → analytics
- Differential privacy → external reporting
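As one concrete example at the pseudonymisation layer, keyed hashing keeps records linkable internally without the raw identifier ever travelling downstream. A sketch using Python's standard library; the key and token format are invented for illustration:

```python
import hashlib
import hmac

# The key must live in a secrets manager, separate from the data it protects.
SECRET_KEY = b"rotate-me-and-store-outside-the-dataset"  # illustrative only

def pseudonymise(identifier: str) -> str:
    """Map an identifier to a stable, non-reversible internal token."""
    digest = hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256)
    return "pid_" + digest.hexdigest()[:16]

# The same NRIC always yields the same token, so joins still work internally,
# but downstream systems never see the raw value.
print(pseudonymise("S1234567D"))
```

Unlike plain redaction, the same person's records remain joinable across tables, which is exactly the internal-linking use case above.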
5. Redesign the Pipeline Around Trust Boundaries
Instead of asking:
“Is the data safe?”
Ask:
“At which stage can this data be exposed — and to whom?”
The Shift: From Data Protection to Exposure Control
Traditional mindset:
Protect the dataset
Gen AI reality:
Control how information flows and emerges
This is a fundamental shift.
Final Thought
Gen AI doesn’t just introduce new capabilities — it changes the nature of data itself.
Data is no longer static.
It moves, transforms, and reappears in unexpected ways.
And when it comes to PII:
The line between safe and exposed is no longer clear — it is constantly shifting.
If you work with AI systems:
Privacy is no longer just a compliance requirement.
It is an architectural decision.
It cannot be enforced only at the edges of the system; it must be designed into every layer.
References
- Bommasani et al. (2021). "On the Opportunities and Risks of Foundation Models." arXiv:2108.07258.
- Carlini et al. (2021). "Extracting Training Data from Large Language Models." USENIX Security Symposium.
- Fredrikson et al. (2015). "Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures." ACM CCS.
- Shokri et al. (2017). "Membership Inference Attacks Against Machine Learning Models." IEEE Symposium on Security and Privacy.
