Generative AI changes the threat model for personally identifiable information in a specific and underappreciated way. In a traditional database architecture, PII is relatively easy to silo: access controls, field-level encryption, and query logging provide a manageable perimeter. In an AI pipeline, the same data becomes fluid. It moves through prompt buffers, gets embedded into vector representations, passes through inference endpoints, and surfaces in model outputs — often in forms that were not anticipated at design time.
The governance challenge is not simply preventing unauthorised access. It is ensuring that PII cannot reach any point in the pipeline where it is not supposed to be — and being able to prove to a regulator that it did not. These are architectural problems, not policy problems, and they require architectural solutions.
1. The Ingress Gateway: Intercepting PII Before It Reaches the Model
Why the Ingress Point Is the Highest-Risk Moment
The moment a user query or a database record is pulled for AI processing is the point of maximum PII exposure risk. If raw personal data reaches a model endpoint — whether an external API or an internal inference server — several things happen at once that are difficult to reverse: the data may be written to inference logs, cached in the model's KV cache, or, in the case of an external API, retained under the provider's data retention policy.
Intercepting PII before it crosses that threshold is therefore the first and most important governance control. Everything downstream depends on it being in place.
The Gateway Architecture
A PII gateway is a middleware layer that sits between your data sources and your AI inference endpoint. It performs three sequential operations on every payload that passes through it:
- Named Entity Recognition (NER): the gateway runs a language model or rules-based classifier over the incoming text to identify PII entities — names, account numbers, National Insurance or Social Security numbers, dates of birth, postcodes, medical codes, IP addresses, and any domain-specific identifiers relevant to your sector.
- Token assignment: each identified entity is replaced with a consistent pseudonymous token. 'Sarah Chen' becomes [USER_8821]. The same real value always maps to the same token within a defined scope (session, document set, or dataset — depending on your use case).
- Mapping table write: the real-value-to-token pair is written to an encrypted mapping table stored inside the secure perimeter, isolated from the inference environment. This table is the re-identification key and is the most sensitive asset in the pipeline.
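The three operations can be sketched in a few dozen lines. This is a minimal illustration, not a production gateway: the regex patterns stand in for a real NER model, the token format and class names are invented for this example, and the in-memory dictionary stands in for the encrypted mapping table.

```python
import hashlib
import hmac
import re

# Illustrative entity patterns standing in for an NER model.
# The NI pattern is deliberately simplified.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "NI_NUMBER": re.compile(r"\b[A-Z]{2}\d{6}[A-D]\b"),  # UK National Insurance
}

class PiiGateway:
    def __init__(self, secret: bytes):
        self._secret = secret
        self.mapping = {}    # token -> real value: the re-identification key
        self._reverse = {}   # real value -> token, for consistent reuse

    def _token_for(self, entity_type: str, value: str) -> str:
        # Token assignment: the same real value always maps to the
        # same token within this gateway instance's scope.
        if value in self._reverse:
            return self._reverse[value]
        digest = hmac.new(self._secret, value.encode(), hashlib.sha256)
        token = f"[{entity_type}_{digest.hexdigest()[:4].upper()}]"
        self._reverse[value] = token
        # Mapping table write: encrypt at rest and isolate from the
        # inference environment in a real deployment.
        self.mapping[token] = value
        return token

    def sanitise(self, text: str) -> str:
        # 1. Entity recognition; 2. token assignment; 3. mapping write.
        for entity_type, pattern in PATTERNS.items():
            text = pattern.sub(
                lambda m: self._token_for(entity_type, m.group()), text
            )
        return text
```

Because the same value yields the same token, a repeated identifier in a later payload is replaced consistently rather than generating a fresh token.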
Correcting a Common Implementation Error: Stateless vs Stateful
Gateway documentation frequently describes this layer as 'stateless middleware' — a misleading label. A stateless system processes each request with no memory of prior requests. A tokenisation gateway, by definition, must maintain state: it needs to know that 'Sarah Chen' was previously mapped to [USER_8821] so it can apply the same token consistently across a multi-turn conversation or a multi-document dataset.
The correct framing is that the gateway is stateless with respect to business logic — it applies no domain reasoning, makes no decisions about the content — but stateful with respect to the token mapping. The mapping store should be treated as a high-security dependency, not an incidental component.
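The distinction is easy to see in code. The sketch below, with invented names, shows a mapping store that applies no business logic but must remember prior assignments: the same value in turn one and turn three of a session gets the same token, while a new scope starts a fresh mapping table.

```python
from collections import defaultdict
from itertools import count

class TokenStore:
    """Stateful token mapping, scoped per session/document set/dataset."""

    def __init__(self):
        self._scopes = defaultdict(dict)                 # scope -> {value: token}
        self._counters = defaultdict(lambda: count(1))   # fresh counter per scope

    def token(self, scope: str, value: str) -> str:
        mapping = self._scopes[scope]
        if value not in mapping:
            # No domain reasoning: assignment is purely mechanical.
            mapping[value] = f"[USER_{next(self._counters[scope]):04d}]"
        return mapping[value]

store = TokenStore()
# Turns 1 and 3 of the same conversation: the token must be identical.
t1 = store.token("session-42", "Sarah Chen")
t3 = store.token("session-42", "Sarah Chen")
# A different scope starts its own mapping; assignments do not carry over.
other = store.token("session-99", "Sarah Chen")
```

The choice of scope is a governance decision, not an implementation detail: session-level scoping limits linkage to a single conversation, while dataset-level scoping enables cross-document analysis at the cost of a larger re-identification surface.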
Tooling
- Microsoft Presidio (open source): NLP-based PII detection with 50+ entity types and customisable recognisers. Deployable on-premises. The most practical starting point for most enterprise deployments.
- Private AI: commercial offering with higher detection accuracy on domain-specific entity types in healthcare and financial services. Supports streaming pipelines.
- spaCy with custom NER models: appropriate when your sector uses terminology that general-purpose models do not recognise — for example, proprietary financial instrument identifiers or clinical codes not covered by standard healthcare NLP models.
2. Tokenisation vs Redaction: Choosing the Right Tool for the Task
The Distinction That Determines Project Viability
Tokenisation and redaction are not interchangeable privacy techniques that differ only in sophistication. They produce categorically different data structures, and the choice between them determines what your AI system can do.
Redaction is destructive. It replaces a PII entity with a null signal — a blank, a black bar, or a placeholder like [REDACTED]. The original value is gone. The model receives no information about the entity and cannot track it across documents or across a conversation.
Tokenisation (pseudonymisation under GDPR Article 4(5)) is preservative. It replaces a PII entity with a consistent, opaque token. The model receives no identity information but can track the token as a consistent entity — understanding that [USER_8821] is the same person across five documents, has a specific transaction history, and exhibits a particular behavioural pattern.
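The behavioural difference is concrete. In this contrived two-document example (names, documents, and token values invented for illustration), redaction collapses two different people into indistinguishable markers, while tokenisation keeps them distinct and trackable across documents:

```python
# Pre-assigned tokens, as a gateway's mapping table would hold them.
NAMES = {"Sarah Chen": "[USER_8821]", "David Okafor": "[USER_3417]"}

docs = [
    "Sarah Chen transferred £500 to David Okafor.",
    "Sarah Chen filed a complaint.",
]

def redact(text: str) -> str:
    # Destructive: every entity collapses to the same null marker.
    for name in NAMES:
        text = text.replace(name, "[REDACTED]")
    return text

def tokenise(text: str) -> str:
    # Preservative: identity is removed, but entity linkage survives.
    for name, token in NAMES.items():
        text = text.replace(name, token)
    return text

redacted = [redact(d) for d in docs]
tokenised = [tokenise(d) for d in docs]
# redacted[0]:  "[REDACTED] transferred £500 to [REDACTED]."
#   -> the model cannot tell sender from recipient, or link either to doc 2.
# tokenised[0]: "[USER_8821] transferred £500 to [USER_3417]."
#   -> [USER_8821] is trackable into tokenised[1] as the same entity.
```

A fraud-detection or customer-journey use case is only viable under tokenisation, because the analysis depends on exactly the cross-document linkage that redaction destroys.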