Consider a global insurance carrier that wants to use AI to detect fraud across its claims portfolio. The security team, correctly cautious, redacts claimant identities before any data reaches the model. The project launches. The model fails to detect any cross-claim patterns. The reason: the redaction deleted the very entity links the fraud detection logic depended on. Without knowing that [REDACTED] in one claim file is the same person as [REDACTED] in another, the model cannot identify that a single individual filed three nearly identical claims across different regions in six months.
This is not a hypothetical failure mode. It is the defining limitation of redaction as an AI privacy strategy, and it illustrates a principle that applies across industries: the choice between redaction and pseudonymisation is not a privacy choice. It is an architectural choice that determines what your AI system is capable of doing.
1. What Redaction Actually Does to an AI System
The Mechanics
Redaction is a destructive transformation. It identifies a sensitive token — a name, an account number, a National Insurance number — and replaces it with a null signal, typically a blank space, a black bar, or a generic placeholder like [REDACTED]. The original value is gone. No reference to it survives in the processed document.
For human-reviewed documents — legal discovery, freedom of information responses, medical record disclosures — this is the correct tool. A human reader has no need to track entity identity across documents. Redaction prevents disclosure and that is sufficient.
Why It Fails for LLM Workloads
An LLM operates by attending to relationships — between entities, across time, within and across documents. Redaction systematically destroys those relationships. When the model reads:
"[REDACTED] filed a claim on [REDACTED] referencing policy [REDACTED], matching a prior claim filed by [REDACTED] in [REDACTED]."
...it cannot determine whether the two [REDACTED] claimants are the same person, different people, or related entities. It cannot track the timeline. It cannot identify the policy. The analytical value of the document is near zero.
This is not a model capability limitation — it is a data structure problem. You have handed the model a document from which the signal has been removed. No model, regardless of capability, can reason across null values.
Where Redaction Remains Appropriate
Data Redaction is still the correct approach when the AI task does not require entity tracking — summarisation of a single document, classification of document type, extraction of non-sensitive structural elements. If your AI workflow is stateless with respect to identity, redaction is simpler to implement and carries lower re-identification risk. The error is applying it to tasks that are inherently stateful.
2. Pseudonymisation: Preserving Relational Structure Without Exposing Identity
The Standards-Based Definition
Pseudonymisation is defined in GDPR Article 4(5) as the processing of personal data in such a manner that the data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and subject to technical and organisational measures.
The operative phrase is 'additional information kept separately.' Pseudonymisation does not destroy the entity — it separates the entity's identity from its analytical representation. The analytical representation (the token) travels to the model. The identity (the mapping) stays inside the secure perimeter.
How Token Consistency Enables Cross-Document Reasoning
The critical implementation requirement is consistency. Every occurrence of a given real-world entity across your entire dataset must map to the same token. 'Jane Hartley' in claim file A, email thread B, and policy document C must all become [CLAIMANT_0047] — the same token, every time.
With this consistency in place, the model can do what redaction prevents: it can track [CLAIMANT_0047] across time and documents, identify that [CLAIMANT_0047] exhibits a pattern across three claims, and flag that pattern for investigation — without ever learning that [CLAIMANT_0047] is Jane Hartley. The sensitive identity never leaves your environment. The analytical signal is fully preserved.
