At Questa Safe AI, we solved this by building a dedicated anonymization layer that sits between the user’s data and the LLM. Today, I’m going to walk through the architectural decisions behind our engine, specifically how we handle entity detection, concurrency, and complex file reconstruction.
The Core: A Dual-Model Approach
We realized early on that a single Natural Language Processing (NLP) model wasn't enough to catch everything. General Named Entity Recognition (NER) models are great at finding people and locations, but they often miss specific PII patterns like email addresses or ID numbers.
To solve this, our core architecture implements a composite pipeline that loads two distinct models simultaneously:
- NER Model: A DistilBERT-based model tuned for general entities such as Persons, Organizations, and Locations.
- PII Model: A specialized model trained to detect sensitive personal-information patterns, including middle names and names from a wide range of ethnic and linguistic backgrounds.
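In outline, the composite pipeline fans the same text out to both models and pools their spans. The sketch below is illustrative rather than our production API: the real engine wraps Hugging Face token-classification pipelines, but here the detectors are plain callables so the structure stays visible (`composite_detect` and the span-dict shape are names chosen for this post).

```python
from typing import Callable, Dict, List

# A detector maps raw text to entity spans: {"start": int, "end": int, "label": str}.
Detector = Callable[[str], List[Dict]]

def composite_detect(text: str, ner_model: Detector, pii_model: Detector) -> List[Dict]:
    """Run both models over the same text and pool their raw spans.

    The pooled spans are sorted by character offset so the downstream
    merge step can resolve overlaps in a single linear scan.
    """
    spans = ner_model(text) + pii_model(text)
    return sorted(spans, key=lambda s: (s["start"], s["end"]))
```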
The Logic Behind Merging Entities
Running two models introduces a significant engineering challenge: overlapping entities. A common failure mode occurs when Model A tags one character span (say, offsets 10–15) as a Name while Model B tags an overlapping span (offsets 12–20) as an Email. If you simply apply the replacements sequentially, you end up with corrupted, unreadable text.
We solved this by implementing a custom merge algorithm that detects conflicts between the entities the two models return. When spans overlap, the logic merges them into a single continuous span, widening the boundaries to cover both while preserving the label of the primary entity. This ensures that when we rewrite the text, we produce clean, isolated placeholder tokens rather than broken strings.
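A minimal version of that merge can be sketched as follows. The primary-wins rule and span layout mirror the description above; the function name is illustrative.

```python
from typing import Dict, List

def merge_overlapping(spans: List[Dict]) -> List[Dict]:
    """Collapse overlapping character spans into single continuous spans.

    Spans are scanned in offset order; when one overlaps the previous,
    the boundaries widen to cover both and the earlier ("primary")
    entity's label is kept, so redaction never splits a token.
    """
    merged: List[Dict] = []
    for span in sorted(spans, key=lambda s: (s["start"], s["end"])):
        if merged and span["start"] < merged[-1]["end"]:
            # Conflict: widen the previous span, keep its label.
            merged[-1]["end"] = max(merged[-1]["end"], span["end"])
        else:
            merged.append(dict(span))
    return merged
```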
Handling Structured Data: The CSV Challenge
Anonymizing a PDF is difficult, but anonymizing a CSV or Excel file presents a unique set of problems. You cannot simply pass a raw CSV string to an LLM or an NLP model because the lack of sentence structure confuses context-aware models.
We adopted a hybrid approach that combines heuristic rules with NLP, processed via a high-performance multithreading engine.
1. Concurrency for Speed
NLP inference is computationally expensive and slow. Processing a large dataset row-by-row would result in unacceptable latency. To mitigate this, our CSV processor utilizes a thread pool to handle multiple rows concurrently. This allows us to anonymize large datasets in a fraction of the time it would take sequentially.
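A stripped-down version of that row-level fan-out, assuming a per-row anonymizer is passed in as a callable (the names here are illustrative, not our engine's API):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def anonymize_rows(rows: List[List[str]],
                   anonymize_row: Callable[[List[str]], List[str]],
                   max_workers: int = 8) -> List[List[str]]:
    """Process CSV rows concurrently across a thread pool.

    Executor.map preserves input order, so the output rows line up
    with the original file even though they finish out of order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(anonymize_row, rows))
```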
2. Smart Heuristics
For structured data, we don't rely solely on AI models, which can be overkill for simple data types. We built specific logic to handle common fields efficiently:
- Numerics: We explicitly ignore numerical fields to preserve financial context—a critical requirement for our financial reporting features.
- Emails: We use regex patterns to instantly catch email addresses without needing a model inference.
- Headers: We examine column headers; if a column is explicitly named "Full Name," we force a redaction even if the model is unsure about the content.
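The three rules above can be sketched as a single pre-filter that runs before any model call. The header list, placeholders, and regex below are illustrative stand-ins, not our production rule set:

```python
import re
from typing import Optional

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SENSITIVE_HEADERS = {"full name", "name", "email"}

def heuristic_redact(header: str, value: str) -> Optional[str]:
    """Cheap checks run before any model inference.

    Returns the (possibly redacted) cell value, or None to defer
    the decision to the NLP models.
    """
    if header.strip().lower() in SENSITIVE_HEADERS:
        return "[REDACTED]"            # header override, content ignored
    try:
        float(value.replace(",", ""))
        return value                   # numerics pass through untouched
    except ValueError:
        pass
    if EMAIL_RE.fullmatch(value.strip()):
        return "[EMAIL]"               # regex catch, no inference needed
    return None                        # fall through to the models
```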
3. Column-Level Thresholding
One of our most robust features is column-level validation. After processing a sample of the data, the system calculates a "redaction ratio" for each column. If the ratio exceeds a specific threshold (e.g., 80% of the column contains PII), the system stops processing individual cells and wipes the entire column, replacing it with a common placeholder. This ensures consistency and prevents accidental leakage in columns that contain almost exclusively sensitive data.
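The thresholding step reduces to a small check over per-cell redaction flags. The 0.8 default mirrors the example above; the other names are illustrative:

```python
from typing import List

def apply_column_threshold(column: List[str], redacted: List[bool],
                           threshold: float = 0.8,
                           placeholder: str = "[REDACTED]") -> List[str]:
    """Wipe the whole column when the redaction ratio crosses the threshold."""
    ratio = sum(redacted) / len(redacted) if redacted else 0.0
    if ratio >= threshold:
        return [placeholder] * len(column)  # column-wide wipe for consistency
    return column
```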
File Reconstruction
Support for .docx, .pdf, and .xlsx files requires more than just text processing; it requires maintaining file integrity. The challenge isn't just reading the text, but reassembling it without breaking the binary file format.
For PDFs and Word documents, the system acts as a sophisticated text-stream filter. It extracts the raw text layer, passes it through our anonymization pipeline, and then uses libraries like reportlab or python-docx to generate a completely new binary file containing the safe, redacted text.
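The pattern is the same regardless of format: extract, anonymize, regenerate, and never mutate the original binary in place. A format-agnostic sketch with the extractor and writer injected as callables (in production those would wrap pdfplumber/reportlab or python-docx; the function names here are made up for illustration):

```python
from typing import Callable

def rebuild_file(src: str, dst: str,
                 extract: Callable[[str], str],
                 anonymize: Callable[[str], str],
                 write: Callable[[str, str], None]) -> None:
    """Text-stream filter: only plain text ever reaches the anonymizer,
    and the output is always a freshly generated file."""
    write(dst, anonymize(extract(src)))
```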
For Excel, the process is even more granular. We deconstruct the workbook into individual sheets, process each one using our CSV logic, and then reconstruct the workbook with openpyxl. This method preserves the original sheet structure and headers while sanitizing the cell contents.
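Sheet-level processing reduces to mapping the CSV pipeline over each sheet while keeping sheet names and header rows intact. The dict-of-rows workbook shape below is a simplification of what openpyxl actually yields, used here only to show the flow:

```python
from typing import Callable, Dict, List

Rows = List[List[str]]

def reconstruct_workbook(workbook: Dict[str, Rows],
                         anonymize_rows: Callable[[Rows], Rows]) -> Dict[str, Rows]:
    """Run the CSV logic per sheet; sheet names and header rows survive intact."""
    safe: Dict[str, Rows] = {}
    for name, rows in workbook.items():
        header, body = rows[0], rows[1:]
        safe[name] = [header] + anonymize_rows(body)
    return safe
```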
Resilience and Microservices
In production, the anonymizer runs as a separate, isolated service. To ensure high availability, our client architecture implements a robust fallback mechanism.
The client is configured with both a primary URL and a backup URL. When a request is made, the system attempts to reach the primary inference server. If that server is overloaded or unresponsive, the client automatically catches the connection error and seamlessly routes the traffic to the backup cluster. This happens transparently to the user, ensuring that the anonymization service remains available even during partial infrastructure failures.
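Once the transport is abstracted behind a callable, the failover itself is only a few lines. The names below are illustrative, not our client's real API:

```python
from typing import Callable

def call_with_fallback(request: Callable[[str], bytes],
                       primary_url: str, backup_url: str) -> bytes:
    """Try the primary inference server; on connection failure,
    reroute transparently to the backup cluster."""
    try:
        return request(primary_url)
    except (ConnectionError, TimeoutError):
        return request(backup_url)  # transparent to the caller
```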
A Blackbox for Enterprise-Grade Security
We have condensed the entire pipeline into a self-contained "blackbox" component for enterprise-grade security. The blackbox can be installed on premise inside an enterprise server, so data is redacted before any API request leaves for other services, including LLMs. Where on-premise servers are not available, a dedicated cloud account can be provisioned to match the organization's data residency constraints. Running anonymization as an independent, localized pre-filter makes us agnostic to whatever downstream services consume the redacted output, AI models included, and it protects users' private data as a privacy firewall in all circumstances.
Why This Matters
By combining specialized NLP models with file-specific heuristics and robust concurrency, we turn "risky" data into "safe" context. This allows our users to leverage the reasoning capabilities of models like GPT-4 or Claude 3 on their proprietary data without ever exposing the actual PII.
The architecture is designed to be modular. Whether we are processing a raw string, a complex legal PDF, or a financial spreadsheet, the pipeline normalizes the input, sanitizes it, and reconstructs it, ensuring that Questa remains a secure gateway for enterprise AI.
Technical Summary
- Stack: Python, FastAPI, Hugging Face Transformers, PyTorch.
- Models: DistilBERT (NER) + Piiranha (PII).
- Concurrency: ThreadPoolExecutor for structured data.
- Parsers: pdfplumber, openpyxl, python-docx.
