FEB 02, 2026

Under the Hood: Building a Privacy-First Anonymizer for LLMs

In the era of GenAI, the biggest bottleneck for enterprise adoption isn't model capability—it's privacy. Sending raw financial reports, legal contracts, or customer databases to a public LLM is a non-starter for most compliance teams. At Questa, we solved this by building a dedicated anonymization layer that sits between the user’s data and the LLM. Today, I’m going to walk through the technical implementation of our anonymization engine, specifically how we handle entity detection, concurrency, and complex file reconstruction.


The Core: A Dual-Model Approach

We realized early on that a single Natural Language Processing (NLP) model wasn't enough to catch everything. General Named Entity Recognition (NER) models are great at finding people and locations, but they often miss specific PII patterns like email addresses or ID numbers. To solve this, we implemented a composite pipeline in anonymize/core.py that loads two distinct Hugging Face models:

NER Model: elastic/distilbert-base-uncased-finetuned-conll03-english (optimized for standard entities like persons, organizations, and locations).

PII Model: iiiorg/piiranha-v1-detect-personal-information (specialized for sensitive personal data detection).
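Conceptually, the composite pipeline runs both detectors over the same text and pools their spans before any merging happens. A minimal sketch of that pooling step, using hypothetical regex stand-ins in place of the actual Hugging Face pipelines (which are loaded in anonymize/core.py and not reproduced here):

```python
import re

# Stand-in for the NER model: pretend capitalized words are persons.
def ner_detector(text):
    return [{"start": m.start(), "end": m.end(), "label": "PER"}
            for m in re.finditer(r"\b[A-Z][a-z]+\b", text)]

# Stand-in for the PII model: pretend email addresses are the only PII.
def pii_detector(text):
    return [{"start": m.start(), "end": m.end(), "label": "EMAIL"}
            for m in re.finditer(r"\S+@\S+\.\w+", text)]

def detect_entities(text, detectors=(ner_detector, pii_detector)):
    """Run every detector over the same text and pool the resulting spans.

    Overlap resolution is deferred to the merge step described below.
    """
    spans = [span for detect in detectors for span in detect(text)]
    return sorted(spans, key=lambda s: s["start"])

print(detect_entities("Contact Alice at alice@example.com"))
```

The pooled output is deliberately allowed to contain overlaps; keeping detection and conflict resolution separate is what makes it easy to add a third detector later.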

The Merge Logic

Running two models introduces a new problem: overlapping entities. If Model A says characters 10-15 are a [NAME] and Model B says characters 12-20 are an [EMAIL], naïve replacement will break the string. We wrote a custom merge_entities algorithm in anonymize/text.py. It sorts entities by their start position and resolves conflicts by prioritizing the longest span and ensuring no "dead space" exists between merged entities.

This ensures that when we redraw the text, we get clean, non-overlapping tokens like [PER] or [LOC], rather than corrupted text.
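The merge step can be sketched as follows. This is a simplified version: each entity here is a plain {'start', 'end', 'label'} dict, not the richer structures the real merge_entities in anonymize/text.py operates on.

```python
def merge_entities(entities):
    """Sort spans by start position and merge overlaps.

    On conflict, the longer span's label wins and the boundary is
    extended so no dead space is left between merged entities.
    """
    merged = []
    # Sort by start; break ties by preferring the longer span.
    for ent in sorted(entities,
                      key=lambda e: (e["start"], -(e["end"] - e["start"]))):
        if merged and ent["start"] < merged[-1]["end"]:
            prev = merged[-1]
            # Overlap: keep the longer span's label, cover both ranges.
            if ent["end"] - ent["start"] > prev["end"] - prev["start"]:
                prev["label"] = ent["label"]
            prev["end"] = max(prev["end"], ent["end"])
        else:
            merged.append(dict(ent))
    return merged

spans = [{"start": 10, "end": 15, "label": "NAME"},
         {"start": 12, "end": 20, "label": "EMAIL"}]
print(merge_entities(spans))
# [{'start': 10, 'end': 20, 'label': 'EMAIL'}]
```

On the article's example (a NAME at 10-15 overlapping an EMAIL at 12-20), the two spans collapse into a single EMAIL span covering 10-20, so replacement never corrupts the surrounding text.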

Handling Structured Data: The CSV Challenge

Anonymizing a PDF is hard, but anonymizing a CSV or Excel file is deceptively difficult. You cannot simply pass a CSV string to an LLM or an NLP model; the lack of sentence structure confuses the context-aware models. We adopted a hybrid approach in anonymize/csv.py that combines heuristic rules with NLP, processed via multithreading for performance.

1. Multithreaded Row Processing

NLP inference is CPU-bound and slow. Processing a 10,000-row CSV sequentially would take forever, so we utilized Python's ThreadPoolExecutor to redact rows concurrently:

```python
# From anonymize/csv.py
with ThreadPoolExecutor(max_workers=MAX_THREADS) as executor:
    futures = {executor.submit(redact_row, row): idx
               for idx, row in enumerate(df_sample)}
    # ... process results as they complete ...
```
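Fleshed out into a runnable sketch, the pattern looks like this. The regex-based redact_row here is a hypothetical stand-in for the real model-backed one, and MAX_THREADS is an assumed constant:

```python
import re
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_THREADS = 8  # assumed value; the real constant lives in anonymize/csv.py
EMAIL_RE = re.compile(r"\S+@\S+")

def redact_row(row):
    # Stand-in for the real redact_row: mask anything resembling an email.
    return [EMAIL_RE.sub("[EMAIL]", cell) for cell in row]

def redact_rows_concurrently(rows):
    redacted = [None] * len(rows)
    with ThreadPoolExecutor(max_workers=MAX_THREADS) as executor:
        futures = {executor.submit(redact_row, row): idx
                   for idx, row in enumerate(rows)}
        # Collect results as they complete, restoring the original row order.
        for future in as_completed(futures):
            redacted[futures[future]] = future.result()
    return redacted

rows = [["Alice", "alice@example.com"], ["Bob", "bob@corp.io"]]
print(redact_rows_concurrently(rows))
# [['Alice', '[EMAIL]'], ['Bob', '[EMAIL]']]
```

Mapping each future back to its row index is what lets the results arrive out of order without scrambling the output CSV.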

2. Smart Heuristics

For structured data, we don't rely solely on the AI models. We implemented specific logic for common fields to save compute time and increase accuracy:

Numerics: ignored automatically to preserve financial context (a crucial requirement for our financial reporting features).

Emails: detected via regex patterns (@ symbols) and strictly redacted.

Headers: we check column headers. If a column is named "Full Name," we force a [PER] redaction even if the model is unsure.
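These heuristics can be sketched as a single pre-filter that runs before the models ever see a cell. The header keyword table and function names here are illustrative, not the actual anonymize/csv.py internals:

```python
import re

EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
# Hypothetical header-to-placeholder table; the real mapping is larger.
PII_HEADERS = {"full name": "[PER]", "name": "[PER]", "address": "[ADDRESS]"}

def redact_cell(cell, header):
    # 1. Numerics pass through untouched to preserve financial context.
    if cell.replace(".", "", 1).replace("-", "", 1).isdigit():
        return cell
    # 2. Anything that looks like an email is strictly redacted.
    if EMAIL_RE.search(cell):
        return "[EMAIL]"
    # 3. A known PII header forces redaction regardless of model confidence.
    placeholder = PII_HEADERS.get(header.strip().lower())
    if placeholder:
        return placeholder
    # Otherwise, fall through to the NLP models (not shown here).
    return cell

print(redact_cell("1234.56", "Revenue"))   # 1234.56
print(redact_cell("Jane Doe", "Full Name"))  # [PER]
```

Because the cheap checks run first, the expensive model inference only ever sees the cells the heuristics couldn't classify.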

3. Column-Level Thresholding

This is my favorite feature. Once we process a sample of the data, we calculate a "redaction ratio" for each column. If more than 80% of a column contains PII (e.g., a column of home addresses), we don't bother processing the rest: we wipe the entire column and replace it with the most common placeholder (e.g., [ADDRESS]).
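A minimal sketch of the thresholding idea, assuming redacted sample cells carry bracketed placeholders and an 80% cutoff (the function name and signature are illustrative):

```python
from collections import Counter

REDACTION_THRESHOLD = 0.8  # assumed; the real constant lives in anonymize/csv.py

def apply_column_threshold(sample_column, full_column):
    """If a sampled column is mostly PII, wipe the whole column.

    `sample_column` holds already-redacted sample values (placeholders
    like '[ADDRESS]' where PII was found); `full_column` is the
    unprocessed remainder of that column.
    """
    placeholders = [v for v in sample_column
                    if v.startswith("[") and v.endswith("]")]
    ratio = len(placeholders) / len(sample_column)
    if ratio > REDACTION_THRESHOLD:
        # Skip per-cell inference entirely: replace everything with
        # the most common placeholder seen in the sample.
        most_common = Counter(placeholders).most_common(1)[0][0]
        return [most_common] * len(full_column)
    return full_column  # below threshold: keep processing cell by cell
```

For a mostly-PII column this turns thousands of model calls into a single list fill, which is where most of the speedup on address- and name-heavy files comes from.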

File Reconstruction

We support .docx, .pdf, and .xlsx. The challenge here isn't just reading the text, but putting it back together without breaking the file format.

For PDFs, we use pdfplumber to extract the text, pass it through our anonymization pipeline, and then use reportlab to generate a brand-new PDF stream containing the safe, categorized text.

For Excel, we deconstruct the workbook into DataFrames, process them using the CSV logic mentioned above, and then reconstruct the workbook using openpyxl, preserving the sheet structure.
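That Excel round trip can be sketched with pandas and openpyxl. The redact_frame hook is a hypothetical stand-in for the CSV logic above:

```python
import pandas as pd

def anonymize_workbook(src_path, dst_path, redact_frame):
    """Deconstruct an .xlsx into DataFrames, redact each, and rebuild it.

    `redact_frame` (DataFrame -> DataFrame) stands in for the CSV-style
    redaction routine described earlier.
    """
    # sheet_name=None loads every sheet into a {name: DataFrame} dict,
    # preserving the workbook's sheet order.
    sheets = pd.read_excel(src_path, sheet_name=None)
    with pd.ExcelWriter(dst_path, engine="openpyxl") as writer:
        for name, frame in sheets.items():
            # Rebuild with the same sheet names, in the same order.
            redact_frame(frame).to_excel(writer, sheet_name=name, index=False)
```

Round-tripping through DataFrames keeps the sheet structure intact, though cell-level styling is not preserved by this simplified sketch.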

Why This Matters

By combining specialized NLP models with file-specific heuristics and robust concurrency, we turn "risky" data into "safe" context. This allows our users to leverage the reasoning capabilities of models like GPT-5 or Claude 4.5 on their proprietary data without ever exposing the actual PII.

The code is designed to be modular. Whether we are processing a raw string, a complex legal PDF, or a financial spreadsheet, the pipeline normalizes the input, sanitizes it, and reconstructs it, ensuring that Questa AI remains a secure gateway for enterprise AI.

Technical Summary

Stack: Python, FastAPI, Hugging Face Transformers, PyTorch.

Models: DistilBERT (NER) + Piiranha (PII).

Concurrency: ThreadPoolExecutor for structured data.

Parsers: pdfplumber, openpyxl, python-docx.