Under the Hood: Building a Privacy-First Anonymizer for LLMs
In the era of GenAI, the biggest bottleneck for enterprise adoption isn't model capability—it's privacy. Sending raw financial reports, legal contracts, or customer databases to a public LLM is a non-starter for most compliance teams. At Questa, we solved this by building a dedicated anonymization layer that sits between the user’s data and the LLM. Today, I’m going to walk through the technical implementation of our anonymization engine, specifically how we handle entity detection, concurrency, and complex file reconstruction.

The Core: A Dual-Model Approach
We realized early on that a single Natural Language Processing (NLP) model wasn't enough to catch everything. General Named Entity Recognition (NER) models are great at finding people and locations, but they often miss specific PII patterns like email addresses or ID numbers. To solve this, we implemented a composite pipeline in anonymize/core.py that loads two distinct Hugging Face models:
NER Model: elastic/distilbert-base-uncased-finetuned-conll03-english (optimized for standard entities such as persons, organizations, and locations).
PII Model: iiiorg/piiranha-v1-detect-personal-information (specialized for detecting sensitive personal data).
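To make this concrete, here is a minimal sketch of how two token-classification pipelines can be loaded and combined with the Hugging Face Transformers API. The model names are the ones above; the detect_entities wrapper and the aggregation settings are illustrative assumptions, not the exact anonymize/core.py code.

```python
# Minimal sketch of the dual-model setup. The composite wrapper below is an
# illustrative assumption; only the two model names come from the post.
from transformers import pipeline

NER_MODEL = "elastic/distilbert-base-uncased-finetuned-conll03-english"
PII_MODEL = "iiiorg/piiranha-v1-detect-personal-information"

# "simple" aggregation groups word-piece tokens back into whole entity spans,
# so every result carries start/end character offsets we can merge later.
ner_pipe = pipeline("token-classification", model=NER_MODEL, aggregation_strategy="simple")
pii_pipe = pipeline("token-classification", model=PII_MODEL, aggregation_strategy="simple")

def detect_entities(text: str) -> list[dict]:
    """Run both models and return a single combined list of entity dicts."""
    entities = ner_pipe(text) + pii_pipe(text)
    # Each dict carries keys such as: entity_group, score, word, start, end.
    return sorted(entities, key=lambda e: e["start"])
```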
The Merge Logic
Running two models introduces a new problem: overlapping entities. If Model A says characters 10-15 are a [NAME] and Model B says characters 12-20 are an [EMAIL], naïve replacement will break the string. We wrote a custom merge_entities algorithm in anonymize/text.py. It sorts entities by their start position and resolves conflicts by prioritizing the longest span and ensuring no "dead space" exists between merged entities.
This ensures that when we rewrite the text, we get clean, non-overlapping tokens like [PER] or [LOC] rather than corrupted output.
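The following is a simplified sketch of that idea (sort by start position, keep the longest span on overlap). The real merge_entities in anonymize/text.py handles more cases; the redact helper shown here is an illustrative assumption.

```python
# Simplified sketch of the overlap-resolution idea behind merge_entities:
# sort by start position, and on a conflict keep whichever span is longer.
def merge_entities(entities: list[dict]) -> list[dict]:
    # Sort by start; on ties, put the longest span first so it wins.
    ordered = sorted(entities, key=lambda e: (e["start"], -(e["end"] - e["start"])))
    merged: list[dict] = []
    for ent in ordered:
        if merged and ent["start"] < merged[-1]["end"]:
            # Overlaps the previously kept entity: keep the longer span.
            prev = merged[-1]
            if ent["end"] - ent["start"] > prev["end"] - prev["start"]:
                merged[-1] = ent
            continue
        merged.append(ent)
    return merged

def redact(text: str, entities: list[dict]) -> str:
    """Replace each merged span with its category token, e.g. [PER] or [LOC]."""
    out, cursor = [], 0
    for ent in merge_entities(entities):
        out.append(text[cursor:ent["start"]])
        out.append(f"[{ent['entity_group']}]")
        cursor = ent["end"]
    out.append(text[cursor:])
    return "".join(out)
```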
Handling Structured Data: The CSV Challenge
Anonymizing a PDF is hard, but anonymizing a CSV or Excel file is deceptively difficult. You cannot simply pass a CSV string to an LLM or an NLP model; the lack of sentence structure confuses context-aware models. We adopted a hybrid approach in anonymize/csv.py that combines heuristic rules with NLP, processed via multithreading for performance. It breaks down into three steps (see the sketch after this list):
1. Multithreaded Row Processing
2. Smart Heuristics
3. Column-Level Thresholding
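The sketch below shows one way these three steps could fit together. Only the ThreadPoolExecutor-per-row design and the heuristics-plus-NLP split come directly from the description above; the regex patterns, the 30% threshold, the reading of column-level thresholding as "mask a whole column once enough of its cells are flagged", and the helper names are illustrative assumptions rather than the real anonymize/csv.py code.

```python
# Condensed sketch of the hybrid CSV flow: regex heuristics first, the NLP
# pipeline as a fallback, rows fanned out across a ThreadPoolExecutor, and a
# column-level threshold applied at the end. Patterns and the 30% threshold
# are illustrative assumptions.
import re
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

def anonymize_cell(value: str) -> tuple[str, bool]:
    """Return (anonymized_value, was_pii). Heuristics first, models as fallback."""
    if EMAIL_RE.fullmatch(value):
        return "[EMAIL]", True
    if PHONE_RE.fullmatch(value):
        return "[PHONE]", True
    ents = merge_entities(detect_entities(value))  # helpers from the sketches above
    if ents:
        return redact(value, ents), True
    return value, False

def anonymize_row(row: pd.Series) -> tuple[list[str], list[bool]]:
    results = [anonymize_cell(str(v)) for v in row]
    values, flags = zip(*results)
    return list(values), list(flags)

def anonymize_frame(df: pd.DataFrame, column_threshold: float = 0.3) -> pd.DataFrame:
    # 1. Multithreaded row processing: each row is an independent task.
    with ThreadPoolExecutor(max_workers=8) as pool:
        processed = list(pool.map(anonymize_row, (row for _, row in df.iterrows())))
    values = pd.DataFrame([v for v, _ in processed], columns=df.columns)
    flags = pd.DataFrame([f for _, f in processed], columns=df.columns)
    # 3. Column-level thresholding (one plausible reading): if enough cells in
    #    a column look like PII, treat the whole column as sensitive.
    for col in df.columns:
        if flags[col].mean() >= column_threshold:
            values[col] = "[REDACTED]"
    return values
```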
File Reconstruction
We support .docx, .pdf, and .xlsx. The challenge here isn't just reading the text, but putting it back together without breaking the file format.
For PDFs, we run a sequence of steps: pdfplumber extracts the text, our anonymization pipeline processes it, and reportlab generates a brand-new PDF stream containing the safe, categorized text.
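Here is a minimal sketch of that round trip, reusing the detect/merge/redact helpers from earlier. Layout, fonts, and table structure are deliberately out of scope, and the function name is an assumption.

```python
# Minimal sketch of the PDF round trip: pdfplumber for extraction, the text
# pipeline for anonymization, reportlab to emit a fresh PDF stream.
import io
from xml.sax.saxutils import escape

import pdfplumber
from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import Paragraph, SimpleDocTemplate

def anonymize_pdf(src_bytes: bytes) -> bytes:
    styles = getSampleStyleSheet()
    flowables = []
    with pdfplumber.open(io.BytesIO(src_bytes)) as pdf:
        for page in pdf.pages:
            raw = page.extract_text() or ""
            safe = redact(raw, merge_entities(detect_entities(raw)))  # from above
            for line in safe.splitlines():
                # Paragraph expects XML-safe text, so escape special characters.
                flowables.append(Paragraph(escape(line), styles["Normal"]))
    out = io.BytesIO()
    SimpleDocTemplate(out, pagesize=A4).build(flowables)
    return out.getvalue()
```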
For Excel, we deconstruct the workbook into DataFrames, process them using the CSV logic mentioned above, and then reconstruct the workbook using openpyxl, preserving the sheet structure.
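In sketch form, assuming the anonymize_frame helper from the CSV section, that round trip looks roughly like this; cell formatting and formulas are not preserved in this simplified version.

```python
# Sketch of the Excel round trip: deconstruct the workbook into per-sheet
# DataFrames, reuse the CSV logic, and rebuild the workbook via openpyxl
# (through pandas' ExcelWriter). Formatting preservation is out of scope.
import io

import pandas as pd

def anonymize_xlsx(src_bytes: bytes) -> bytes:
    # sheet_name=None loads every sheet into a {name: DataFrame} dict.
    sheets = pd.read_excel(io.BytesIO(src_bytes), sheet_name=None)
    out = io.BytesIO()
    with pd.ExcelWriter(out, engine="openpyxl") as writer:
        for name, df in sheets.items():
            # anonymize_frame is the helper from the CSV sketch above.
            anonymize_frame(df).to_excel(writer, sheet_name=name, index=False)
    return out.getvalue()
```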
Why This Matters
By combining specialized NLP models with file-specific heuristics and robust concurrency, we turn "risky" data into "safe" context. This allows our users to leverage the reasoning capabilities of models like GPT-5 or Claude 4.5 on their proprietary data without ever exposing the actual PII.
The code is designed to be modular. Whether we are processing a raw string, a complex legal PDF, or a financial spreadsheet, the pipeline normalizes the input, sanitizes it, and reconstructs it, ensuring that Questa AI remains a secure gateway for enterprise AI.
Technical Summary
Stack: Python, FastAPI, Hugging Face Transformers, PyTorch.
Models: DistilBERT (NER) + Piiranha (PII).
Concurrency: ThreadPoolExecutor for structured data.
Parsers: pdfplumber, openpyxl, python-docx.