The Core: A Dual-Model Approach
We realized early on that a single Natural Language Processing (NLP) model wasn't enough to catch everything. General Named Entity Recognition (NER) models are great at finding people and locations, but they often miss specific PII patterns like email addresses or ID numbers.
To solve this, we implemented a composite pipeline in anonymize/core.py that loads two distinct Hugging Face models:
NER Model: elastic/distilbert-base-uncased-finetuned-conll03-english (Optimized for standard entities like Persons, Organizations, Locations).
PII Model: iiiorg/piiranha-v1-detect-personal-information (Specialized for sensitive personal data detection).
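As a rough illustration of the dual-model setup, here is a minimal sketch that loads both checkpoints with the Transformers token-classification pipeline and pools their output. The helper names load_detectors and detect_entities, and the normalized dictionary shape, are assumptions made for this example; they are not taken from anonymize/core.py.

```python
from typing import Callable

NER_MODEL = "elastic/distilbert-base-uncased-finetuned-conll03-english"
PII_MODEL = "iiiorg/piiranha-v1-detect-personal-information"

# A detector is anything that maps text to a list of pipeline-style dicts.
Detector = Callable[[str], list]


def load_detectors() -> list:
    """Load both models (hypothetical helper; downloads weights on first use)."""
    from transformers import pipeline  # imported lazily: heavy dependency

    return [
        # aggregation_strategy="simple" groups word pieces into whole entities
        pipeline("token-classification", model=name, aggregation_strategy="simple")
        for name in (NER_MODEL, PII_MODEL)
    ]


def detect_entities(text: str, detectors: list) -> list:
    """Run every detector and pool the results, sorted by start offset."""
    found = []
    for detect in detectors:
        for ent in detect(text):
            found.append(
                {
                    "start": ent["start"],
                    "end": ent["end"],
                    "label": ent["entity_group"],
                    "score": float(ent["score"]),
                }
            )
    return sorted(found, key=lambda e: e["start"])
```

Calling detect_entities(text, load_detectors()) returns one pooled list, which may still contain overlapping spans from the two models.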
The Merge Logic
Running two models introduces a new problem: overlapping entities. If Model A says characters 10-15 are a [NAME] and Model B says characters 12-20 are an [EMAIL], naïve replacement will break the string.
We wrote a custom merge_entities algorithm in anonymize/text.py. It sorts entities by their start position and resolves conflicts by prioritizing the longest span and ensuring no "dead space" exists between merged entities.
This ensures that when we rebuild the text, we get clean, non-overlapping tokens like [PER] or [LOC], rather than corrupted text.
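To make the idea concrete, here is a minimal sketch of one way such a merge could work. It is an interpretation, not the actual merge_entities from anonymize/text.py: overlapping spans are fused into their union (one reading of "no dead space") and the fused span takes the label of the longest original span. The Entity type and the redact helper are hypothetical.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def merge_entities(entities: list) -> list:
    """Fuse overlapping spans; the longest original span decides the label."""
    ordered = sorted(entities, key=lambda e: (e.start, -(e.end - e.start)))
    merged = []
    best_len = []  # length of the span that currently owns each label
    for ent in ordered:
        length = ent.end - ent.start
        if merged and ent.start < merged[-1].end:
            prev = merged[-1]
            label = prev.label
            if length > best_len[-1]:
                label, best_len[-1] = ent.label, length
            # Cover the union so no part of either span leaks through.
            merged[-1] = Entity(prev.start, max(prev.end, ent.end), label)
        else:
            merged.append(ent)
            best_len.append(length)
    return merged


def redact(text: str, entities: list) -> str:
    """Replace merged spans right to left so earlier offsets stay valid."""
    for ent in sorted(merge_entities(entities), key=lambda e: e.start, reverse=True):
        text = text[: ent.start] + f"[{ent.label}]" + text[ent.end :]
    return text
```

With the example above, a [NAME] at 10-15 and an [EMAIL] at 12-20 collapse into a single [EMAIL] span covering 10-20, so characters 10-11 are not left exposed.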
Handling Structured Data: The CSV Challenge
Anonymizing a PDF is hard, but anonymizing a CSV or Excel file is deceptively difficult. You cannot simply pass a CSV string to an LLM or an NLP model; the lack of sentence structure confuses the context-aware models. We adopted a hybrid approach in anonymize/csv.py that combines heuristic rules with NLP, processed via multithreading for performance.
1. Multithreaded Row Processing
NLP inference is CPU-bound and slow, so processing a 10,000-row CSV sequentially is impractically slow. We use Python's ThreadPoolExecutor to redact rows concurrently.

    # From anonymize/csv.py
    with ThreadPoolExecutor(max_workers=MAX_THREADS) as executor:
        # Map each future back to its row index so results can be reordered.
        futures = {executor.submit(redact_row, row): idx for idx, row in enumerate(df_sample)}
        # ... process results as they complete ...
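The result handling elided in that excerpt could look roughly like the sketch below. The redact_row body here is a trivial stand-in, the MAX_THREADS value is an assumption, and rows are plain lists rather than whatever df_sample holds in the real module.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_THREADS = 4  # assumed value; the real setting is not shown in the post


def redact_row(row: list) -> list:
    """Stand-in for the real row redactor: masks any cell containing '@'."""
    return ["[EMAIL]" if "@" in cell else cell for cell in row]


def redact_rows(rows: list, max_workers: int = MAX_THREADS) -> list:
    """Redact rows concurrently and return them in their original order."""
    results = [None] * len(rows)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(redact_row, row): idx for idx, row in enumerate(rows)}
        for future in as_completed(futures):
            # Futures finish in arbitrary order; the stored index restores it.
            results[futures[future]] = future.result()
    return results
```

Keeping the index alongside each future is what lets rows complete out of order without scrambling the output file.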
2. Smart Heuristics
For structured data, we don't rely solely on the AI models. We implemented specific logic for common fields to save compute time and increase accuracy:
Numerics: Ignored automatically to preserve financial context, a critical requirement for organizations using AI in financial reporting and compliance workflows.
Emails: Detected via regex patterns (@ symbols) and strictly redacted.
Headers: We check column headers. If a column is named "Full Name," we force a [PER] redaction even if the model is unsure.
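These three rules can be sketched as a single pre-check that runs before any model call. Everything below is illustrative: the function name, the regex, the header table, the order of the checks, and the choice to replace the whole cell are assumptions rather than the logic in anonymize/csv.py.

```python
import re
from typing import Optional

EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

# Hypothetical header-to-placeholder table; extend as needed.
HEADER_LABELS = {"full name": "[PER]"}


def is_numeric(value: str) -> bool:
    """True for plain numbers, allowing thousands separators."""
    try:
        float(value.replace(",", ""))
        return True
    except ValueError:
        return False


def heuristic_redact(header: str, value: str) -> Optional[str]:
    """Return the final cell text if a rule applies, else None.

    None means "no rule matched, fall through to the NLP models".
    """
    text = str(value).strip()
    if not text or is_numeric(text):
        return text  # numerics pass through untouched
    if EMAIL_RE.search(text):
        return "[EMAIL]"  # strict: the whole cell is replaced
    forced = HEADER_LABELS.get(header.strip().lower())
    if forced:
        return forced  # the header alone decides, regardless of the model
    return None
```

Only cells that come back as None need to go through inference, which is where the compute savings come from.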
3. Column-Level Thresholding
This is my favorite feature. Once we process a sample of the data, we calculate a "redaction ratio" for each column. If more than 80% of the sampled cells in a column contain PII (e.g., a column of home addresses), we don't bother processing the rest; we wipe the entire column and replace it with the most common placeholder (e.g., [ADDRESS]).
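Here is a minimal sketch of that thresholding step on plain Python lists. The function names, the placeholder regex, and the dict-of-columns layout are assumptions for illustration. Columns under the threshold are returned untouched here, whereas in a full pipeline they would continue through row-level redaction.

```python
import re
from collections import Counter
from typing import Optional

PLACEHOLDER_RE = re.compile(r"\[[A-Z_]+\]")
THRESHOLD = 0.8  # "more than 80%" from the description above


def column_placeholder(redacted_sample: list, threshold: float = THRESHOLD) -> Optional[str]:
    """Return the placeholder to fill the column with, or None to keep going."""
    if not redacted_sample:
        return None
    hits = [PLACEHOLDER_RE.findall(cell) for cell in redacted_sample]
    ratio = sum(1 for h in hits if h) / len(redacted_sample)
    if ratio <= threshold:
        return None
    counts = Counter(tag for h in hits for tag in h)
    return counts.most_common(1)[0][0]


def apply_column_threshold(redacted_sample: dict, full_columns: dict,
                           threshold: float = THRESHOLD) -> dict:
    """Wipe every column whose sampled redaction ratio exceeds the threshold."""
    out = {}
    for name, values in full_columns.items():
        placeholder = column_placeholder(redacted_sample.get(name, []), threshold)
        out[name] = [placeholder] * len(values) if placeholder else list(values)
    return out
```

The ratio is computed on the already-redacted sample, so the check costs one regex pass per sampled cell and no additional inference.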
File Reconstruction
We support .docx, .pdf, and .xlsx. The challenge here isn't just reading the text, but putting it back together without breaking the file format.
This capability is particularly valuable for healthcare organizations, financial institutions, and enterprises processing sensitive documents.
For PDFs, we use pdfplumber to extract the text, pass it through our anonymize pipeline, and then use reportlab to generate a brand new PDF stream containing the safe, categorized text.
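A rough sketch of that extract, sanitize, and regenerate loop is shown below. The function names, margins, line height, and wrap width are assumptions, and the anonymize argument stands in for the real pipeline. The output keeps the text and page order only, not the original layout, fonts, or images.

```python
import io
import textwrap


def wrap_page_text(text: str, width: int = 80) -> list:
    """Split extracted text into lines that fit the page, keeping blank lines."""
    lines = []
    for raw in text.splitlines():
        lines.extend(textwrap.wrap(raw, width=width) or [""])
    return lines


def rebuild_pdf(src_path: str, anonymize) -> bytes:
    """Read a PDF, anonymize each page's text, and return a new PDF as bytes."""
    import pdfplumber  # third-party, imported lazily
    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas

    margin, line_height = 50, 14  # assumed layout values, in points
    _, page_height = letter
    buffer = io.BytesIO()
    out = canvas.Canvas(buffer, pagesize=letter)

    with pdfplumber.open(src_path) as pdf:
        for page in pdf.pages:
            safe_text = anonymize(page.extract_text() or "")
            y = page_height - margin
            for line in wrap_page_text(safe_text):
                if y < margin:  # ran out of room: continue on a fresh page
                    out.showPage()
                    y = page_height - margin
                out.drawString(margin, y, line)
                y -= line_height
            out.showPage()  # one output page break per source page
    out.save()
    return buffer.getvalue()
```

Writing into an in-memory buffer means the sanitized PDF can be streamed straight back to the caller without the original ever being modified on disk.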
For Excel, we deconstruct the workbook into DataFrames, process them using the CSV logic mentioned above, and then reconstruct the workbook using openpyxl, preserving the sheet structure.
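The workbook round trip might look like the sketch below, assuming pandas is used to produce the DataFrames and to write them back through the openpyxl engine. The function names and the redact_table callback are hypothetical. This approach keeps sheet names and order, but not cell formatting or formulas.

```python
def redact_sheets(sheets: dict, redact_table) -> dict:
    """Apply redact_table to every sheet, keeping sheet names and order."""
    return {name: redact_table(table) for name, table in sheets.items()}


def rebuild_workbook(src_path: str, dst_path: str, redact_table) -> None:
    """Read all sheets, redact each one, and write a new workbook."""
    import pandas as pd  # third-party; needs openpyxl installed for .xlsx

    # sheet_name=None loads every sheet into a {name: DataFrame} dict.
    sheets = pd.read_excel(src_path, sheet_name=None)
    safe = redact_sheets(sheets, redact_table)
    with pd.ExcelWriter(dst_path, engine="openpyxl") as writer:
        for name, df in safe.items():
            df.to_excel(writer, sheet_name=name, index=False)
```

Because each sheet is handled as an independent table, the same CSV redaction function can be passed in as redact_table unchanged.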
Why This Matters
By combining specialized NLP models with file-specific heuristics and robust concurrency, we turn "risky" data into "safe" context. This allows organizations to leverage advanced AI models without exposing sensitive personal information, helping support secure AI adoption across regulated industries.
The code is designed to be modular. Whether we are processing a raw string, a complex legal PDF, or a financial spreadsheet, the pipeline normalizes the input, sanitizes it, and reconstructs it, ensuring that Questa AI remains a secure gateway for enterprise AI.
Modern organizations want to benefit from AI without increasing privacy, security, or compliance risks. Data anonymization enables teams across healthcare, financial services, operations, and customer support to safely process documents, spreadsheets, and records before they reach AI models, so they can adopt AI while maintaining control over sensitive information.
Technical Summary
Stack: Python, FastAPI, Hugging Face Transformers, PyTorch.
Models: DistilBERT (NER) + Piiranha (PII).
Concurrency: ThreadPoolExecutor for structured data.
Parsers: pdfplumber, openpyxl, python-docx.