CAPID: Context-Aware PII Detection for Question-Answering Systems
- URL: http://arxiv.org/abs/2602.10074v1
- Date: Tue, 10 Feb 2026 18:41:31 GMT
- Title: CAPID: Context-Aware PII Detection for Question-Answering Systems
- Authors: Mariia Ponomarenko, Sepideh Abedini, Masoumeh Shafieinejad, D. B. Emerson, Shubhankar Mohapatra, Xi He,
- Abstract summary: We propose CAPID, a practical approach that fine-tunes a locally owned small language model (SLM) that filters sensitive information before it is passed to LLMs for QA. Existing datasets do not capture the context-dependent relevance of PII needed to train such a model effectively. Our experiments show that relevance-aware PII detection with a fine-tuned SLM substantially outperforms existing baselines in span, relevance, and type accuracy.
- Score: 2.538582648751871
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting personally identifiable information (PII) in user queries is critical for ensuring privacy in question-answering systems. Current approaches mainly redact all PII, disregarding the fact that some of it may be contextually relevant to the user's question, which degrades response quality. Large language models (LLMs) might be able to help determine which PII are relevant, but their closed-source nature and lack of privacy guarantees make them unsuitable for processing sensitive data. To achieve privacy-preserving PII detection, we propose CAPID, a practical approach that fine-tunes a locally owned small language model (SLM) that filters sensitive information before it is passed to LLMs for QA. However, existing datasets do not capture the context-dependent relevance of PII needed to train such a model effectively. To fill this gap, we propose a synthetic data generation pipeline that leverages LLMs to produce a diverse, domain-rich dataset spanning multiple PII types and relevance levels. Using this dataset, we fine-tune an SLM to detect PII spans, classify their types, and estimate contextual relevance. Our experiments show that relevance-aware PII detection with a fine-tuned SLM substantially outperforms existing baselines in span, relevance, and type accuracy while preserving significantly higher downstream utility under anonymization.
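The pipeline the abstract describes (detect PII spans, classify their types, estimate contextual relevance, then redact only the irrelevant ones before the query reaches the LLM) can be sketched roughly as follows. This is an illustrative mock-up, not the paper's implementation: the regex-based `detect_pii` merely stands in for the fine-tuned SLM, and the PII types, patterns, relevance scores, and threshold are all invented for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class PIISpan:
    start: int
    end: int
    pii_type: str
    relevance: float  # contextual relevance score in [0, 1]


def detect_pii(query: str) -> list[PIISpan]:
    """Stand-in for the fine-tuned SLM: finds PII spans with regexes
    and assigns a fixed, illustrative relevance score per type."""
    patterns = {
        "EMAIL": (r"[\w.+-]+@[\w-]+\.[\w.]+", 0.2),   # assumed low relevance
        "PHONE": (r"\b\d{3}-\d{3}-\d{4}\b", 0.9),     # assumed high relevance
    }
    spans = []
    for pii_type, (pattern, relevance) in patterns.items():
        for m in re.finditer(pattern, query):
            spans.append(PIISpan(m.start(), m.end(), pii_type, relevance))
    return spans


def redact_irrelevant(query: str, threshold: float = 0.5) -> str:
    """Replace spans whose relevance falls below the threshold with a
    type placeholder, keeping contextually relevant PII intact."""
    # Process right-to-left so earlier offsets stay valid after edits.
    spans = sorted(detect_pii(query), key=lambda s: s.start, reverse=True)
    for s in spans:
        if s.relevance < threshold:
            query = query[:s.start] + f"[{s.pii_type}]" + query[s.end:]
    return query
```

In a query like "My email is alice@example.com; is 555-123-4567 a valid number?", the email (assumed irrelevant to the question) would be masked as `[EMAIL]`, while the phone number the question is actually about would be preserved.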
Related papers
- PRvL: Quantifying the Capabilities and Risks of Large Language Models for PII Redaction [0.7421845364041001]
Redaction of Personally Identifiable Information (PII) from unstructured text is critical for ensuring data privacy in regulated domains. Recent advances in Large Language Models (LLMs) offer a promising alternative. We present a comprehensive analysis of LLMs as privacy-preserving PII redaction systems. We release PRvL, an open-source suite of fine-tuned models and evaluation tools for general-purpose PII redaction.
arXiv Detail & Related papers (2025-08-07T16:22:49Z)
- PII-Bench: Evaluating Query-Aware Privacy Protection Systems [10.52362814808073]
We propose a query-unrelated PII masking strategy and introduce PII-Bench, the first comprehensive evaluation framework for assessing privacy protection systems. PII-Bench comprises 2,842 test samples across 55 fine-grained PII categories, featuring diverse scenarios from single-subject descriptions to complex multi-party interactions.
arXiv Detail & Related papers (2025-02-25T14:49:08Z)
- FedDTPT: Federated Discrete and Transferable Prompt Tuning for Black-Box Large Language Models [14.719919025265224]
Fine-tuning large language models (LLMs) with data from specific scenarios poses privacy leakage risks.
We propose for the first time a federated discrete and transferable prompt tuning method, FedDTPT, for black-box large language models.
Our approach achieves higher accuracy, reduced communication overhead, and robustness to non-IID data in a black-box setting.
arXiv Detail & Related papers (2024-11-01T19:19:23Z)
- Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Anonymizing text that contains sensitive information is crucial for a wide range of applications. Existing techniques face the emerging challenge of the re-identification ability of large language models. We propose a framework composed of three key components: a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z)
- Query Performance Prediction using Relevance Judgments Generated by Large Language Models [53.97064615557883]
We propose a new query performance prediction (QPP) framework using automatically generated relevance judgments (QPP-GenRE). QPP-GenRE decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to a given query. We predict an item's relevance by using open-source large language models (LLMs).
arXiv Detail & Related papers (2024-04-01T09:33:05Z)
- Enhancing Information Maximization with Distance-Aware Contrastive Learning for Source-Free Cross-Domain Few-Shot Learning [55.715623885418815]
Cross-Domain Few-Shot Learning (CDFSL) methods require access to source-domain data to train a model in the pre-training phase.
Due to increasing concerns about data privacy and the desire to reduce data transmission and training costs, it is necessary to develop a CDFSL solution without accessing source data.
This paper proposes an Enhanced Information Maximization with Distance-Aware Contrastive Learning method to address these challenges.
arXiv Detail & Related papers (2024-03-04T12:10:24Z)
- Learning to Filter Context for Retrieval-Augmented Generation [75.18946584853316]
Generation models are required to generate outputs given partially or entirely irrelevant passages.
FILCO identifies useful context based on lexical and information-theoretic approaches.
It trains context filtering models that can filter retrieved contexts at test time.
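As a rough illustration of the lexical side of such filtering, one could score each retrieved passage by its term overlap with the query and drop passages below a threshold at test time. This is a hypothetical sketch, not FILCO's actual implementation: the overlap measure, the `threshold` value, and the function names are assumptions made for the example.

```python
def lexical_overlap(query: str, passage: str) -> float:
    """Fraction of unique query terms that also appear in the passage
    (a simple stand-in for a lexical filtering score)."""
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    if not q_terms:
        return 0.0
    return len(q_terms & p_terms) / len(q_terms)


def filter_contexts(query: str, passages: list[str],
                    threshold: float = 0.5) -> list[str]:
    """Keep only retrieved passages whose overlap with the query
    clears the threshold, mimicking test-time context filtering."""
    return [p for p in passages if lexical_overlap(query, p) >= threshold]
```

For the query "who wrote the origin of species", a passage that restates most of the query terms would be kept, while an off-topic passage with almost no shared terms would be filtered out before generation.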
arXiv Detail & Related papers (2023-11-14T18:41:54Z)
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
- PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z)
- ProPILE: Probing Privacy Leakage in Large Language Models [38.92840523665835]
Large language models (LLMs) are often trained on vast quantities of web-collected data, which may inadvertently include sensitive personal data.
This paper presents ProPILE, a novel probing tool designed to empower data subjects, or the owners of the PII, with awareness of potential PII leakage.
arXiv Detail & Related papers (2023-07-04T18:53:47Z)
- SEAM: Searching Transferable Mixed-Precision Quantization Policy through Large Margin Regularization [50.04951511146338]
Mixed-precision quantization (MPQ) suffers from the time-consuming process of searching the optimal bit-width allocation for each layer.
This paper proposes a novel method for efficiently searching for effective MPQ policies using a small proxy dataset.
arXiv Detail & Related papers (2023-02-14T05:47:45Z)
- MAPS: A Noise-Robust Progressive Learning Approach for Source-Free Domain Adaptive Keypoint Detection [76.97324120775475]
Cross-domain keypoint detection methods always require accessing the source data during adaptation.
This paper considers source-free domain adaptive keypoint detection, where only the well-trained source model is provided to the target domain.
arXiv Detail & Related papers (2023-02-09T12:06:08Z)
- On Taking Advantage of Opportunistic Meta-knowledge to Reduce Configuration Spaces for Automated Machine Learning [11.670797168818773]
The key research question is whether it is possible and practical to preemptively avoid costly evaluations of poorly performing ML pipelines.
Numerous experiments with the AutoWeka4MCPS package suggest that opportunistic/systematic meta-knowledge can improve ML outcomes.
We observe strong sensitivity to the 'challenge' of a dataset, i.e., whether specificity in choosing a predictor leads to significantly better performance.
arXiv Detail & Related papers (2022-08-08T19:22:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.