Related papers: A Comparative Study of Light-weight Language Models for PII Masking and their Deployment for Real Conversational Texts

URL: http://arxiv.org/abs/2512.18608v1
Date: Sun, 21 Dec 2025 05:58:40 GMT
Title: A Comparative Study of Light-weight Language Models for PII Masking and their Deployment for Real Conversational Texts
Authors: Prabigya Acharya, Liza Shrestha,
Abstract summary: Automated masking of Personally Identifiable Information (PII) is critical for privacy-preserving conversational systems. We compare encoder-decoder and decoder-only architectures by fine-tuning T5-small and Mistral-Instruct-v0.3 on English datasets constructed from the AI4Privacy benchmark. Evaluation using entity-level and character-level metrics, type accuracy, and exact match shows that both lightweight models achieve performance comparable to frontier LLMs for PII masking tasks.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automated masking of Personally Identifiable Information (PII) is critical for privacy-preserving conversational systems. While current frontier large language models demonstrate strong PII masking capabilities, concerns about data handling and computational costs motivate exploration of whether lightweight models can achieve comparable performance. We compare encoder-decoder and decoder-only architectures by fine-tuning T5-small and Mistral-Instruct-v0.3 on English datasets constructed from the AI4Privacy benchmark. We create different dataset variants to study label standardization and PII representation, covering 24 standardized PII categories and higher-granularity settings. Evaluation using entity-level and character-level metrics, type accuracy, and exact match shows that both lightweight models achieve performance comparable to frontier LLMs for PII masking tasks. Label normalization consistently improves performance across architectures. Mistral achieves higher F1 and recall with greater robustness across PII types but incurs significantly higher generation latency. T5, while less robust in conversational text, offers more controllable structured outputs and lower inference cost, motivating its use in a real-time Discord bot for real-world PII redaction. Evaluation on live messages reveals performance degradation under informal inputs. These results clarify trade-offs between accuracy, robustness, and computational efficiency, demonstrating that lightweight models can provide effective PII masking while addressing data handling concerns associated with frontier LLMs.

Related papers

CAPID: Context-Aware PII Detection for Question-Answering Systems [2.538582648751871]
We propose CAPID, a practical approach that fine-tunes a locally owned small language model (SLM) that filters sensitive information before it is passed to LLMs for QA. Existing datasets do not capture the context-dependent relevance of PII needed to train such a model effectively. Our experiments show that relevance-aware PII detection with a fine-tuned SLM substantially outperforms existing baselines in span, relevance and type accuracy.
arXiv Detail & Related papers (2026-02-10T18:41:31Z)
Cost-Aware Model Selection for Text Classification: Multi-Objective Trade-offs Between Fine-Tuned Encoders and LLM Prompting in Production [0.0]
Large language models (LLMs) have demonstrated strong capabilities in open-ended reasoning and generative language tasks. For structured text classification problems with fixed label spaces, model selection is often driven by predictive performance alone. We show that fine-tuned encoder-based models from the BERT family achieve competitive, and often superior, classification performance.
arXiv Detail & Related papers (2026-02-06T03:54:28Z)
Information Capacity: Evaluating the Efficiency of Large Language Models via Text Compression [53.39128997308138]
We introduce information capacity, a measure of model efficiency based on text compression performance. Empirical evaluations on mainstream open-source models show that models of varying sizes within a series exhibit consistent information capacity. A distinctive feature of information capacity is that it incorporates tokenizer efficiency, which affects both input and output token counts.
arXiv Detail & Related papers (2025-11-11T10:07:32Z)
Detect, Explain, Escalate: Low-Carbon Dialogue Breakdown Management for LLM-Powered Agents [30.13634341221476]
Large Language Models (LLMs) are transforming numerous applications, but their susceptibility to conversational breakdowns remains a critical challenge undermining user trust. This paper introduces a "Detect, Explain, Escalate" framework to manage dialogue breakdowns in LLM-powered agents, emphasizing low-carbon operation.
arXiv Detail & Related papers (2025-04-26T07:51:05Z)
Discourse Features Enhance Detection of Document-Level Machine-Generated Content [53.41994768824785]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation. Existing MGC detectors often focus solely on surface-level information, overlooking implicit and structural features. We introduce novel methodologies and datasets to overcome these challenges.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
RIRO: Reshaping Inputs, Refining Outputs Unlocking the Potential of Large Language Models in Data-Scarce Contexts [0.0]
Large language models (LLMs) have significantly advanced natural language processing, excelling in areas like text generation, summarization, and question-answering. Despite their capabilities, these models face challenges when fine-tuned on small, domain-specific datasets. We introduce RIRO, a novel two-layer architecture designed to improve performance in data-scarce environments.
arXiv Detail & Related papers (2024-12-15T15:48:37Z)
The Inter-Intra Modal Measure: A Predictive Lens on Fine-Tuning Outcomes in Vision-Language Models [6.7181844004432385]
We introduce the Inter-Intra Modal Measure (IIMM) - a predictive metric that quantifies the relationship between intra-modal image embedding similarity and inter-modal misalignment. Compared to existing transferability measures, the IIMM demonstrates significantly stronger predictive power for accuracy changes post fine-tuning in dual-encoder models. We provide a theoretical bound, proving that changes in IIMM are limited by the Wasserstein distance between pre- and post-fine-tuning embedding.
arXiv Detail & Related papers (2024-07-22T15:35:09Z)
A Context-Aware Approach for Enhancing Data Imputation with Pre-trained Language Models [0.18416014644193068]
CRILM uses pre-trained language models to create contextually relevant descriptors for missing values. Our evaluations demonstrate CRILM's superior performance and robustness across MCAR, MAR, and challenging MNAR scenarios.
arXiv Detail & Related papers (2024-05-28T00:08:29Z)
FLIP: Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
Click-through rate (CTR) prediction serves as a core function module in personalized online services. Traditional ID-based models for CTR prediction take as inputs the one-hot encoded ID features of tabular modality. Pretrained Language Models (PLMs) have given rise to another paradigm, which takes as inputs the sentences of textual modality. We propose to conduct Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models (FLIP) for CTR prediction.
arXiv Detail & Related papers (2023-10-30T11:25:03Z)
Simultaneous Machine Translation with Large Language Models [51.470478122113356]
We investigate the possibility of applying Large Language Models to SimulMT tasks. We conducted experiments using the Llama2-7b-chat model on nine different languages from the MUST-C dataset. The results show that LLM outperforms dedicated MT models in terms of BLEU and LAAL metrics.
arXiv Detail & Related papers (2023-09-13T04:06:47Z)
The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations. We study the impact of labeled data through in-context learning and finetuning. We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing. LLMs are extremely computationally expensive, even at inference time. We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z)
Mixture of Soft Prompts for Controllable Data Generation [21.84489422361048]
Mixture of Soft Prompts (MSP) is proposed as a tool for data augmentation rather than direct prediction. Our method achieves state-of-the-art results on three benchmarks when compared against strong baselines.
arXiv Detail & Related papers (2023-03-02T21:13:56Z)
Pre-trained Language Models for Keyphrase Generation: A Thorough Empirical Study [76.52997424694767]
We present an in-depth empirical study of keyphrase extraction and keyphrase generation using pre-trained language models. We show that PLMs have competitive high-resource performance and state-of-the-art low-resource performance. Further results show that in-domain BERT-like PLMs can be used to build strong and data-efficient keyphrase generation models.
arXiv Detail & Related papers (2022-12-20T13:20:21Z)

This list is automatically generated from the titles and abstracts of the papers on this site.