An Evaluation Study of Hybrid Methods for Multilingual PII Detection
- URL: http://arxiv.org/abs/2510.07551v1
- Date: Wed, 08 Oct 2025 21:03:59 GMT
- Title: An Evaluation Study of Hybrid Methods for Multilingual PII Detection
- Authors: Harshit Rajgarhia, Suryam Gupta, Asif Shaik, Gulipalli Praveen Kumar, Y Santhoshraj, Sanka Nithya Tanvy Nishitha, Abhishek Mukherji
- Abstract summary: We present RECAP, a framework that combines deterministic regular expressions with context-aware large language models (LLMs) for scalable PII detection. Our system outperforms fine-tuned NER models by 82% and zero-shot LLMs by 17% in weighted F1-score. This work offers a scalable and adaptable solution for efficient PII detection in compliance-focused applications.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The detection of Personally Identifiable Information (PII) is critical for privacy compliance but remains challenging in low-resource languages due to linguistic diversity and limited annotated data. We present RECAP, a hybrid framework that combines deterministic regular expressions with context-aware large language models (LLMs) for scalable PII detection across 13 low-resource locales. RECAP's modular design supports over 300 entity types without retraining, using a three-phase refinement pipeline for disambiguation and filtering. Benchmarked with nervaluate, our system outperforms fine-tuned NER models by 82% and zero-shot LLMs by 17% in weighted F1-score. This work offers a scalable and adaptable solution for efficient PII detection in compliance-focused applications.
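The hybrid pipeline the abstract describes (deterministic regexes for high-recall candidate generation, followed by context-aware LLM disambiguation and filtering) can be sketched as follows. This is a minimal illustration, not RECAP's actual implementation: the two patterns, the span format, and the `classify` stub standing in for an LLM call are all assumptions.

```python
import re

# Phase 1: deterministic regexes produce high-recall candidate spans for
# structured PII. Two illustrative patterns; RECAP supports 300+ entity types.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s()-]{7,}\d"),
}

def regex_candidates(text):
    """Collect candidate PII spans from the deterministic patterns."""
    return [{"label": label, "start": m.start(), "end": m.end(), "text": m.group()}
            for label, pattern in PATTERNS.items()
            for m in pattern.finditer(text)]

def detect_pii(text, classify=lambda span, context: True):
    """Phases 2-3: a context-aware filter (here a stub standing in for an
    LLM call) disambiguates candidates and drops false positives."""
    return [s for s in regex_candidates(text) if classify(s, text)]

hits = detect_pii("Contact jane.doe@example.com or call +1 555 123 4567.")
print([(h["label"], h["text"]) for h in hits])
```

In this split, the regex layer never needs retraining when entity types are added, while all contextual judgment is deferred to the `classify` step, which mirrors why the framework can scale across locales.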
Related papers
- Binary Token-Level Classification with DeBERTa for All-Type MWE Identification: A Lightweight Approach with Linguistic Enhancement
We present a comprehensive approach for multiword expression (MWE) identification that combines binary token-level classification, linguistic feature integration, and data augmentation. Our DeBERTa-v3-large model achieves 69.8% F1 on the CoAM dataset, surpassing the best results (Qwen-72B, 57.8% F1) on this dataset by 12 points while using 165x fewer parameters.
arXiv Detail & Related papers (2026-01-27T08:42:54Z)
- Scalable multilingual PII annotation for responsible AI in LLMs
This work introduces a scalable multilingual data curation framework designed for high-quality PII annotation across 13 underrepresented locales. Our phased, human-in-the-loop annotation methodology combines linguistic expertise with rigorous quality assurance, leading to substantial improvements in recall and false positive rates.
arXiv Detail & Related papers (2025-10-03T21:40:31Z)
- Prompt-Based Simplification for Plain Language using Spanish Language Models
This paper describes the participation of HULAT-UC3M in CLEARS 2025 Subtask 1: Adaptation of Text to Plain Language (PL) in Spanish. We explored strategies based on models trained on Spanish texts, including a zero-shot configuration using prompt engineering and a fine-tuned version with Low-Rank Adaptation (LoRA). The final system was selected for its balanced and consistent performance, combining normalization steps, the RigoChat-7B-v2 model, and a dedicated PL-oriented prompt.
arXiv Detail & Related papers (2025-09-21T19:28:37Z)
- SLRTP2025 Sign Language Production Challenge: Methodology, Results, and Future Work
The first Sign Language Production Challenge was held as part of the third SLRTP Workshop at CVPR 2025. The competition's aims are to evaluate architectures that translate from spoken language sentences to a sequence of skeleton poses. This paper presents the challenge design and the winning methodologies.
arXiv Detail & Related papers (2025-08-09T11:57:33Z)
- Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL). Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z)
- Syntactic Control of Language Models by Posterior Inference
Controlling the syntactic structure of text generated by language models is valuable for applications requiring clarity, stylistic consistency, or interpretability. We argue that sampling algorithms based on posterior inference can effectively enforce a target constituency structure during generation. Our approach combines sequential Monte Carlo, which estimates the posterior distribution by sampling from a proposal distribution, with a syntactic tagger that ensures that each generated token aligns with the desired syntactic structure.
arXiv Detail & Related papers (2025-06-08T14:01:34Z)
- The NTNU System at the S&I Challenge 2025 SLA Open Track
We propose a system that integrates W2V with the Phi-4 multimodal large language model (MLLM) through a score fusion strategy. The proposed system achieves a root mean square error (RMSE) of 0.375 on the official test set of the Speak & Improve Challenge 2025. For comparison, the RMSEs of the top-ranked, third-ranked, and official baseline systems are 0.364, 0.384, and 0.444, respectively.
arXiv Detail & Related papers (2025-06-05T15:09:23Z)
- AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models
We propose an automated method for synthesizing open-ended logic puzzles, and use it to develop a bilingual benchmark, AutoLogi. Our approach features program-based verification and controllable difficulty levels, enabling more reliable evaluation that better distinguishes models' reasoning abilities.
arXiv Detail & Related papers (2025-02-24T07:02:31Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, the generative capability of LLMs can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- On the N-gram Approximation of Pre-trained Language Models
Large pre-trained language models (PLMs) have shown remarkable performance across various natural language understanding (NLU) tasks.
This study investigates the potential use of PLMs for language modelling in Automatic Speech Recognition (ASR).
We compare the application of large-scale text sampling and probability conversion for approximating GPT-2 into an n-gram model.
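The sampling-based approximation mentioned above can be illustrated with a minimal sketch: sample text from the PLM at scale, then estimate n-gram conditionals by maximum likelihood over the samples. The toy token stream below stands in for large-scale GPT-2 samples; everything here is an assumption for illustration, not the paper's actual setup.

```python
from collections import Counter, defaultdict

def ngram_counts(tokens, n=3):
    """Count n-grams in a token stream (here: text sampled from a PLM)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_model(counts):
    """Convert n-gram counts into conditional probabilities P(w | history)."""
    by_history = defaultdict(Counter)
    for gram, c in counts.items():
        by_history[gram[:-1]][gram[-1]] += c
    return {h: {w: c / sum(ws.values()) for w, c in ws.items()}
            for h, ws in by_history.items()}

# A toy stand-in for text sampled at scale from GPT-2.
sampled = "the cat sat on the mat the cat sat on the hat".split()
model = mle_model(ngram_counts(sampled, n=3))
print(model[("on", "the")])  # {'mat': 0.5, 'hat': 0.5}
```

The alternative the paper compares against, direct probability conversion, would instead query the PLM's conditional distributions for each history rather than counting samples.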
arXiv Detail & Related papers (2023-06-12T06:42:08Z)
- Text Classification via Large Language Models
We introduce Clue And Reasoning Prompting (CARP) to address complex linguistic phenomena involved in text classification.
Remarkably, CARP yields new SOTA performances on 4 out of 5 widely-used text-classification benchmarks.
More importantly, we find that CARP delivers impressive abilities on low-resource and domain-adaptation setups.
arXiv Detail & Related papers (2023-05-15T06:24:45Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced by 14% relative with joint models trained on small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.