RAT-Bench: A Comprehensive Benchmark for Text Anonymization
- URL: http://arxiv.org/abs/2602.12806v1
- Date: Fri, 13 Feb 2026 10:41:44 GMT
- Title: RAT-Bench: A Comprehensive Benchmark for Text Anonymization
- Authors: Nataša Krčo, Zexi Yao, Matthieu Meeus, Yves-Alexandre de Montjoye
- Abstract summary: We introduce RAT-Bench, a benchmark for text anonymization tools based on re-identification risk. We generate synthetic text containing various direct and indirect identifiers across domains, languages, and difficulty levels. We find that, while capabilities vary widely, even the best tools are far from perfect, in particular when direct identifiers are not written in standard ways.
- Score: 8.64925947747086
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data containing personal information is increasingly used to train, fine-tune, or query Large Language Models (LLMs). Text is typically scrubbed of identifying information prior to use, often with tools such as Microsoft's Presidio or Anthropic's PII purifier. These tools have traditionally been evaluated on their ability to remove specific identifiers (e.g., names), yet their effectiveness at preventing re-identification remains unclear. We introduce RAT-Bench, a comprehensive benchmark for text anonymization tools based on re-identification risk. Using U.S. demographic statistics, we generate synthetic text containing various direct and indirect identifiers across domains, languages, and difficulty levels. We evaluate a range of NER- and LLM-based text anonymization tools and, based on the attributes an LLM-based attacker is able to correctly infer from the anonymized text, we report the risk of re-identification in the U.S. population, while properly accounting for the disparate impact of identifiers. We find that, while capabilities vary widely, even the best tools are far from perfect, in particular when direct identifiers are not written in standard ways and when indirect identifiers enable re-identification. Overall, we find that LLM-based anonymizers, including new iterative anonymizers, provide a better privacy-utility trade-off, albeit at a higher computational cost. Importantly, we also find that they work well across languages. We conclude with recommendations for future anonymization tools; we will release the benchmark and encourage community efforts to expand it, in particular to other geographies.
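RAT-Bench scores anonymizers by the re-identification risk that remains after anonymization. A minimal sketch of one standard way to turn correctly inferred attributes into such a score, 1/k uniqueness against population statistics, appears below; the `counts` table and all values are hypothetical stand-ins, and the paper's actual metric, which accounts for the disparate impact of identifiers, is more involved.

```python
# Toy 1/k uniqueness score: the risk of singling a person out is the inverse
# of the number of people who share the attacker-inferred attributes.
# The `counts` table is a hypothetical stand-in for U.S. demographic data.

def reidentification_risk(inferred: dict, counts: dict) -> float:
    """1 / (size of the population group matching the inferred attributes)."""
    group_size = counts.get(frozenset(inferred.items()), 0)
    return 1.0 / group_size if group_size else 0.0

# Hypothetical group: 3 people share this ZIP / birth-year / sex combination.
counts = {frozenset({("zip", "02139"), ("birth_year", 1987), ("sex", "F")}): 3}
print(reidentification_risk({"zip": "02139", "birth_year": 1987, "sex": "F"}, counts))
# -> 0.333... : a one-in-three chance of picking the right person in the group
```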
Related papers
- Local Language Models for Context-Aware Adaptive Anonymization of Sensitive Text [0.7349727826230863]
This study uses local LLMs to build a reliable, repeatable, and context-aware anonymization process. We introduce a Structured Framework for Adaptive Anonymizer (SFAA) that includes three steps: detection, classification, and adaptive anonymization (an illustrative pipeline in this spirit is sketched after this list).
arXiv Detail & Related papers (2026-01-21T05:59:56Z) - Unleashing the Native Recommendation Potential: LLM-Based Generative Recommendation via Structured Term Identifiers [51.64398574262054]
This paper introduces Term IDs (TIDs), defined as a set of semantically rich and standardized textual keywords, to serve as robust item identifiers. We propose GRLM, a novel framework centered on TIDs, to convert items' metadata into standardized TIDs and utilize Integrative Instruction Fine-tuning to collaboratively optimize term internalization and sequential recommendation.
arXiv Detail & Related papers (2026-01-11T07:53:20Z) - AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text [8.758843436588297]
AgentStealth is a self-reinforcing language model for text anonymization. We show that our method outperforms baselines in both anonymization effectiveness and utility. Our lightweight design supports direct deployment on edge devices, avoiding cloud reliance and communication-based privacy risks.
arXiv Detail & Related papers (2025-06-26T02:48:16Z) - Self-Refining Language Model Anonymizers via Adversarial Distillation [48.280759014096354]
We introduce SElf-refining Anonymization with Language model (SEAL), a novel distillation framework for training small language models (SLMs) to perform effective anonymization without relying on external models at inference time. Experiments on SynthPAI, a dataset of synthetic personal profiles and text comments, demonstrate that SLMs trained with SEAL achieve substantial improvements in anonymization capabilities.
arXiv Detail & Related papers (2025-06-02T08:21:27Z) - AIDBench: A benchmark for evaluating the authorship identification capability of large language models [14.866356328321126]
We focus on a specific privacy risk where large language models (LLMs) may help identify the authorship of anonymous texts.
We present AIDBench, a new benchmark that incorporates several author identification datasets, including emails, blogs, reviews, articles, and research papers.
Our experiments with AIDBench demonstrate that LLMs can correctly guess authorship at rates well above random chance, revealing new privacy risks posed by these powerful models.
arXiv Detail & Related papers (2024-11-20T11:41:08Z) - Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Anonymizing text that contains sensitive information is crucial for a wide range of applications. Existing techniques face the emerging challenge posed by the re-identification capabilities of large language models. We propose a framework composed of three key components: a privacy evaluator, a utility evaluator, and an optimization component (an illustrative evaluate-and-optimize loop is sketched after this list).
arXiv Detail & Related papers (2024-07-16T14:28:56Z) - Multiview Identifiers Enhanced Generative Retrieval [78.38443356800848]
Generative retrieval generates identifier strings of passages as the retrieval target.
We propose a new type of identifier, synthetic identifiers, which are generated from the content of a passage.
Our proposed approach performs best in generative retrieval, demonstrating its effectiveness and robustness.
arXiv Detail & Related papers (2023-05-26T06:50:21Z) - Disambiguation of Company names via Deep Recurrent Networks [101.90357454833845]
We propose a Siamese LSTM Network approach to extract, via supervised learning, an embedding of company name strings.
We analyse how an Active Learning approach to prioritise the samples to be labelled leads to a more efficient overall learning pipeline.
arXiv Detail & Related papers (2023-03-07T15:07:57Z) - Unsupervised Text Deidentification [101.2219634341714]
We propose an unsupervised deidentification method that masks words that leak personally-identifying information.
Motivated by K-anonymity-based privacy, we generate redactions that ensure a minimum reidentification rank (a greedy sketch of this idea follows the list).
arXiv Detail & Related papers (2022-10-20T18:54:39Z)
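The SFAA entry above names a detection, classification, and adaptive anonymization pipeline; a minimal sketch of such a three-stage loop follows. The `local_llm` callable, the prompts, and the bracketed-placeholder strategy are illustrative assumptions, not the paper's actual interface.

```python
# Illustrative three-stage anonymization pipeline in the spirit of SFAA.
# `local_llm` stands in for any locally hosted model exposed as a
# text-in/text-out callable; everything here is a hypothetical sketch.

def detect(text: str, local_llm) -> list[str]:
    """Stage 1: list candidate identifying spans found in the text."""
    return local_llm(f"List each personally identifying span in:\n{text}").splitlines()

def classify(span: str, local_llm) -> str:
    """Stage 2: assign the span an entity type (NAME, DATE, LOCATION, ...)."""
    return local_llm(f"Give one entity-type label for '{span}'.").strip()

def anonymize(text: str, local_llm) -> str:
    """Stage 3: replace each detected span with a type-aware placeholder."""
    for span in detect(text, local_llm):
        if span:
            text = text.replace(span, f"[{classify(span, local_llm)}]")
    return text
```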
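The "Robust Utility-Preserving Text Anonymization" entry describes a privacy evaluator, a utility evaluator, and an optimization component working together. A rough sketch of one such evaluate-and-optimize loop, assuming score-returning evaluator callables and an acceptance threshold that are not from the paper:

```python
# Sketch of an evaluate-and-optimize anonymization loop. The evaluators,
# the rewrite step, and the privacy threshold are assumed for illustration.

def optimize_anonymization(text, rewrite, privacy_eval, utility_eval,
                           privacy_floor=0.9, max_rounds=5):
    """Repeatedly rewrite `text`; among candidates that satisfy the privacy
    evaluator, keep the one the utility evaluator scores highest."""
    best, best_utility = None, float("-inf")
    candidate = text
    for _ in range(max_rounds):
        candidate = rewrite(candidate)      # e.g., an LLM paraphrasing pass
        if privacy_eval(candidate) >= privacy_floor:
            utility = utility_eval(candidate)
            if utility > best_utility:
                best, best_utility = candidate, utility
    return best  # None if no candidate ever met the privacy floor
```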
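The "Unsupervised Text Deidentification" entry masks words until a minimum reidentification rank is reached. A greedy sketch of that idea, assuming a hypothetical `rank_true_profile` scorer where a higher rank means the true author is harder to pick out:

```python
# Greedy masking until the true profile's reidentification rank reaches k.
# `rank_true_profile` is an assumed scoring function, not the paper's code.

MASK = "[MASK]"

def deidentify(tokens: list[str], rank_true_profile, k: int = 10) -> list[str]:
    tokens = list(tokens)
    while rank_true_profile(tokens) < k:    # true profile still in the top k
        # Mask whichever unmasked token pushes the true profile's rank
        # furthest down when replaced.
        best_i = max(
            (i for i, t in enumerate(tokens) if t != MASK),
            key=lambda i: rank_true_profile(tokens[:i] + [MASK] + tokens[i + 1:]),
            default=None,
        )
        if best_i is None:
            break                           # everything masked; stop
        tokens[best_i] = MASK
    return tokens
```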