The Text Anonymization Benchmark (TAB): A Dedicated Corpus and
Evaluation Framework for Text Anonymization
- URL: http://arxiv.org/abs/2202.00443v1
- Date: Tue, 25 Jan 2022 14:34:42 GMT
- Title: The Text Anonymization Benchmark (TAB): A Dedicated Corpus and
Evaluation Framework for Text Anonymization
- Authors: Ildikó Pilán, Pierre Lison, Lilja Øvrelid, Anthi Papadopoulou, David Sánchez and Montserrat Batet
- Abstract summary: We present a novel benchmark and associated evaluation metrics for assessing the performance of text anonymization methods.
Text anonymization, defined as the task of editing a text document to prevent the disclosure of personal information, currently suffers from a shortage of privacy-oriented annotated text resources.
This paper presents TAB (Text Anonymization Benchmark), a new, open-source annotated corpus developed to address this shortage.
- Score: 2.9849405664643585
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a novel benchmark and associated evaluation metrics for assessing
the performance of text anonymization methods. Text anonymization, defined as
the task of editing a text document to prevent the disclosure of personal
information, currently suffers from a shortage of privacy-oriented annotated
text resources, making it difficult to properly evaluate the level of privacy
protection offered by various anonymization methods. This paper presents TAB
(Text Anonymization Benchmark), a new, open-source annotated corpus developed
to address this shortage. The corpus comprises 1,268 English-language court
cases from the European Court of Human Rights (ECHR) enriched with
comprehensive annotations about the personal information appearing in each
document, including their semantic category, identifier type, confidential
attributes, and co-reference relations. Compared to previous work, the TAB
corpus is designed to go beyond traditional de-identification (which is limited
to the detection of predefined semantic categories), and explicitly marks which
text spans ought to be masked in order to conceal the identity of the person to
be protected. Along with presenting the corpus and its annotation layers, we
also propose a set of evaluation metrics that are specifically tailored towards
measuring the performance of text anonymization, both in terms of privacy
protection and utility preservation. We illustrate the use of the benchmark and
the proposed metrics by assessing the empirical performance of several baseline
text anonymization models. The full corpus along with its privacy-oriented
annotation guidelines, evaluation scripts and baseline models are available on:
https://github.com/NorskRegnesentral/text-anonymisation-benchmark
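As an illustration of how the annotation layers and metrics fit together, the sketch below shows one plausible Python record for a TAB-style annotated span and a simplified span-level privacy recall. The field names and the `privacy_recall` helper are assumptions for illustration only; the benchmark's actual formats and metrics are defined by the evaluation scripts in the repository above.

```python
from dataclasses import dataclass

@dataclass
class AnnotatedSpan:
    """One annotated text span, loosely following the TAB layers:
    semantic category, identifier type, masking decision, co-reference."""
    start: int            # character offset (inclusive)
    end: int              # character offset (exclusive)
    category: str         # e.g. "PERSON", "DATETIME", "LOC"
    identifier_type: str  # e.g. "DIRECT" or "QUASI"
    need_to_mask: bool    # True if masking is required to conceal identity
    coref_id: int         # entity id shared by co-referring mentions

def privacy_recall(gold: list[AnnotatedSpan],
                   predicted_masks: list[tuple[int, int]]) -> float:
    """Fraction of gold spans that must be masked and are fully covered by
    at least one predicted mask (a simplified proxy for privacy protection)."""
    must_mask = [s for s in gold if s.need_to_mask]
    if not must_mask:
        return 1.0
    covered = sum(
        any(m_start <= s.start and s.end <= m_end
            for m_start, m_end in predicted_masks)
        for s in must_mask
    )
    return covered / len(must_mask)

# Tiny usage example with made-up offsets.
gold = [AnnotatedSpan(0, 10, "PERSON", "DIRECT", True, 1),
        AnnotatedSpan(25, 35, "DATETIME", "QUASI", True, 2)]
print(privacy_recall(gold, [(0, 12)]))  # 0.5: only the first span is masked
```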
Related papers
- IncogniText: Privacy-enhancing Conditional Text Anonymization via LLM-based Private Attribute Randomization [8.483679748399037]
We propose IncogniText, a technique that anonymizes the text to mislead a potential adversary into predicting a wrong private attribute value.
Our empirical evaluation shows a reduction of private attribute leakage by more than 90%.
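As a rough sketch of the attribute-randomization idea, the Python below rewrites a text toward a decoy attribute value until a stand-in adversary stops predicting the true one; `rewrite` and `predict_attribute` are hypothetical stubs, not IncogniText's actual components.

```python
import random

def rewrite(text: str, target_value: str) -> str:
    """Hypothetical LLM-based paraphraser: rewrite `text` so that it hints
    at the decoy attribute `target_value`. Stubbed here for illustration."""
    return f"{text} (rewritten to suggest: {target_value})"

def predict_attribute(text: str, candidates: list[str]) -> str:
    """Hypothetical adversary model inferring a private attribute."""
    return random.choice(candidates)  # stub

def anonymize(text: str, true_value: str, candidates: list[str],
              max_tries: int = 3) -> str:
    """Pick a decoy attribute value and rewrite until the adversary no
    longer predicts the true value (the core idea, heavily simplified)."""
    decoys = [c for c in candidates if c != true_value]
    out = text
    for _ in range(max_tries):
        out = rewrite(out, random.choice(decoys))
        if predict_attribute(out, candidates) != true_value:
            break
    return out

print(anonymize("I grew up near the fjords.", "Norway",
                ["Norway", "Sweden", "Denmark"]))
```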
arXiv Detail & Related papers (2024-07-03T09:49:03Z)
- PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-text generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
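A minimal sketch of the scoring step, assuming each proxy-question comes with a pre-annotated answer; the evaluator is passed in as a function and is stubbed here, not ProxyQA's actual interface.

```python
def proxy_qa_score(generated_text: str,
                   proxy_questions: list[tuple[str, str]],
                   answer_fn) -> float:
    """Score long-form output by how many proxy-questions an evaluator
    answers correctly when reading only the generated text."""
    if not proxy_questions:
        return 0.0
    correct = 0
    for question, gold_answer in proxy_questions:
        predicted = answer_fn(generated_text, question)  # e.g. an LLM evaluator
        correct += predicted.strip().lower() == gold_answer.strip().lower()
    return correct / len(proxy_questions)

# Toy evaluator that "reads" the text via substring lookup.
toy_answer_fn = lambda text, q: ("1268" if "1268" in text.replace(",", "")
                                 else "unknown")
print(proxy_qa_score("The corpus has 1,268 ECHR cases.",
                     [("How many cases does TAB contain?", "1268")],
                     toy_answer_fn))  # 1.0
```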
arXiv Detail & Related papers (2024-01-26T18:12:25Z)
- Neural Text Sanitization with Privacy Risk Indicators: An Empirical Analysis [2.9311414545087366]
We consider a two-step approach to text sanitization and provide a detailed analysis of its empirical performance.
The text sanitization process starts with a privacy-oriented entity recognizer that seeks to determine the text spans expressing identifiable personal information.
We present five distinct indicators of the re-identification risk, respectively based on language model probabilities, text span classification, sequence labelling, perturbations, and web search.
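The two-step design can be sketched as a pipeline: detect candidate spans, then mask only those whose averaged risk score exceeds a threshold. The recognizer and the single indicator below are stubs; the paper's five indicators would each implement the same span-scoring interface.

```python
from typing import Callable

Span = tuple[int, int]  # character offsets (start, end)

def sanitize(text: str,
             recognize: Callable[[str], list[Span]],
             risk_indicators: list[Callable[[str, Span], float]],
             threshold: float = 0.5) -> str:
    """Step 1: privacy-oriented entity recognition; step 2: mask spans
    whose average risk across all indicators exceeds `threshold`."""
    masked = text
    # Process right-to-left so earlier offsets stay valid after replacement.
    for start, end in sorted(recognize(text), reverse=True):
        risk = sum(ind(text, (start, end))
                   for ind in risk_indicators) / len(risk_indicators)
        if risk > threshold:
            masked = masked[:start] + "***" + masked[end:]
    return masked

# Toy recognizer and a single constant-risk indicator, for illustration only.
toy_recognizer = lambda t: [(0, 5)]   # pretend "Alice" was detected
toy_indicator = lambda t, span: 0.9   # pretend high re-identification risk
print(sanitize("Alice lives in Oslo.", toy_recognizer, [toy_indicator]))
# -> "*** lives in Oslo."
```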
arXiv Detail & Related papers (2023-10-22T14:17:27Z)
- Unsupervised Text Deidentification [101.2219634341714]
We propose an unsupervised deidentification method that masks words that leak personally-identifying information.
Motivated by K-anonymity based privacy, we generate redactions that ensure a minimum reidentification rank.
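One way to picture the minimum-rank objective is greedy masking: keep masking the word that most raises the true individual's rank under a reidentification model until the rank reaches k. The `rank_fn` below is a toy stand-in for such a model.

```python
def deidentify(words: list[str], rank_fn, k: int = 10,
               mask: str = "[MASK]") -> list[str]:
    """Greedily mask words until the true person's reidentification rank
    (1 = most likely candidate) is at least k, a K-anonymity-style target."""
    words = list(words)
    while rank_fn(words) < k:
        best_i, best_rank = None, rank_fn(words)
        for i, w in enumerate(words):
            if w == mask:
                continue
            trial = words[:i] + [mask] + words[i + 1:]
            r = rank_fn(trial)
            if r > best_rank:
                best_i, best_rank = i, r
        if best_i is None:  # no single additional mask helps further
            break
        words[best_i] = mask
    return words

# Toy rank function: the more masks, the higher (safer) the rank.
toy_rank = lambda ws: 1 + 5 * ws.count("[MASK]")
print(deidentify("Mayor Jane Doe of Oslo spoke".split(), toy_rank, k=10))
```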
arXiv Detail & Related papers (2022-10-20T18:54:39Z)
- PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, register, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
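A hedged sketch of contrastive authorship training with an InfoNCE-style loss, where two documents by the same author form a positive pair and the other authors in the batch act as negatives; the random vectors stand in for encoder outputs, not PART's transformer.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over a batch: row i of `anchor` and row i of `positive`
    are texts by the same author; all other rows act as negatives."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.T / temperature      # cosine similarities
    targets = torch.arange(a.size(0))   # matching author on the diagonal
    return F.cross_entropy(logits, targets)

# Toy "style embeddings" for 4 authors, two documents each.
torch.manual_seed(0)
anchor, positive = torch.randn(4, 32), torch.randn(4, 32)
print(info_nce(anchor, positive).item())
```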
arXiv Detail & Related papers (2022-09-30T11:08:39Z)
- Transparent Human Evaluation for Image Captioning [70.03979566548823]
We develop a rubric-based human evaluation protocol for image captioning models.
We show that human-generated captions are of substantially higher quality than machine-generated ones.
We hope that this work will promote a more transparent evaluation protocol for image captioning.
arXiv Detail & Related papers (2021-11-17T07:09:59Z)
- Protecting Anonymous Speech: A Generative Adversarial Network Methodology for Removing Stylistic Indicators in Text [2.9005223064604078]
We develop a new approach to authorship anonymization by constructing a generative adversarial network.
Our fully automatic method achieves comparable results to other methods in terms of content preservation and fluency.
Our approach is able to generalize well to an open-set context and anonymize sentences from authors it has not encountered before.
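In outline, the adversarial setup pairs a paraphrasing generator against an authorship discriminator, rewarding the generator for preserving content while lowering the discriminator's confidence in the true author. The sketch below shows only this loss bookkeeping with placeholder linear modules, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for a seq2seq paraphraser and an
# authorship classifier; real models would operate on token sequences.
generator = nn.Linear(64, 64)      # "paraphrases" a sentence embedding
discriminator = nn.Linear(64, 10)  # predicts one of 10 known authors

def generator_loss(x: torch.Tensor, author: torch.Tensor,
                   alpha: float = 0.5) -> torch.Tensor:
    """Content preservation (stay close to the input) plus an adversarial
    term that pushes down the discriminator's true-author probability."""
    y = generator(x)
    content = nn.functional.mse_loss(y, x)
    log_probs = nn.functional.log_softmax(discriminator(y), dim=-1)
    # Fool the discriminator: minimize the log-probability of the true author.
    adversarial = log_probs[torch.arange(x.size(0)), author].mean()
    return content + alpha * adversarial

x = torch.randn(8, 64)                # toy sentence embeddings
authors = torch.randint(0, 10, (8,))  # toy author labels
print(generator_loss(x, authors).item())
```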
arXiv Detail & Related papers (2021-10-18T17:45:56Z)
- No Intruder, no Validity: Evaluation Criteria for Privacy-Preserving Text Anonymization [0.48733623015338234]
We argue that researchers and practitioners developing automated text anonymization systems should carefully assess whether their evaluation methods truly reflect the system's ability to protect individuals from being re-identified.
We propose TILD, a set of evaluation criteria that comprises an anonymization method's technical performance, the information loss resulting from its anonymization, and the human ability to de-anonymize redacted documents.
arXiv Detail & Related papers (2021-03-16T18:18:29Z)
- Global Attention for Name Tagging [56.62059996864408]
We present a new framework to improve name tagging by utilizing local, document-level, and corpus-level contextual information.
We propose a model that learns to incorporate document-level and corpus-level contextual information alongside local contextual information via global attentions.
Experiments on benchmark datasets show the effectiveness of our approach.
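A generic illustration of the fusion step: each token's local feature is augmented with an attention-weighted summary of document- or corpus-level context vectors before tagging. This is standard attention pooling, not the paper's exact model.

```python
import torch
import torch.nn.functional as F

def global_attention_features(local: torch.Tensor,
                              context: torch.Tensor) -> torch.Tensor:
    """Augment each token's local feature (seq_len x d) with an attention-
    weighted summary of document/corpus-level context vectors (n_ctx x d)."""
    scores = local @ context.T / context.size(-1) ** 0.5  # scaled dot product
    attended = F.softmax(scores, dim=-1) @ context        # (seq_len x d)
    return torch.cat([local, attended], dim=-1)           # feed to the tagger

tokens = torch.randn(12, 128)   # toy local (e.g. BiLSTM) token features
doc_ctx = torch.randn(40, 128)  # toy document-level context vectors
print(global_attention_features(tokens, doc_ctx).shape)  # torch.Size([12, 256])
```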
arXiv Detail & Related papers (2020-10-19T07:27:15Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset with the same writing style of the reference.
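To make the task setup concrete, here is one plausible input/output shape, with invented record fields and texts:

```python
# Hypothetical task instance: the record fields and texts are invented.
source_records = [
    {"player": "Smith", "points": 28, "rebounds": 7},
    {"player": "Jones", "points": 15, "rebounds": 11},
]
reference_text = (
    "Brown poured in 31 points and grabbed 9 boards, while "
    "Lee added 12 points and 10 rebounds."
)
# Desired output: the source records' content in the reference's style, e.g.
# "Smith poured in 28 points and grabbed 7 boards, while
#  Jones added 15 points and 11 rebounds."
```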
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.