Neural Text Sanitization with Privacy Risk Indicators: An Empirical Analysis
- URL: http://arxiv.org/abs/2310.14312v1
- Date: Sun, 22 Oct 2023 14:17:27 GMT
- Title: Neural Text Sanitization with Privacy Risk Indicators: An Empirical Analysis
- Authors: Anthi Papadopoulou, Pierre Lison, Mark Anderson, Lilja Øvrelid, Ildikó Pilán
- Abstract summary: We consider a two-step approach to text sanitization and provide a detailed analysis of its empirical performance.
The text sanitization process starts with a privacy-oriented entity recognizer that seeks to determine the text spans expressing identifiable personal information.
We present five distinct indicators of the re-identification risk, respectively based on language model probabilities, text span classification, sequence labelling, perturbations, and web search.
- Score: 2.9311414545087366
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text sanitization is the task of redacting a document to mask all occurrences
of (direct or indirect) personal identifiers, with the goal of concealing the
identity of the individual(s) referred to in it. In this paper, we consider a
two-step approach to text sanitization and provide a detailed analysis of its
empirical performance on two recently published datasets: the Text
Anonymization Benchmark (Pilán et al., 2022) and a collection of Wikipedia
biographies (Papadopoulou et al., 2022). The text sanitization process starts
with a privacy-oriented entity recognizer that seeks to determine the text
spans expressing identifiable personal information. This privacy-oriented
entity recognizer is trained by combining a standard named entity recognition
model with a gazetteer populated by person-related terms extracted from
Wikidata. The second step of the text sanitization process consists in
assessing the privacy risk associated with each detected text span, either
isolated or in combination with other text spans. We present five distinct
indicators of the re-identification risk, respectively based on language model
probabilities, text span classification, sequence labelling, perturbations, and
web search. We provide a contrastive analysis of each privacy indicator and
highlight their benefits and limitations, notably in relation to the available
labeled data.
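As a concrete illustration of the two-step pipeline, the sketch below pairs an off-the-shelf NER model with a toy gazetteer for span detection and uses GPT-2 token surprisal as a stand-in for the language-model-based risk indicator. The component choices, the example text and the helper names (detect_spans, lm_surprisal) are illustrative assumptions, not the system described in the paper.

```python
# Minimal, illustrative sketch of the two-step sanitization pipeline.
# NOT the authors' implementation: spaCy's small English model stands in for
# the privacy-oriented entity recognizer, a tiny hand-made set stands in for
# the Wikidata gazetteer, and GPT-2 token surprisal stands in for the
# language-model-based risk indicator (one of the five indicators).
import torch
import spacy
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

nlp = spacy.load("en_core_web_sm")       # stand-in NER component
gazetteer = {"oncologist", "widower"}    # stand-in person-related terms

def detect_spans(text: str):
    """Step 1: collect candidate spans of identifiable personal information."""
    spans = [(ent.start_char, ent.end_char, ent.text) for ent in nlp(text).ents]
    lowered = text.lower()
    for term in gazetteer:               # add gazetteer hits missed by the NER model
        idx = lowered.find(term)
        if idx != -1:
            spans.append((idx, idx + len(term), text[idx:idx + len(term)]))
    return spans

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def lm_surprisal(text: str, start: int, end: int) -> float:
    """Step 2 (language-model indicator): average surprisal of the span's tokens
    given their left context; a highly surprising span carries specific content
    and is one plausible signal of re-identification risk."""
    enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt")
    offsets = enc.pop("offset_mapping")[0]
    ids = enc["input_ids"][0]
    with torch.no_grad():
        log_probs = torch.log_softmax(lm(**enc).logits[0], dim=-1)
    surprisals = []
    for pos in range(1, ids.shape[0]):   # token 0 has no left context
        tok_start, tok_end = offsets[pos].tolist()
        if tok_start >= start and tok_end <= end:
            surprisals.append(-log_probs[pos - 1, ids[pos]].item())
    return sum(surprisals) / len(surprisals) if surprisals else 0.0

if __name__ == "__main__":
    text = "Maria Kovacs, an oncologist in Oslo, was born on 3 May 1978."
    for s, e, span in detect_spans(text):
        print(f"{span!r}: surprisal = {lm_surprisal(text, s, e):.2f}")
```

The remaining indicators (span classification, sequence labelling, perturbations, web search) could be exposed through a similar span-to-score interface.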
Related papers
- From Attributes to Natural Language: A Survey and Foresight on Text-based Person Re-identification [4.400729890122927]
The aim of text-based person Re-ID is to recognize specific pedestrians by scrutinizing attribute or natural-language descriptions.
There is a notable absence of comprehensive reviews summarizing text-based person Re-ID from a technical perspective.
We introduce a taxonomy spanning Evaluation, Strategy, Architecture, and Optimization dimensions, providing a comprehensive survey of the text-based person Re-ID task.
arXiv Detail & Related papers (2024-07-31T18:16:18Z)
- Spotting AI's Touch: Identifying LLM-Paraphrased Spans in Text [61.22649031769564]
We propose a novel framework, paraphrased text span detection (PTD), which aims to identify paraphrased text spans within a text.
We construct a dedicated dataset, PASTED, for paraphrased text span detection.
arXiv Detail & Related papers (2024-05-21T11:22:27Z)
- X-PARADE: Cross-Lingual Textual Entailment and Information Divergence across Paragraphs [55.80189506270598]
X-PARADE is the first cross-lingual dataset of paragraph-level information divergences.
Annotators label a paragraph in a target language at the span level and evaluate it with respect to a corresponding paragraph in a source language.
Aligned paragraphs are sourced from Wikipedia pages in different languages.
arXiv Detail & Related papers (2023-09-16T04:34:55Z)
- Multiview Identifiers Enhanced Generative Retrieval [78.38443356800848]
Generative retrieval generates identifier strings of passages as the retrieval target.
We propose a new type of identifier, synthetic identifiers, generated from the content of a passage.
Our proposed approach performs best in generative retrieval, demonstrating its effectiveness and robustness.
arXiv Detail & Related papers (2023-05-26T06:50:21Z)
- PropSegmEnt: A Large-Scale Corpus for Proposition-Level Segmentation and Entailment Recognition [63.51569687229681]
We argue for the need to recognize the textual entailment relation of each proposition in a sentence individually.
We propose PropSegmEnt, a corpus of over 45K propositions annotated by expert human raters.
Our dataset structure resembles the tasks of (1) segmenting sentences within a document into the set of propositions, and (2) classifying the entailment relation of each proposition with respect to a different yet topically-aligned document.
arXiv Detail & Related papers (2022-12-21T04:03:33Z)
- An Easy-to-use and Robust Approach for the Differentially Private De-Identification of Clinical Textual Documents [0.0]
This paper shows how an efficient and differentially private de-identification approach can be achieved by strengthening the less robust de-identification method.
The result is an approach for de-identifying clinical documents in French that is also generalizable to other languages.
arXiv Detail & Related papers (2022-11-02T14:25:09Z)
- Unsupervised Text Deidentification [101.2219634341714]
We propose an unsupervised deidentification method that masks words that leak personally-identifying information.
Motivated by K-anonymity-based privacy, we generate redactions that ensure a minimum reidentification rank (sketched below).
arXiv Detail & Related papers (2022-10-20T18:54:39Z)
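To make the minimum-reidentification-rank criterion of the entry above concrete, here is a minimal sketch of one way such a check could be expressed, assuming access to a scorer that rates how well a redacted text matches a candidate individual; the function names, the candidate pool and the threshold k are placeholders, not the cited paper's method.

```python
# Hedged sketch of a minimum-reidentification-rank check in the spirit of the
# entry above. The scorer is a placeholder: any model that rates how well a
# redacted text matches a candidate individual could be plugged in.
from typing import Callable, Sequence

def reidentification_rank(redacted_text: str, true_person: str,
                          candidates: Sequence[str],
                          score: Callable[[str, str], float]) -> int:
    """Rank (1 = most confidently matched) of the true individual among all
    candidates, according to the scorer."""
    ranked = sorted(candidates, key=lambda person: score(redacted_text, person),
                    reverse=True)
    return ranked.index(true_person) + 1

def redaction_is_acceptable(redacted_text, true_person, candidates, score, k=10):
    """Accept a redaction only if the true individual cannot be singled out
    more easily than k-th place (a K-anonymity-style criterion)."""
    return reidentification_rank(redacted_text, true_person, candidates, score) >= k
```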
- Textual Entailment Recognition with Semantic Features from Empirical Text Representation [60.31047947815282]
A text entails a hypothesis if and only if the truth of the hypothesis follows from the text.
In this paper, we propose a novel approach to identifying the textual entailment relationship between text and hypothesis.
We employ an element-wise Manhattan distance vector-based feature that can identify the semantic entailment relationship between the text-hypothesis pair.
arXiv Detail & Related papers (2022-10-18T10:03:51Z)
- The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization [2.9849405664643585]
We present a novel benchmark and associated evaluation metrics for assessing the performance of text anonymization methods.
Text anonymization, defined as the task of editing a text document to prevent the disclosure of personal information, currently suffers from a shortage of privacy-oriented annotated text resources.
This paper presents TAB (Text Anonymization Benchmark), a new, open-source annotated corpus developed to address this shortage.
arXiv Detail & Related papers (2022-01-25T14:34:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.