Privacy Guarantees for De-identifying Text Transformations
- URL: http://arxiv.org/abs/2008.03101v1
- Date: Fri, 7 Aug 2020 12:06:42 GMT
- Title: Privacy Guarantees for De-identifying Text Transformations
- Authors: David Ifeoluwa Adelani, Ali Davody, Thomas Kleinbauer, and Dietrich Klakow
- Abstract summary: We derive formal privacy guarantees for text transformation-based de-identification methods on the basis of Differential Privacy.
We compare a simple redact approach with more sophisticated word-by-word replacement using deep learning models on multiple natural language understanding tasks.
We find that only word-by-word replacement is robust against performance drops in various tasks.
- Score: 17.636430224292866
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine Learning approaches to Natural Language Processing tasks benefit from
a comprehensive collection of real-life user data. At the same time, there is a
clear need for protecting the privacy of the users whose data is collected and
processed. For text collections, such as transcripts of voice
interactions or patient records, replacing sensitive parts with benign
alternatives can provide de-identification. However, how much privacy is
actually guaranteed by such text transformations, and are the resulting texts
still useful for machine learning? In this paper, we derive formal privacy
guarantees for general text transformation-based de-identification methods on
the basis of Differential Privacy. We also measure the effect that different
ways of masking private information in dialog transcripts have on a subsequent
machine learning task. To this end, we formulate different masking strategies
and compare their privacy-utility trade-offs. In particular, we compare a
simple redact approach with more sophisticated word-by-word replacement using
deep learning models on multiple natural language understanding tasks like
named entity recognition, intent detection, and dialog act classification. We
find that only word-by-word replacement is robust against performance drops in
various tasks.
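As a rough illustration of the two masking strategies the abstract contrasts, the sketch below applies a simple redact step and a word-by-word replacement step to the same sentence. It is not the authors' method: the paper uses deep learning models for replacement, whereas this sketch assumes a tiny hand-written lexicon (`SENSITIVE`) and a random choice from fixed substitute lists (`SUBSTITUTES`). All names and word lists here are hypothetical.

```python
import random

# Hypothetical lexicon mapping a sensitive token to its category (assumption).
SENSITIVE = {"alice": "NAME", "bob": "NAME", "berlin": "CITY", "paris": "CITY"}

# Hypothetical benign substitutes per category; a stand-in for a learned model.
SUBSTITUTES = {
    "NAME": ["jordan", "casey", "riley"],
    "CITY": ["springfield", "lakeside"],
}


def redact(tokens):
    """Replace every sensitive token with one fixed placeholder."""
    return ["[REDACTED]" if t.lower() in SENSITIVE else t for t in tokens]


def replace_word_by_word(tokens, rng):
    """Replace each sensitive token with a random substitute of the same category."""
    out = []
    for t in tokens:
        category = SENSITIVE.get(t.lower())
        out.append(rng.choice(SUBSTITUTES[category]) if category else t)
    return out


tokens = "Alice flew to Berlin yesterday".split()
redacted = redact(tokens)
replaced = replace_word_by_word(tokens, random.Random(0))
print(" ".join(redacted))
print(" ".join(replaced))
```

Redaction removes the sensitive words but also removes the cue that a name or city stood there, while replacement keeps a plausible token of the same type in place. That difference is one plausible reason a downstream task such as named entity recognition might degrade less under replacement.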
Related papers
- Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs.
We propose a more realistic setting in which only noisy text and its NER labels are available.
We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z)
- Protecting Privacy in Classifiers by Token Manipulation [3.5033860596797965]
We focus on text classification models, examining various token mapping and contextualized manipulation functions.
We find that although some token mapping functions are easy and straightforward to implement, they heavily influence performance on the downstream task.
In comparison, the contextualized manipulation provides an improvement in performance.
arXiv Detail & Related papers (2024-07-01T14:41:59Z)
- IDT: Dual-Task Adversarial Attacks for Privacy Protection [8.312362092693377]
Methods to protect privacy can involve using representations inside models that have been shown not to detect sensitive attributes.
We propose IDT, a method that analyses predictions made by auxiliary and interpretable models to identify which tokens are important to change.
We evaluate on several NLP datasets suited to different tasks.
arXiv Detail & Related papers (2024-06-28T04:14:35Z)
- Just Rewrite It Again: A Post-Processing Method for Enhanced Semantic Similarity and Privacy Preservation of Differentially Private Rewritten Text [3.3916160303055567]
We propose a simple post-processing method based on the goal of aligning rewritten texts with their original counterparts.
Our results show that such an approach not only produces outputs that are more semantically reminiscent of the original inputs, but also texts which score on average better in empirical privacy evaluations.
arXiv Detail & Related papers (2024-05-30T08:41:33Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- Guiding Text-to-Text Privatization by Syntax [0.0]
Metric Differential Privacy is a generalization of differential privacy tailored to address the unique challenges of text-to-text privatization.
We analyze the capability of text-to-text privatization to preserve the grammatical category of words after substitution.
We transform the privatization step into a candidate selection problem in which substitutions are directed to words with matching grammatical properties.
arXiv Detail & Related papers (2023-06-02T11:52:21Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks, including speech recognition, speech translation, and the universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- Reading and Writing: Discriminative and Generative Modeling for Self-Supervised Text Recognition [101.60244147302197]
We introduce contrastive learning and masked image modeling to learn discrimination and generation of text images.
Our method outperforms previous self-supervised text recognition methods by 10.2%-20.2% on irregular scene text recognition datasets.
Our proposed text recognizer exceeds previous state-of-the-art text recognition methods by an average of 5.3% on 11 benchmarks, with similar model size.
arXiv Detail & Related papers (2022-07-01T03:50:26Z)
- Semantics-Preserved Distortion for Personal Privacy Protection in Information Management [65.08939490413037]
This paper suggests a linguistically-grounded approach to distort texts while maintaining semantic integrity.
We present two distinct frameworks for semantic-preserving distortion: a generative approach and a substitutive approach.
We also explore privacy protection in a specific medical information management scenario, showing our method effectively limits sensitive data memorization.
arXiv Detail & Related papers (2022-01-04T04:01:05Z)
- CAPE: Context-Aware Private Embeddings for Private Language Learning [0.5156484100374058]
Context-Aware Private Embeddings (CAPE) is a novel approach which preserves privacy during training of embeddings.
CAPE applies calibrated noise through differential privacy, preserving the encoded semantic links while obscuring sensitive information.
Experimental results demonstrate that the proposed approach reduces private information leakage better than either single intervention.
arXiv Detail & Related papers (2021-08-27T14:50:12Z)
- Protecting gender and identity with disentangled speech representations [49.00162808063399]
We show that protecting gender information in speech is more effective than modelling speaker-identity information.
We present a novel way to encode gender information and disentangle two sensitive biometric identifiers.
arXiv Detail & Related papers (2021-04-22T13:31:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.