Privacy- and Utility-Preserving NLP with Anonymized Data: A case study
of Pseudonymization
- URL: http://arxiv.org/abs/2306.05561v1
- Date: Thu, 8 Jun 2023 21:06:19 GMT
- Title: Privacy- and Utility-Preserving NLP with Anonymized Data: A case study
of Pseudonymization
- Authors: Oleksandr Yermilov, Vipul Raheja, Artem Chernodub
- Abstract summary: Our work provides crucial insights into the gaps between original and anonymized data.
We make our code, pseudonymized datasets, and downstream models publicly available.
- Score: 22.84767881115746
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work investigates the effectiveness of different pseudonymization
techniques, ranging from rule-based substitutions to using pre-trained Large
Language Models (LLMs), on a variety of datasets and models used for two widely
used NLP tasks: text classification and summarization. Our work provides
crucial insights into the gaps between original and anonymized data (focusing
on the pseudonymization technique) and model quality and fosters future
research into higher-quality anonymization techniques to better balance the
trade-offs between data protection and utility preservation. We make our code,
pseudonymized datasets, and downstream models publicly available
Related papers
- Privacy-preserving datasets by capturing feature distributions with Conditional VAEs [0.11999555634662634]
Conditional Variational Autoencoders (CVAEs) trained on feature vectors extracted from large pre-trained vision foundation models.
Our method notably outperforms traditional approaches in both medical and natural image domains.
Results underscore the potential of generative models to significantly impact deep learning applications in data-scarce and privacy-sensitive environments.
arXiv Detail & Related papers (2024-08-01T15:26:24Z) - Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Text anonymization is crucial for sharing sensitive data while maintaining privacy.
Existing techniques face the emerging challenges of re-identification attack ability of Large Language Models.
This paper proposes a framework composed of three LLM-based components -- a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z) - Benchmarking Advanced Text Anonymisation Methods: A Comparative Study on Novel and Traditional Approaches [5.891554349884001]
This paper compares the performance of transformer-based models and Large Language Models against traditional architectures for text anonymisation.
Our results showcase the strengths and weaknesses of each approach, offering a clear perspective on the efficacy of modern versus traditional methods.
arXiv Detail & Related papers (2024-04-22T12:06:54Z) - Assessing Privacy Risks in Language Models: A Case Study on
Summarization Tasks [65.21536453075275]
We focus on the summarization task and investigate the membership inference (MI) attack.
We exploit text similarity and the model's resistance to document modifications as potential MI signals.
We discuss several safeguards for training summarization models to protect against MI attacks and discuss the inherent trade-off between privacy and utility.
arXiv Detail & Related papers (2023-10-20T05:44:39Z) - PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind)
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z) - Comparison of machine learning models applied on anonymized data with
different techniques [0.0]
We study four classical machine learning methods currently used for classification purposes in order to analyze the results as a function of the anonymization techniques applied and the parameters selected for each of them.
The performance of these models is studied when varying the value of k for k-anonymity and additional tools such as $ell$-diversity, t-closeness and $delta$-disclosure privacy are also deployed on the well-known adult dataset.
arXiv Detail & Related papers (2023-05-12T12:34:07Z) - Pre-trained Language Models for Keyphrase Generation: A Thorough
Empirical Study [76.52997424694767]
We present an in-depth empirical study of keyphrase extraction and keyphrase generation using pre-trained language models.
We show that PLMs have competitive high-resource performance and state-of-the-art low-resource performance.
Further results show that in-domain BERT-like PLMs can be used to build strong and data-efficient keyphrase generation models.
arXiv Detail & Related papers (2022-12-20T13:20:21Z) - Style Transfer as Data Augmentation: A Case Study on Named Entity
Recognition [17.892385961143173]
We propose a new method to transform the text from a high-resource domain to a low-resource domain by changing its style-related attributes.
We design a constrained decoding algorithm along with a set of key ingredients for data selection to guarantee the generation of valid and coherent data.
Our approach is a practical solution to data scarcity, and we expect it to be applicable to other NLP tasks.
arXiv Detail & Related papers (2022-10-14T16:02:03Z) - Cluster-level pseudo-labelling for source-free cross-domain facial
expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER)
Our method exploits self-supervised pretraining to learn good feature representations from the target data.
We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
arXiv Detail & Related papers (2022-10-11T08:24:50Z) - Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.