Related papers: Privacy- and Utility-Preserving NLP with Anonymized Data: A case study of Pseudonymization

URL: http://arxiv.org/abs/2306.05561v1
Date: Thu, 8 Jun 2023 21:06:19 GMT
Title: Privacy- and Utility-Preserving NLP with Anonymized Data: A case study of Pseudonymization
Authors: Oleksandr Yermilov, Vipul Raheja, Artem Chernodub
Abstract summary: Our work provides crucial insights into the gaps between original and anonymized data. We make our code, pseudonymized datasets, and downstream models publicly available.
Score: 22.84767881115746
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This work investigates the effectiveness of different pseudonymization techniques, ranging from rule-based substitutions to using pre-trained Large Language Models (LLMs), on a variety of datasets and models used for two widely used NLP tasks: text classification and summarization. Our work provides crucial insights into the gaps between original and anonymized data (focusing on the pseudonymization technique) and model quality and fosters future research into higher-quality anonymization techniques to better balance the trade-offs between data protection and utility preservation. We make our code, pseudonymized datasets, and downstream models publicly available.
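
Illustration: the abstract mentions rule-based substitution as the simplest end of the pseudonymization spectrum. The sketch below shows one possible form of that idea, assuming the sensitive mentions are already known and listed in a lookup table. It is not the authors' implementation; the `PSEUDONYMS` table and the `pseudonymize` function are hypothetical names introduced only for this example.

```python
import re

# Hypothetical replacement table (an assumption for illustration):
# original mention -> pseudonym.
PSEUDONYMS = {
    "Alice Smith": "Jane Doe",
    "Berlin": "Springfield",
}


def pseudonymize(text, mapping):
    """Replace every whole-word occurrence of a known mention with its pseudonym."""
    if not mapping:
        return text
    # Try longer mentions first so they win over any shorter mention they contain.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b"
    )
    return pattern.sub(lambda m: mapping[m.group(0)], text)


example = "Alice Smith moved to Berlin last year."
result = pseudonymize(example, PSEUDONYMS)
print(result)  # Jane Doe moved to Springfield last year.
```

A fixed table like this only covers mentions that were listed in advance. Detecting the mentions in the first place (for example with a named entity recognizer) is a separate step that this sketch does not attempt.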

Related papers

Multi-Objective Optimization-Based Anonymization of Structured Data for Machine Learning [0.5452584641316627]
Our research identifies key limitations in existing optimization models for privacy preservation. We propose a novel multi-objective optimization model that simultaneously minimizes information loss and maximizes protection against attacks.
arXiv Detail & Related papers (2025-01-02T01:52:36Z)
Privacy-preserving datasets by capturing feature distributions with Conditional VAEs [0.11999555634662634]
This work uses Conditional Variational Autoencoders (CVAEs) trained on feature vectors extracted from large pre-trained vision foundation models. Our method notably outperforms traditional approaches in both medical and natural image domains. Results underscore the potential of generative models to significantly impact deep learning applications in data-scarce and privacy-sensitive environments.
arXiv Detail & Related papers (2024-08-01T15:26:24Z)
Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Text anonymization is crucial for sharing sensitive data while maintaining privacy. Existing techniques face the emerging challenge of re-identification attacks enabled by Large Language Models. This paper proposes a framework composed of three LLM-based components: a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z)
Benchmarking Advanced Text Anonymisation Methods: A Comparative Study on Novel and Traditional Approaches [5.891554349884001]
This paper compares the performance of transformer-based models and Large Language Models against traditional architectures for text anonymisation. Our results showcase the strengths and weaknesses of each approach, offering a clear perspective on the efficacy of modern versus traditional methods.
arXiv Detail & Related papers (2024-04-22T12:06:54Z)
Assessing Privacy Risks in Language Models: A Case Study on Summarization Tasks [65.21536453075275]
We focus on the summarization task and investigate the membership inference (MI) attack. We exploit text similarity and the model's resistance to document modifications as potential MI signals. We discuss several safeguards for training summarization models to protect against MI attacks and discuss the inherent trade-off between privacy and utility.
arXiv Detail & Related papers (2023-10-20T05:44:39Z)
PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind). Our work offers a theoretical analysis for model design and benchmarks various techniques. In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z)
Comparison of machine learning models applied on anonymized data with different techniques [0.0]
We study four classical machine learning methods currently used for classification purposes in order to analyze the results as a function of the anonymization techniques applied and the parameters selected for each of them. The performance of these models is studied when varying the value of k for k-anonymity and additional tools such as $\ell$-diversity, t-closeness and $\delta$-disclosure privacy are also deployed on the well-known adult dataset.
arXiv Detail & Related papers (2023-05-12T12:34:07Z)
Pre-trained Language Models for Keyphrase Generation: A Thorough Empirical Study [76.52997424694767]
We present an in-depth empirical study of keyphrase extraction and keyphrase generation using pre-trained language models. We show that PLMs have competitive high-resource performance and state-of-the-art low-resource performance. Further results show that in-domain BERT-like PLMs can be used to build strong and data-efficient keyphrase generation models.
arXiv Detail & Related papers (2022-12-20T13:20:21Z)
Style Transfer as Data Augmentation: A Case Study on Named Entity Recognition [17.892385961143173]
We propose a new method to transform the text from a high-resource domain to a low-resource domain by changing its style-related attributes. We design a constrained decoding algorithm along with a set of key ingredients for data selection to guarantee the generation of valid and coherent data. Our approach is a practical solution to data scarcity, and we expect it to be applicable to other NLP tasks.
arXiv Detail & Related papers (2022-10-14T16:02:03Z)
Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER). Our method exploits self-supervised pretraining to learn good feature representations from the target data. We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
arXiv Detail & Related papers (2022-10-11T08:24:50Z)
Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings. We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data. We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.