NAP^2: A Benchmark for Naturalness and Privacy-Preserving Text Rewriting by Learning from Human
- URL: http://arxiv.org/abs/2406.03749v1
- Date: Thu, 6 Jun 2024 05:07:44 GMT
- Title: NAP^2: A Benchmark for Naturalness and Privacy-Preserving Text Rewriting by Learning from Human
- Authors: Shuo Huang, William MacLean, Xiaoxi Kang, Anqi Wu, Lizhen Qu, Qiongkai Xu, Zhuang Li, Xingliang Yuan, Gholamreza Haffari,
- Abstract summary: We suggest sanitizing sensitive text using two common strategies used by humans.
We curate the first corpus, coined NAP2, through both crowdsourcing and the use of large language models.
- Score: 55.20137833039499
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Increasing concerns about privacy leakage issues in academia and industry arise when employing NLP models from third-party providers to process sensitive texts. To protect privacy before sending sensitive data to those models, we suggest sanitizing sensitive text using two common strategies used by humans: i) deleting sensitive expressions, and ii) obscuring sensitive details by abstracting them. To explore the issues and develop a tool for text rewriting, we curate the first corpus, coined NAP^2, through both crowdsourcing and the use of large language models (LLMs). Compared to the prior works based on differential privacy, which lead to a sharp drop in information utility and unnatural texts, the human-inspired approaches result in more natural rewrites and offer an improved balance between privacy protection and data utility, as demonstrated by our extensive experiments.
Related papers
- Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory [82.7042006247124]
We show that even the most capable AI models reveal private information in contexts that humans would not, 39% and 57% of the time, respectively.
Our work underscores the immediate need to explore novel inference-time privacy-preserving approaches, based on reasoning and theory of mind.
arXiv Detail & Related papers (2023-10-27T04:15:30Z) - ChatGPT for Us: Preserving Data Privacy in ChatGPT via Dialogue Text
Ambiguation to Expand Mental Health Care Delivery [52.73936514734762]
ChatGPT has gained popularity for its ability to generate human-like dialogue.
Data-sensitive domains face challenges in using ChatGPT due to privacy and data-ownership concerns.
We propose a text ambiguation framework that preserves user privacy.
arXiv Detail & Related papers (2023-05-19T02:09:52Z) - Planting and Mitigating Memorized Content in Predictive-Text Language
Models [11.911353678499008]
Language models are widely deployed to provide automatic text completion services in user products.
Recent research has revealed that language models bear considerable risk of memorizing private training data.
In this study, we test the efficacy of a range of privacy-preserving techniques to mitigate unintended memorization of sensitive user text.
arXiv Detail & Related papers (2022-12-16T17:57:14Z) - User-Entity Differential Privacy in Learning Natural Language Models [46.177052564590646]
We introduce a novel concept of user-entity differential privacy (UeDP) to provide formal privacy protection simultaneously to both sensitive entities in textual data and data owners in learning natural language models (NLMs)
To preserve UeDP, we developed a novel algorithm, called UeDP-Alg, optimizing the trade-off between privacy loss and model utility with a tight bound sensitivity derived from seamlessly combining user and sensitive entity sampling processes.
An extensive theoretical analysis and evaluation show that our UeDP-Alg outperforms baseline approaches in model utility under the same privacy budget consumption on several NLM tasks, using benchmark datasets
arXiv Detail & Related papers (2022-11-01T16:54:23Z) - Synthetic Text Generation with Differential Privacy: A Simple and
Practical Recipe [32.63295550058343]
We show that a simple and practical recipe in the text domain is effective in generating useful synthetic text with strong privacy protection.
Our method produces synthetic text that is competitive in terms of utility with its non-private counterpart.
arXiv Detail & Related papers (2022-10-25T21:21:17Z) - Just Fine-tune Twice: Selective Differential Privacy for Large Language
Models [69.66654761324702]
We propose a simple yet effective just-fine-tune-twice privacy mechanism to achieve SDP for large Transformer-based language models.
Experiments show that our models achieve strong performance while staying robust to the canary insertion attack.
arXiv Detail & Related papers (2022-04-15T22:36:55Z) - Semantics-Preserved Distortion for Personal Privacy Protection in Information Management [65.08939490413037]
This paper suggests a linguistically-grounded approach to distort texts while maintaining semantic integrity.
We present two distinct frameworks for semantic-preserving distortion: a generative approach and a substitutive approach.
We also explore privacy protection in a specific medical information management scenario, showing our method effectively limits sensitive data memorization.
arXiv Detail & Related papers (2022-01-04T04:01:05Z) - CAPE: Context-Aware Private Embeddings for Private Language Learning [0.5156484100374058]
Context-Aware Private Embeddings (CAPE) is a novel approach which preserves privacy during training of embeddings.
CAPE applies calibrated noise through differential privacy, preserving the encoded semantic links while obscuring sensitive information.
Experimental results demonstrate that the proposed approach reduces private information leakage better than either single intervention.
arXiv Detail & Related papers (2021-08-27T14:50:12Z) - Privacy-Adaptive BERT for Natural Language Understanding [20.821155542969947]
We study how to improve the effectiveness of NLU models under a Local Privacy setting using BERT.
We propose privacy-adaptive LM pretraining methods and demonstrate that they can significantly improve model performance on privatized text input.
arXiv Detail & Related papers (2021-04-15T15:01:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.