NAP^2: A Benchmark for Naturalness and Privacy-Preserving Text Rewriting by Learning from Human
- URL: http://arxiv.org/abs/2406.03749v1
- Date: Thu, 6 Jun 2024 05:07:44 GMT
- Title: NAP^2: A Benchmark for Naturalness and Privacy-Preserving Text Rewriting by Learning from Human
- Authors: Shuo Huang, William MacLean, Xiaoxi Kang, Anqi Wu, Lizhen Qu, Qiongkai Xu, Zhuang Li, Xingliang Yuan, Gholamreza Haffari
- Abstract summary: We suggest sanitizing sensitive text using two strategies commonly employed by humans.
We curate the first corpus, coined NAP^2, through both crowdsourcing and the use of large language models.
- Score: 55.20137833039499
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Increasing concerns about privacy leakage issues in academia and industry arise when employing NLP models from third-party providers to process sensitive texts. To protect privacy before sending sensitive data to those models, we suggest sanitizing sensitive text using two common strategies used by humans: i) deleting sensitive expressions, and ii) obscuring sensitive details by abstracting them. To explore the issues and develop a tool for text rewriting, we curate the first corpus, coined NAP^2, through both crowdsourcing and the use of large language models (LLMs). Compared to the prior works based on differential privacy, which lead to a sharp drop in information utility and unnatural texts, the human-inspired approaches result in more natural rewrites and offer an improved balance between privacy protection and data utility, as demonstrated by our extensive experiments.
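To make the two human-inspired strategies concrete, here is a minimal, purely illustrative Python sketch. The sensitive spans, category labels, and abstraction table are invented for the example; the NAP^2 corpus itself is built with crowdworkers and LLMs rather than string replacement.

```python
import re

# Toy annotations of sensitive spans and a hypothetical category -> vaguer-phrase table.
SENSITIVE_SPANS = {
    "John Smith": "PERSON",
    "123 Maple Street": "ADDRESS",
    "HIV": "HEALTH",
}
ABSTRACTIONS = {
    "PERSON": "a person",
    "ADDRESS": "their home",
    "HEALTH": "a medical condition",
}

def sanitize(text: str, strategy: str = "abstract") -> str:
    """Rewrite `text` by (i) deleting or (ii) abstracting annotated sensitive spans."""
    for span, category in SENSITIVE_SPANS.items():
        if strategy == "delete":
            text = text.replace(span, "")
        else:  # "abstract": replace the sensitive detail with a vaguer description
            text = text.replace(span, ABSTRACTIONS[category])
    return re.sub(r"\s{2,}", " ", text).strip()  # tidy up leftover double spaces

sentence = "John Smith, who lives at 123 Maple Street, was diagnosed with HIV."
print(sanitize(sentence, "delete"))    # naive deletion often leaves ungrammatical text
print(sanitize(sentence, "abstract"))  # abstraction keeps the sentence natural and readable
```

The contrast between the two outputs mirrors the paper's motivation: deletion protects privacy but can hurt naturalness and utility, while abstraction keeps the rewrite fluent.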
Related papers
- PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action [54.11479432110771]
PrivacyLens is a novel framework designed to extend privacy-sensitive seeds into expressive vignettes and further into agent trajectories.
We instantiate PrivacyLens with a collection of privacy norms grounded in privacy literature and crowdsourced seeds.
State-of-the-art LMs, like GPT-4 and Llama-3-70B, leak sensitive information in 25.68% and 38.69% of cases, even when prompted with privacy-enhancing instructions.
arXiv Detail & Related papers (2024-08-29T17:58:38Z) - Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory [82.7042006247124]
We show that even the most capable AI models reveal private information in contexts that humans would not, 39% and 57% of the time, respectively.
Our work underscores the immediate need to explore novel inference-time privacy-preserving approaches, based on reasoning and theory of mind.
arXiv Detail & Related papers (2023-10-27T04:15:30Z) - PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind)
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z) - Planting and Mitigating Memorized Content in Predictive-Text Language Models [11.911353678499008]
Language models are widely deployed to provide automatic text completion services in user products.
Recent research has revealed that language models bear considerable risk of memorizing private training data.
In this study, we test the efficacy of a range of privacy-preserving techniques to mitigate unintended memorization of sensitive user text.
arXiv Detail & Related papers (2022-12-16T17:57:14Z) - Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe [32.63295550058343]
We show that a simple and practical recipe in the text domain is effective in generating useful synthetic text with strong privacy protection.
Our method produces synthetic text that is competitive in terms of utility with its non-private counterpart.
arXiv Detail & Related papers (2022-10-25T21:21:17Z) - Just Fine-tune Twice: Selective Differential Privacy for Large Language Models [69.66654761324702]
We propose a simple yet effective just-fine-tune-twice privacy mechanism to achieve SDP for large Transformer-based language models.
Experiments show that our models achieve strong performance while staying robust to the canary insertion attack.
arXiv Detail & Related papers (2022-04-15T22:36:55Z) - Semantics-Preserved Distortion for Personal Privacy Protection in Information Management [65.08939490413037]
This paper suggests a linguistically-grounded approach to distort texts while maintaining semantic integrity.
We present two distinct frameworks for semantics-preserving distortion: a generative approach and a substitutive approach.
We also explore privacy protection in a specific medical information management scenario, showing our method effectively limits sensitive data memorization.
arXiv Detail & Related papers (2022-01-04T04:01:05Z) - CAPE: Context-Aware Private Embeddings for Private Language Learning [0.5156484100374058]
Context-Aware Private Embeddings (CAPE) is a novel approach which preserves privacy during training of embeddings.
CAPE applies calibrated noise through differential privacy, preserving the encoded semantic links while obscuring sensitive information (a minimal sketch of this noise-injection step appears after this list).
Experimental results demonstrate that the proposed approach reduces private information leakage more effectively than either intervention alone.
arXiv Detail & Related papers (2021-08-27T14:50:12Z) - Privacy-Adaptive BERT for Natural Language Understanding [20.821155542969947]
We study how to improve the effectiveness of NLU models under a Local Privacy setting using BERT.
We propose privacy-adaptive LM pretraining methods and demonstrate that they can significantly improve model performance on privatized text input.
arXiv Detail & Related papers (2021-04-15T15:01:28Z)
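The calibrated-noise step mentioned in the CAPE entry above can be illustrated with a generic Laplace-mechanism sketch: clip each embedding so its sensitivity is bounded, then add noise scaled to that bound and a privacy budget epsilon. The clipping bound, noise distribution, and function names below are assumptions chosen for illustration, not CAPE's published procedure.

```python
import numpy as np

def privatize_embeddings(embeddings, epsilon=1.0, clip_norm=1.0, rng=None):
    """Clip each embedding to an L2 norm bound, then add Laplace noise
    calibrated to that bound and epsilon (generic Laplace mechanism,
    not CAPE's exact recipe)."""
    rng = rng or np.random.default_rng()
    # Bound each vector's norm so the per-vector sensitivity is at most clip_norm.
    norms = np.linalg.norm(embeddings, axis=-1, keepdims=True)
    clipped = embeddings * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    # Standard Laplace mechanism: noise scale b = sensitivity / epsilon.
    scale = clip_norm / epsilon
    noise = rng.laplace(loc=0.0, scale=scale, size=clipped.shape)
    return clipped + noise

# Toy usage: a batch of 4 token embeddings of dimension 8.
emb = np.random.default_rng(0).normal(size=(4, 8))
noisy = privatize_embeddings(emb, epsilon=0.5)
```

Smaller epsilon values inject more noise, trading downstream utility for stronger obfuscation of the sensitive information encoded in the embeddings.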