Privacy Implications of Retrieval-Based Language Models
- URL: http://arxiv.org/abs/2305.14888v1
- Date: Wed, 24 May 2023 08:37:27 GMT
- Title: Privacy Implications of Retrieval-Based Language Models
- Authors: Yangsibo Huang, Samyak Gupta, Zexuan Zhong, Kai Li, Danqi Chen
- Abstract summary: We present the first study of privacy risks in retrieval-based LMs, particularly $k$NN-LMs.
We find that $k$NN-LMs are more susceptible to leaking private information from their private datastore than parametric models.
- Score: 26.87950501433784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval-based language models (LMs) have demonstrated improved
interpretability, factuality, and adaptability compared to their parametric
counterparts, by incorporating retrieved text from external datastores. While
it is well known that parametric models are prone to leaking private data, it
remains unclear how the addition of a retrieval datastore impacts model
privacy. In this work, we present the first study of privacy risks in
retrieval-based LMs, particularly $k$NN-LMs. Our goal is to explore the optimal
design and training procedure in domains where privacy is of concern, aiming to
strike a balance between utility and privacy. Crucially, we find that $k$NN-LMs
are more susceptible to leaking private information from their private
datastore than parametric models. We further explore mitigations of privacy
risks. When privacy information is targeted and readily detected in the text,
we find that a simple sanitization step would completely eliminate the risks,
while decoupling query and key encoders achieves an even better utility-privacy
trade-off. Otherwise, we consider strategies of mixing public and private data
in both datastore and encoder training. While these methods offer modest
improvements, they leave considerable room for future work. Together, our
findings provide insights for practitioners to better understand and mitigate
privacy risks in retrieval-based LMs. Our code is available at:
https://github.com/Princeton-SysML/kNNLM_privacy .
Related papers
- FT-PrivacyScore: Personalized Privacy Scoring Service for Machine Learning Participation [4.772368796656325]
In practice, controlled data access remains a mainstream method for protecting data privacy in many industrial and research environments.
We developed the demo prototype FT-PrivacyScore to show that it's possible to efficiently and quantitatively estimate the privacy risk of participating in a model fine-tuning task.
arXiv Detail & Related papers (2024-10-30T02:41:26Z) - KnowledgeSG: Privacy-Preserving Synthetic Text Generation with Knowledge Distillation from Server [48.04903443425111]
Large language models (LLMs) facilitate many parties to fine-tune LLMs on their own private data.
Existing solutions, such as utilizing synthetic data for substitution, struggle to simultaneously improve performance and preserve privacy.
We propose KnowledgeSG, a novel client-server framework which enhances synthetic data quality and improves model performance while ensuring privacy.
arXiv Detail & Related papers (2024-10-08T06:42:28Z) - PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action [54.11479432110771]
PrivacyLens is a novel framework designed to extend privacy-sensitive seeds into expressive vignettes and further into agent trajectories.
We instantiate PrivacyLens with a collection of privacy norms grounded in privacy literature and crowdsourced seeds.
State-of-the-art LMs, like GPT-4 and Llama-3-70B, leak sensitive information in 25.68% and 38.69% of cases, even when prompted with privacy-enhancing instructions.
arXiv Detail & Related papers (2024-08-29T17:58:38Z) - Defining 'Good': Evaluation Framework for Synthetic Smart Meter Data [14.779917834583577]
We show that standard privacy attack methods are inadequate for assessing privacy risks of smart meter datasets.
We propose an improved method by injecting training data with implausible outliers, then launching privacy attacks directly on these outliers.
arXiv Detail & Related papers (2024-07-16T14:41:27Z) - The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented
Generation (RAG) [56.67603627046346]
Retrieval-augmented generation (RAG) is a powerful technique to facilitate language model with proprietary and private data.
In this work, we conduct empirical studies with novel attack methods, which demonstrate the vulnerability of RAG systems on leaking the private retrieval database.
arXiv Detail & Related papers (2024-02-23T18:35:15Z) - Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory [82.7042006247124]
We show that even the most capable AI models reveal private information in contexts that humans would not, 39% and 57% of the time, respectively.
Our work underscores the immediate need to explore novel inference-time privacy-preserving approaches, based on reasoning and theory of mind.
arXiv Detail & Related papers (2023-10-27T04:15:30Z) - Privacy Preserving Large Language Models: ChatGPT Case Study Based Vision and Framework [6.828884629694705]
This article proposes the conceptual model called PrivChatGPT, a privacy-generative model for LLMs.
PrivChatGPT consists of two main components i.e., preserving user privacy during the data curation/pre-processing together with preserving private context and the private training process for large-scale data.
arXiv Detail & Related papers (2023-10-19T06:55:13Z) - PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind)
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z) - Knowledge Unlearning for Mitigating Privacy Risks in Language Models [31.322818016245087]
We propose knowledge unlearning as an alternative method to reduce privacy risks for language models.
We show that simply applying the unlikelihood training objective to target token sequences is effective at forgetting them.
We show that unlearning can give a stronger empirical privacy guarantee in scenarios where the data vulnerable to extraction attacks are known a priori.
arXiv Detail & Related papers (2022-10-04T10:18:11Z) - Federated Deep Learning with Bayesian Privacy [28.99404058773532]
Federated learning (FL) aims to protect data privacy by cooperatively learning a model without sharing private data among users.
Homomorphic encryption (HE) based methods provide secure privacy protections but suffer from extremely high computational and communication overheads.
Deep learning with Differential Privacy (DP) was implemented as a practical learning algorithm at a manageable cost in complexity.
arXiv Detail & Related papers (2021-09-27T12:48:40Z) - Private Reinforcement Learning with PAC and Regret Guarantees [69.4202374491817]
We design privacy preserving exploration policies for episodic reinforcement learning (RL)
We first provide a meaningful privacy formulation using the notion of joint differential privacy (JDP)
We then develop a private optimism-based learning algorithm that simultaneously achieves strong PAC and regret bounds, and enjoys a JDP guarantee.
arXiv Detail & Related papers (2020-09-18T20:18:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.