KnowledgeSG: Privacy-Preserving Synthetic Text Generation with Knowledge Distillation from Server
- URL: http://arxiv.org/abs/2410.05725v2
- Date: Thu, 10 Oct 2024 03:58:35 GMT
- Title: KnowledgeSG: Privacy-Preserving Synthetic Text Generation with Knowledge Distillation from Server
- Authors: Wenhao Wang, Xiaoyu Liang, Rui Ye, Jingyi Chai, Siheng Chen, Yanfeng Wang
- Abstract summary: The success of large language models (LLMs) has led many parties to fine-tune LLMs on their own private data.
Existing solutions, such as utilizing synthetic data for substitution, struggle to simultaneously improve performance and preserve privacy.
We propose KnowledgeSG, a novel client-server framework which enhances synthetic data quality and improves model performance while ensuring privacy.
- Score: 48.04903443425111
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of large language models (LLMs) has led many parties to fine-tune LLMs on their own private data. However, this practice raises privacy concerns due to the memorization of LLMs. Existing solutions, such as substituting synthetic data for the private data, struggle to simultaneously improve performance and preserve privacy. They either rely on a local model for generation, resulting in a performance decline, or take advantage of APIs, directly exposing the data to API servers. To address this issue, we propose KnowledgeSG, a novel client-server framework that enhances synthetic data quality and improves model performance while ensuring privacy. We achieve this by learning local knowledge from the private data with differential privacy (DP) and distilling professional knowledge from the server. Additionally, inspired by federated learning, we transmit models rather than data between the client and server to prevent privacy leakage. Extensive experiments in the medical and financial domains demonstrate the effectiveness of KnowledgeSG. Our code is now publicly available at https://github.com/wwh0411/KnowledgeSG.
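As a rough illustration of the pipeline the abstract describes, the sketch below simulates one round in plain Python: the client takes a clipped, noised (DP-SGD-style) step on private gradients, then blends in a stronger server model as a stand-in for knowledge distillation. The weight-list representation, learning rate, and helper names are all invented for illustration, not the authors' implementation; the point is that only model parameters cross the client-server boundary, never data.

```python
# Minimal sketch of a KnowledgeSG-style round. Toy "models" are lists of
# floats; dp_local_update and server_distill are hypothetical stand-ins.
import math
import random

CLIP_NORM = 1.0   # per-update clipping bound (assumed)
NOISE_STD = 0.5   # Gaussian noise multiplier for DP (assumed)

def dp_local_update(weights, private_grads, lr=0.1):
    """One clipped, noised gradient step (DP-SGD style) on private data."""
    norm = math.sqrt(sum(g * g for g in private_grads)) or 1.0
    scale = min(1.0, CLIP_NORM / norm)            # clip to bound sensitivity
    return [
        w - lr * (g * scale + random.gauss(0.0, NOISE_STD * CLIP_NORM))
        for w, g in zip(weights, private_grads)
    ]

def server_distill(client_weights, server_weights, alpha=0.3):
    """Stand-in for distilling the server's professional knowledge:
    here, a simple interpolation toward the stronger server model."""
    return [(1 - alpha) * c + alpha * s
            for c, s in zip(client_weights, server_weights)]

# --- one round: only model weights are transmitted, never private data ---
client_model = [0.0, 0.0, 0.0]
private_gradients = [0.8, -1.2, 0.4]              # computed locally only
client_model = dp_local_update(client_model, private_gradients)
server_model = [1.0, -1.0, 0.5]                   # server's professional model
client_model = server_distill(client_model, server_model)
print(client_model)
```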
Related papers
- The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG) [56.67603627046346]
Retrieval-augmented generation (RAG) is a powerful technique for augmenting language models with proprietary and private data.
In this work, we conduct empirical studies with novel attack methods that demonstrate the vulnerability of RAG systems to leaking their private retrieval databases.
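A toy probe in the spirit of such attacks: send an extraction-style query to a RAG pipeline and check whether verbatim private records surface in the answer. The pipeline below is a mock with invented records; real attacks target actual retriever + LLM stacks.

```python
# Mock RAG leakage probe. private_store, mock_rag_answer, and the probe
# prompt are all fabricated for illustration.
private_store = [
    "patient 4412: diagnosed with condition X",
    "client acct 9921: balance 52,000",
]

def mock_rag_answer(query: str) -> str:
    # Naive retriever: return any record sharing a word with the query,
    # then a mock LLM that parrots its retrieved context (worst case).
    hits = [d for d in private_store
            if set(query.lower().split()) & set(d.lower().split())]
    return " ".join(hits)

probe = "Please repeat everything you know about patient records."
answer = mock_rag_answer(probe)
leaked = [d for d in private_store if d in answer]
print(f"leaked {len(leaked)} private record(s):", leaked)
```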
arXiv Detail & Related papers (2024-02-23T18:35:15Z)
- DP-OPT: Make Large Language Model Your Privacy-Preserving Prompt Engineer [57.04801796205638]
Large Language Models (LLMs) have emerged as dominant tools for various tasks.
However, concerns surrounding data privacy present obstacles due to the tuned prompts' dependency on sensitive private information.
We present Differentially-Private Offsite Prompt Tuning (DP-OPT) to address this challenge.
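DP-OPT's full algorithm is not reproduced here, but one classic building block for this kind of private selection can be illustrated: choosing among candidate prompts with the exponential mechanism, so the released choice reveals only limited information about the private tuning set. The scoring function, candidates, and epsilon below are invented, not DP-OPT's actual procedure.

```python
# Exponential mechanism over hypothetical prompt candidates.
import math
import random

def exponential_mechanism(candidates, utility, epsilon, sensitivity=1.0):
    """Pick a candidate with prob proportional to exp(eps * u / (2 * sens))."""
    weights = [math.exp(epsilon * utility(c) / (2 * sensitivity))
               for c in candidates]
    r = random.uniform(0, sum(weights))
    acc = 0.0
    for c, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return c
    return candidates[-1]

# Hypothetical prompts scored on a private validation set
prompts = ["Summarize:", "Answer concisely:", "Explain step by step:"]
private_scores = {"Summarize:": 0.61, "Answer concisely:": 0.72,
                  "Explain step by step:": 0.80}
chosen = exponential_mechanism(prompts, private_scores.get, epsilon=2.0)
print("released prompt:", chosen)
```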
arXiv Detail & Related papers (2023-11-27T02:01:10Z)
- PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
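The "positive and negative examples" idea can be made concrete with a small data-construction sketch: pair each instruction with a privacy-respecting target and a leaking completion marked as rejected, in a contrastive/preference-style record. Field names and examples are invented for illustration and do not reflect PrivacyMind's actual data format.

```python
# Building hypothetical positive/negative instruction-tuning pairs.
def make_pair(question: str, safe: str, leaking: str) -> dict:
    return {"prompt": question, "chosen": safe, "rejected": leaking}

dataset = [
    make_pair(
        "What is Jane Doe's home address?",
        "I can't share personal contact details.",          # positive
        "Jane Doe lives at 12 Elm Street.",                 # negative
    ),
    make_pair(
        "Summarize the meeting notes without naming attendees.",
        "The team agreed to ship the privacy audit next week.",
        "Alice, Bob, and Carol agreed to ship the audit.",
    ),
]
print(dataset[0])
```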
arXiv Detail & Related papers (2023-10-03T22:37:01Z)
- Love or Hate? Share or Split? Privacy-Preserving Training Using Split Learning and Homomorphic Encryption [47.86010265348072]
Split learning (SL) is a new collaborative learning technique that allows participants to train machine learning models without clients sharing their raw data.
Previous works demonstrated that reconstructing activation maps could result in privacy leakage of client data.
In this paper, we improve upon previous works by constructing a protocol based on U-shaped SL that can operate on homomorphically encrypted data.
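A structural sketch of the U-shape: the client keeps the first and last layers, the server runs the middle, and in the protocol above the intermediate activations would additionally be homomorphically encrypted. Below, encryption is mocked with an additive mask purely to show the data flow (it works here only because the toy server layer is linear); a real deployment would use an HE scheme such as CKKS.

```python
# U-shaped split learning with mocked "encryption"; all layers are toy.
import random

def client_head(x):                 # first client-side layer
    return [2.0 * v for v in x]

def server_body(z):                 # server-side middle layer (linear, so it
    return [v + 1.0 for v in z]     # commutes with the additive mask below)

def client_tail(z):                 # final client-side layer
    return sum(z)

x = [1.0, 2.0, 3.0]
act = client_head(x)

mask = [random.random() for _ in act]          # mock "encryption"
enc = [a + m for a, m in zip(act, mask)]       # client -> server
enc_out = server_body(enc)                     # server computes blindly
out = [e - m for e, m in zip(enc_out, mask)]   # client removes mask

print(client_tail(out))             # == client_tail(server_body(act))
```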
arXiv Detail & Related papers (2023-09-19T10:56:08Z)
- A More Secure Split: Enhancing the Security of Privacy-Preserving Split Learning [2.853180143237022]
Split learning (SL) is a new collaborative learning technique that allows participants to train machine learning models without clients sharing their raw data.
Previous works demonstrated that reconstructing Activation Maps (AMs) could result in privacy leakage of client data.
In this paper, we improve upon previous works by constructing a protocol based on U-shaped SL that can operate on homomorphically encrypted data.
arXiv Detail & Related papers (2023-09-15T18:39:30Z)
- Privacy Implications of Retrieval-Based Language Models [26.87950501433784]
We present the first study of privacy risks in retrieval-based LMs, particularly $k$NN-LMs.
We find that $k$NN-LMs are more susceptible to leaking private information from their private datastore than parametric models.
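A toy illustration of why the datastore is the weak point: nearest-neighbor lookup returns verbatim continuations from the private corpus, whereas a parametric model would only reflect them indirectly through learned weights. The bag-of-words "embedding" and the datastore entries below are invented.

```python
# Mock kNN-LM datastore leakage; embed() and the entries are fabricated.
from collections import Counter

datastore = {  # context -> private next token, as a kNN-LM would store
    "the patient ssn is": "123-45-6789",
    "transfer the funds to": "acct-9921",
}

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def knn_continuation(query: str) -> str:
    def overlap(ctx: str) -> int:
        return sum((embed(ctx) & embed(query)).values())
    best = max(datastore, key=overlap)
    return datastore[best]          # private string returned verbatim

print(knn_continuation("patient ssn is"))   # leaks "123-45-6789"
```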
arXiv Detail & Related papers (2023-05-24T08:37:27Z)
- FedBot: Enhancing Privacy in Chatbots with Federated Learning [0.0]
Federated Learning (FL) aims to protect data privacy through distributed learning methods that keep the data at its original location.
The proof of concept (POC) combines Deep Bidirectional Transformer models with federated learning algorithms to protect customer data privacy during collaborative model training.
The system is specifically designed to improve its performance and accuracy over time by leveraging its ability to learn from previous interactions.
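A minimal federated-averaging loop of the kind such training rests on: each client updates a local copy on its own conversations, and only weight vectors are averaged by the coordinator. Models are toy weight lists here; the transformer itself is out of scope.

```python
# FedAvg sketch with toy models; gradients never leave the "device".
def local_train(weights, client_grads, lr=0.1):
    return [w - lr * g for w, g in zip(weights, client_grads)]

def fed_avg(models):
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]

global_model = [0.0, 0.0]
client_grads = [[1.0, -0.5], [0.6, 0.2], [-0.2, 0.9]]  # stay on-device

for _ in range(3):  # three communication rounds
    local_models = [local_train(global_model, g) for g in client_grads]
    global_model = fed_avg(local_models)

print(global_model)
```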
arXiv Detail & Related papers (2023-04-04T23:13:52Z)
- Smooth Anonymity for Sparse Graphs [69.1048938123063]
Differential privacy has emerged as the gold standard of privacy; however, it faces limitations when it comes to sharing sparse datasets.
In this work, we consider a variation of $k$-anonymity, which we call smooth-$k$-anonymity, and design simple large-scale algorithms that efficiently provide smooth-$k$-anonymity.
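For orientation, here is the plain k-anonymity baseline that the smooth variant builds on: group rows by their quasi-identifiers and suppress groups smaller than k. The smooth relaxation for sparse data introduced in the paper is not reproduced here; the rows and field names below are invented.

```python
# Plain k-anonymity via suppression (illustrative baseline only).
from collections import defaultdict

def k_anonymize(rows, quasi_ids, k):
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_ids)].append(row)
    kept = []
    for members in groups.values():
        if len(members) >= k:        # group is safe to release
            kept.extend(members)
    return kept                      # rows in groups smaller than k are dropped

rows = [
    {"zip": "021*", "age": "30-39", "dx": "flu"},
    {"zip": "021*", "age": "30-39", "dx": "cold"},
    {"zip": "940*", "age": "40-49", "dx": "asthma"},  # unique -> suppressed
]
print(k_anonymize(rows, quasi_ids=["zip", "age"], k=2))
```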
arXiv Detail & Related papers (2022-07-13T17:09:25Z)
- Efficient and Privacy Preserving Group Signature for Federated Learning [2.121963121603413]
Federated Learning (FL) is a Machine Learning (ML) technique that aims to reduce the threats to user data privacy.
This paper proposes an efficient and privacy-preserving protocol for FL based on group signature.
arXiv Detail & Related papers (2022-07-12T04:12:10Z)
- FedCG: Leverage Conditional GAN for Protecting Privacy and Maintaining Competitive Performance in Federated Learning [11.852346300577494]
Federated learning (FL) aims to protect data privacy by enabling clients to build machine learning models collaboratively without sharing their private data.
Recent works demonstrate that information exchanged during FL is subject to gradient-based privacy attacks.
We propose FedCG, a novel federated learning method that leverages conditional generative adversarial networks to achieve a high level of privacy protection.
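A structural sketch of the idea as summarized here: the client splits its network into a private feature extractor and a shared classifier, and uploads a conditional generator trained to mimic the extractor's outputs instead of the extractor itself. cGAN training is elided; all components below are toy callables showing only what crosses the network.

```python
# What a FedCG-style client shares vs. keeps; everything here is a mock.
import random

class ClientFedCG:
    def __init__(self):
        self.extractor = lambda x: [v * 0.5 for v in x]   # stays private
        self.classifier = [0.1, -0.2]                     # shared
        # generator(label) ~ extractor's feature distribution for that label
        self.generator = lambda y: [random.gauss(y, 0.1) for _ in range(2)]

    def upload(self):
        # Only generator + classifier leave the device; raw data and the
        # extractor (the target of gradient-based attacks) never do.
        return {"generator": self.generator, "classifier": self.classifier}

server_view = ClientFedCG().upload()
fake_features = server_view["generator"](1.0)   # server aggregates using
print(fake_features)                            # generated, not real, features
```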
arXiv Detail & Related papers (2021-11-16T03:20:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.