Privacy-Preserving Federated Embedding Learning for Localized Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2504.19101v1
- Date: Sun, 27 Apr 2025 04:26:02 GMT
- Title: Privacy-Preserving Federated Embedding Learning for Localized Retrieval-Augmented Generation
- Authors: Qianren Mao, Qili Zhang, Hanwen Hao, Zhentao Han, Runhua Xu, Weifeng Jiang, Qi Hu, Zhijun Chen, Tyler Zhou, Bo Li, Yangqiu Song, Jin Dong, Jianxin Li, Philip S. Yu,
- Abstract summary: We propose a novel framework called Federated Retrieval-Augmented Generation (FedE4RAG)<n>FedE4RAG facilitates collaborative training of client-side RAG retrieval models.<n>We apply homomorphic encryption within federated learning to safeguard model parameters.
- Score: 60.81109086640437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution for enhancing the accuracy and credibility of Large Language Models (LLMs), particularly in Question & Answer tasks. This is achieved by incorporating proprietary and private data from integrated databases. However, private RAG systems face significant challenges due to the scarcity of private domain data and critical data privacy issues. These obstacles impede the deployment of private RAG systems, as developing privacy-preserving RAG systems requires a delicate balance between data security and data availability. To address these challenges, we regard federated learning (FL) as a highly promising technology for privacy-preserving RAG services. We propose a novel framework called Federated Retrieval-Augmented Generation (FedE4RAG). This framework facilitates collaborative training of client-side RAG retrieval models. The parameters of these models are aggregated and distributed on a central-server, ensuring data privacy without direct sharing of raw data. In FedE4RAG, knowledge distillation is employed for communication between the server and client models. This technique improves the generalization of local RAG retrievers during the federated learning process. Additionally, we apply homomorphic encryption within federated learning to safeguard model parameters and mitigate concerns related to data leakage. Extensive experiments conducted on the real-world dataset have validated the effectiveness of FedE4RAG. The results demonstrate that our proposed framework can markedly enhance the performance of private RAG systems while maintaining robust data privacy protection.
Related papers
- FedRand: Enhancing Privacy in Federated Learning with Randomized LoRA Subparameter Updates [58.18162789618869]
Federated Learning (FL) is a widely used framework for training models in a decentralized manner.
We propose the FedRand framework, which avoids disclosing the full set of client parameters.
We empirically validate that FedRand improves robustness against MIAs compared to relevant baselines.
arXiv Detail & Related papers (2025-03-10T11:55:50Z) - FedEM: A Privacy-Preserving Framework for Concurrent Utility Preservation in Federated Learning [17.853502904387376]
Federated Learning (FL) enables collaborative training of models across distributed clients without sharing local data, addressing privacy concerns in decentralized systems.<n>We propose Federated Error Minimization (FedEM), a novel algorithm that incorporates controlled perturbations through adaptive noise injection.<n> Experimental results on benchmark datasets demonstrate that FedEM significantly reduces privacy risks and preserves model accuracy, achieving a robust balance between privacy protection and utility preservation.
arXiv Detail & Related papers (2025-03-08T02:48:00Z) - RAG-Thief: Scalable Extraction of Private Data from Retrieval-Augmented Generation Applications with Agent-based Attacks [18.576435409729655]
We propose an agent-based automated privacy attack called RAG-Thief.
It can extract a scalable amount of private data from the private database used in RAG applications.
Our findings highlight the privacy vulnerabilities in current RAG applications and underscore the pressing need for stronger safeguards.
arXiv Detail & Related papers (2024-11-21T13:18:03Z) - Lancelot: Towards Efficient and Privacy-Preserving Byzantine-Robust Federated Learning within Fully Homomorphic Encryption [10.685816010576918]
We propose Lancelot, an innovative and computationally efficient BRFL framework that employs fully homomorphic encryption (FHE) to safeguard against malicious client activities while preserving data privacy.
Our extensive testing, which includes medical imaging diagnostics and widely-used public image datasets, demonstrates that Lancelot significantly outperforms existing methods, offering more than a twenty-fold increase in processing speed, all while maintaining data privacy.
arXiv Detail & Related papers (2024-08-12T14:48:25Z) - Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources.<n>RAG systems may face severe privacy risks when retrieving private data.<n>We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z) - Is My Data in Your Retrieval Database? Membership Inference Attacks Against Retrieval Augmented Generation [0.9217021281095907]
We introduce an efficient and easy-to-use method for conducting a Membership Inference Attack (MIA) against RAG systems.<n>We demonstrate the effectiveness of our attack using two benchmark datasets and multiple generative models.<n>Our findings highlight the importance of implementing security countermeasures in deployed RAG systems.
arXiv Detail & Related papers (2024-05-30T19:46:36Z) - The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented
Generation (RAG) [56.67603627046346]
Retrieval-augmented generation (RAG) is a powerful technique to facilitate language model with proprietary and private data.
In this work, we conduct empirical studies with novel attack methods, which demonstrate the vulnerability of RAG systems on leaking the private retrieval database.
arXiv Detail & Related papers (2024-02-23T18:35:15Z) - Free Lunch for Federated Remote Sensing Target Fine-Grained
Classification: A Parameter-Efficient Framework [23.933367972846312]
This paper proposes a novel Privacy-Reserving TFGC Framework based on Federated Learning, dubbed PRFL.
We demonstrate the effectiveness of the proposed PRFL on the classical TFGC task by leveraging four public datasets.
arXiv Detail & Related papers (2024-01-03T01:45:00Z) - Federated Learning Empowered by Generative Content [55.576885852501775]
Federated learning (FL) enables leveraging distributed private data for model training in a privacy-preserving way.
We propose a novel FL framework termed FedGC, designed to mitigate data heterogeneity issues by diversifying private data with generative content.
We conduct a systematic empirical study on FedGC, covering diverse baselines, datasets, scenarios, and modalities.
arXiv Detail & Related papers (2023-12-10T07:38:56Z) - PS-FedGAN: An Efficient Federated Learning Framework Based on Partially
Shared Generative Adversarial Networks For Data Privacy [56.347786940414935]
Federated Learning (FL) has emerged as an effective learning paradigm for distributed computation.
This work proposes a novel FL framework that requires only partial GAN model sharing.
Named as PS-FedGAN, this new framework enhances the GAN releasing and training mechanism to address heterogeneous data distributions.
arXiv Detail & Related papers (2023-05-19T05:39:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.