Differentially Private Synthetic Text Generation for Retrieval-Augmented Generation (RAG)
- URL: http://arxiv.org/abs/2510.06719v1
- Date: Wed, 08 Oct 2025 07:15:50 GMT
- Title: Differentially Private Synthetic Text Generation for Retrieval-Augmented Generation (RAG)
- Authors: Junki Mori, Kazuya Kakizaki, Taiki Miyagawa, Jun Sakuma
- Abstract summary: We propose DP-SynRAG, a framework that uses LLMs to generate differentially private synthetic RAG databases. Unlike prior methods, the synthetic text can be reused once created, thereby avoiding repeated noise injection and additional privacy costs. Experiments show that DP-SynRAG achieves superior performance to state-of-the-art private RAG systems while maintaining a fixed privacy budget.
- Score: 13.736991294264827
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by grounding them in external knowledge. However, its application in sensitive domains is limited by privacy risks. Existing private RAG methods typically rely on query-time differential privacy (DP), which requires repeated noise injection and leads to accumulated privacy loss. To address this issue, we propose DP-SynRAG, a framework that uses LLMs to generate differentially private synthetic RAG databases. Unlike prior methods, the synthetic text can be reused once created, thereby avoiding repeated noise injection and additional privacy costs. To preserve essential information for downstream RAG tasks, DP-SynRAG extends private prediction, which instructs LLMs to generate text that mimics subsampled database records in a DP manner. Experiments show that DP-SynRAG achieves superior performance to state-of-the-art private RAG systems while maintaining a fixed privacy budget, offering a scalable solution for privacy-preserving RAG.
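The private-prediction step the abstract describes can be pictured as a noisy aggregation over subsampled records: each private record contributes a next-token vote, the votes are averaged, Gaussian noise is added, and only the noisy argmax is released. The sketch below is a minimal illustration of that pattern, not the paper's actual algorithm; the `per_record_probs` interface and the noise scale are assumptions.

```python
import numpy as np

def dp_next_token(per_record_probs, sigma=0.5, rng=None):
    """Release one token via noisy aggregation of per-record votes.

    per_record_probs: array of shape (n_records, vocab_size) holding a
    next-token distribution per subsampled private record (hypothetical
    interface, for illustration only).
    """
    rng = rng or np.random.default_rng(0)
    votes = per_record_probs.mean(axis=0)                      # aggregate votes
    noisy = votes + rng.normal(0.0, sigma, size=votes.shape)   # Gaussian noise
    return int(np.argmax(noisy))                               # only the argmax leaves
```

Because each released token depends on the private records only through the noisy aggregate, the generated synthetic text can be cached and reused without further privacy cost, which is the key contrast with query-time DP methods.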
Related papers
- DP-RFT: Learning to Generate Synthetic Text via Differentially Private Reinforcement Fine-Tuning [51.35628297101575]
Differentially private (DP) synthetic data generation plays a pivotal role in developing large language models (LLMs) on private data. We introduce Differentially Private Reinforcement Fine-Tuning (DP-RFT), an online reinforcement learning algorithm for synthetic data generation with LLMs. We evaluate DP-RFT for long-form and domain-specific synthetic data generation, such as news articles, meeting transcripts, and medical article abstracts.
arXiv Detail & Related papers (2026-02-20T22:03:56Z)
- Differentially Private Retrieval-Augmented Generation [13.622078883013442]
Retrieval-augmented generation (RAG) is a widely used framework for reducing hallucinations in large language models (LLMs). RAG poses serious privacy risks when the database contains sensitive corpora, such as medical records or legal documents. We present DP-KSA, a novel privacy-preserving RAG algorithm that integrates DP using the propose-test-release paradigm.
arXiv Detail & Related papers (2026-02-16T00:52:57Z)
- SoK: Privacy Risks and Mitigations in Retrieval-Augmented Generation Systems [53.51921540246166]
Retrieval-Augmented Generation (RAG) techniques have become widely popular. RAG involves the coupling of Large Language Models (LLMs) with domain-specific knowledge bases. The proliferation of RAG has sparked concerns about data privacy.
arXiv Detail & Related papers (2026-01-07T14:50:41Z)
- Private-RAG: Answering Multiple Queries with LLMs while Keeping Your Data Private [21.980739918403344]
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by retrieving documents from an external corpus at inference time. When this corpus contains sensitive information, unprotected RAG systems are at risk of leaking private information. In this paper, we study the more practical multi-query setting and propose two DP-RAG algorithms.
arXiv Detail & Related papers (2025-11-10T21:12:32Z)
- DP-FedLoRA: Privacy-Enhanced Federated Fine-Tuning for On-Device Large Language Models [17.265217612125905]
DP-FedLoRA is a privacy-enhanced federated fine-tuning framework. It integrates LoRA-based adaptation with differential privacy in a communication-efficient setting. We show that DP-FedLoRA delivers competitive performance while offering strong privacy guarantees.
arXiv Detail & Related papers (2025-09-11T02:16:34Z)
- Privacy-Aware Decoding: Mitigating Privacy Leakage of Large Language Models in Retrieval-Augmented Generation [26.573578326262307]
Privacy-Aware Decoding (PAD) is a lightweight, inference-time defense that adaptively injects calibrated Gaussian noise into token logits during generation. PAD integrates confidence-based screening to selectively protect high-risk tokens, efficient sensitivity estimation to minimize unnecessary noise, and context-aware noise calibration to balance privacy with generation quality. Our work takes an important step toward mitigating privacy risks in RAG via decoding strategies, paving the way for universal and scalable privacy solutions in sensitive domains.
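The decoding-time defense that PAD's abstract describes, adding calibrated Gaussian noise to token logits only at high-risk positions, can be sketched in a few lines. The `is_risky` flag stands in for the paper's confidence-based screening and the noise scale for its calibration; both are placeholders, not PAD's actual procedure.

```python
import numpy as np

def pad_step(logits, is_risky, sigma=2.0, rng=None):
    """One greedy decoding step: perturb the logits with Gaussian noise
    only when the screening step flags the position as high risk."""
    rng = rng or np.random.default_rng(0)
    if is_risky:
        logits = logits + rng.normal(0.0, sigma, size=logits.shape)
    # Pick the (possibly noised) top logit.
    return int(np.argmax(logits))
```

Restricting the noise to screened tokens is what lets such a defense trade off privacy against generation quality: low-risk tokens decode unperturbed, so fluency is mostly preserved.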
arXiv Detail & Related papers (2025-08-05T05:22:13Z)
- Privacy-Preserving Federated Embedding Learning for Localized Retrieval-Augmented Generation [60.81109086640437]
We propose a novel framework called Federated Retrieval-Augmented Generation (FedE4RAG). FedE4RAG facilitates collaborative training of client-side RAG retrieval models. We apply homomorphic encryption within federated learning to safeguard model parameters.
arXiv Detail & Related papers (2025-04-27T04:26:02Z)
- Privacy-Preserving Retrieval-Augmented Generation with Differential Privacy [25.896416088293908]
Retrieval-augmented generation (RAG) is particularly effective in assisting large language models (LLMs). However, RAG outputs risk leaking sensitive information from the external data source. We propose an algorithm that smartly spends privacy budget only for the tokens that require the sensitive information.
arXiv Detail & Related papers (2024-12-06T01:20:16Z)
- Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources. RAG systems may face severe privacy risks when retrieving private data. We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z)
- Provable Privacy with Non-Private Pre-Processing [56.770023668379615]
We propose a general framework to evaluate the additional privacy cost incurred by non-private data-dependent pre-processing algorithms.
Our framework establishes upper bounds on the overall privacy guarantees by utilising two new technical notions.
arXiv Detail & Related papers (2024-03-19T17:54:49Z)
- The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG) [56.67603627046346]
Retrieval-augmented generation (RAG) is a powerful technique for augmenting language models with proprietary and private data.
In this work, we conduct empirical studies with novel attack methods, which demonstrate the vulnerability of RAG systems to leakage of the private retrieval database.
arXiv Detail & Related papers (2024-02-23T18:35:15Z)
- Private Reinforcement Learning with PAC and Regret Guarantees [69.4202374491817]
We design privacy-preserving exploration policies for episodic reinforcement learning (RL).
We first provide a meaningful privacy formulation using the notion of joint differential privacy (JDP).
We then develop a private optimism-based learning algorithm that simultaneously achieves strong PAC and regret bounds and enjoys a JDP guarantee.
arXiv Detail & Related papers (2020-09-18T20:18:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.