Related papers: Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe

Related papers

Privacy-Preserving Model Transcription with Differentially Private Synthetic Distillation [67.76456940243294]
Deep learning models trained on private datasets may pose a privacy leakage risk.<n>We present emphprivacy-preserving model transcription, a data-free model-to-model conversion solution.
arXiv Detail & Related papers (2026-01-27T01:51:35Z)
Empirical Evaluation of Structured Synthetic Data Privacy Metrics: Novel experimental framework [34.56525983543448]
Synthetic data generation is gaining traction as a privacy enhancing technology.<n>The concept of data privacy remains elusive, making it challenging for practitioners to evaluate and benchmark the degree of privacy protection offered by synthetic data.
arXiv Detail & Related papers (2025-12-18T08:09:28Z)
How to DP-fy Your Data: A Practical Guide to Generating Synthetic Data With Differential Privacy [52.00934156883483]
Differential Privacy (DP) is a framework for reasoning about and limiting information leakage.<n>Differentially Private Synthetic data refers to synthetic data that preserves the overall trends of source data.
arXiv Detail & Related papers (2025-12-02T21:14:39Z)
Privacy-Aware In-Context Learning for Large Language Models [12.605629953620495]
Large language models (LLMs) raise privacy concerns due to potential exposure of sensitive information.<n>We introduce a novel private prediction framework for generating high-quality synthetic text with strong privacy guarantees.
arXiv Detail & Related papers (2025-09-17T01:50:32Z)
Synthetic Data Generation and Differential Privacy using Tensor Networks' Matrix Product States (MPS) [33.032422801043495]
We propose a method for generating privacy-preserving high-quality synthetic data using Matrix Product States (MPS)<n>We benchmark the MPS-based generative model against state-of-the-art models such as CTGAN, VAE, and PrivBayes.<n>Our results show that MPS outperforms classical models, particularly under strict privacy constraints.
arXiv Detail & Related papers (2025-08-08T12:14:57Z)
SMOTE-DP: Improving Privacy-Utility Tradeoff with Synthetic Data [13.699107354397286]
We show that, with the right mechanism of synthetic data generation, we can achieve strong privacy protection without significant utility loss.<n>We prove in theory and through empirical demonstration that this SMOTE-DP technique can produce synthetic data that not only ensures robust privacy protection but maintains utility in downstream learning tasks.
arXiv Detail & Related papers (2025-06-02T17:27:10Z)
Synthetic Data Privacy Metrics [2.1213500139850017]
We review the pros and cons of popular metrics that include simulations of adversarial attacks. We also review current best practices for amending generative models to enhance the privacy of the data they create.
arXiv Detail & Related papers (2025-01-07T17:02:33Z)
SafeSynthDP: Leveraging Large Language Models for Privacy-Preserving Synthetic Data Generation Using Differential Privacy [0.0]
We investigate capability of Large Language Models (Ms) to generate synthetic datasets with Differential Privacy (DP) mechanisms. Our approach incorporates DP-based noise injection methods, including Laplace and Gaussian distributions, into the data generation process. We then evaluate the utility of these DP-enhanced synthetic datasets by comparing the performance of ML models trained on them against models trained on the original data.
arXiv Detail & Related papers (2024-12-30T01:10:10Z)
Evaluating Differentially Private Synthetic Data Generation in High-Stakes Domains [9.123834467375532]
We explore the feasibility of using synthetic data generated from differentially private language models in place of real data to facilitate the development of NLP in high-stakes domains. Our results show that prior simplistic evaluations have failed to highlight utility, privacy, and fairness issues in the synthetic data.
arXiv Detail & Related papers (2024-10-10T19:31:02Z)
Private prediction for large-scale synthetic text generation [28.488459921169905]
We present an approach for generating differentially private synthetic text using large language models (LLMs) In the private prediction framework, we only require the output synthetic data to satisfy differential privacy guarantees.
arXiv Detail & Related papers (2024-07-16T18:28:40Z)
Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources. RAG systems may face severe privacy risks when retrieving private data. We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z)
NAP^2: A Benchmark for Naturalness and Privacy-Preserving Text Rewriting by Learning from Human [55.20137833039499]
We suggest sanitizing sensitive text using two common strategies used by humans. We curate the first corpus, coined NAP2, through both crowdsourcing and the use of large language models.
arXiv Detail & Related papers (2024-06-06T05:07:44Z)
FewFedPIT: Towards Privacy-preserving and Few-shot Federated Instruction Tuning [54.26614091429253]
Federated instruction tuning (FedIT) is a promising solution, by consolidating collaborative training across multiple data owners. FedIT encounters limitations such as scarcity of instructional data and risk of exposure to training data extraction attacks. We propose FewFedPIT, designed to simultaneously enhance privacy protection and model performance of federated few-shot learning.
arXiv Detail & Related papers (2024-03-10T08:41:22Z)
Practical considerations on using private sampling for synthetic data [1.3654846342364308]
Differential privacy for synthetic data generation has received much attention due to the ability of preserving privacy while freely using the synthetic data. Private sampling is the first noise-free method to construct differentially private synthetic data with rigorous bounds for privacy and accuracy. We provide an implementation of the private sampling algorithm and discuss the realism of its constraints in practical cases.
arXiv Detail & Related papers (2023-12-12T10:20:04Z)
PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind) Our work offers a theoretical analysis for model design and benchmarks various techniques. In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z)
Private Set Generation with Discriminative Information [63.851085173614]
Differentially private data generation is a promising solution to the data privacy challenge. Existing private generative models are struggling with the utility of synthetic samples. We introduce a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-07T10:02:55Z)
Differentially Private Language Models for Secure Data Sharing [19.918137395199224]
In this paper, we show how to train a generative language model in a differentially private manner and consequently sampling data from it. Using natural language prompts and a new prompt-mismatch loss, we are able to create highly accurate and fluent textual datasets. We perform thorough experiments indicating that our synthetic datasets do not leak information from our original data and are of high language quality.
arXiv Detail & Related papers (2022-10-25T11:12:56Z)
Just Fine-tune Twice: Selective Differential Privacy for Large Language Models [69.66654761324702]
We propose a simple yet effective just-fine-tune-twice privacy mechanism to achieve SDP for large Transformer-based language models. Experiments show that our models achieve strong performance while staying robust to the canary insertion attack.
arXiv Detail & Related papers (2022-04-15T22:36:55Z)
PEARL: Data Synthesis via Private Embeddings and Adversarial Reconstruction Learning [1.8692254863855962]
We propose a new framework of data using deep generative models in a differentially private manner. Within our framework, sensitive data are sanitized with rigorous privacy guarantees in a one-shot fashion. Our proposal has theoretical guarantees of performance, and empirical evaluations on multiple datasets show that our approach outperforms other methods at reasonable levels of privacy.
arXiv Detail & Related papers (2021-06-08T18:00:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.