The Canary's Echo: Auditing Privacy Risks of LLM-Generated Synthetic Text
- URL: http://arxiv.org/abs/2502.14921v1
- Date: Wed, 19 Feb 2025 15:30:30 GMT
- Title: The Canary's Echo: Auditing Privacy Risks of LLM-Generated Synthetic Text
- Authors: Matthieu Meeus, Lukas Wutschitz, Santiago Zanella-Béguelin, Shruti Tople, Reza Shokri
- Abstract summary: We design membership inference attacks (MIAs) that target data used to fine-tune pre-trained Large Language Models (LLMs). We show that such data-based MIAs do significantly better than a random guess, meaning that synthetic data leaks information about the training data. To tackle this problem, we leverage the mechanics of auto-regressive models to design canaries with an in-distribution prefix and a high-perplexity suffix.
- Score: 23.412546862849396
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: How much information about training samples can be gleaned from synthetic data generated by Large Language Models (LLMs)? Overlooking the subtleties of information flow in synthetic data generation pipelines can lead to a false sense of privacy. In this paper, we design membership inference attacks (MIAs) that target data used to fine-tune pre-trained LLMs that are then used to synthesize data, particularly when the adversary does not have access to the fine-tuned model but only to the synthetic data. We show that such data-based MIAs do significantly better than a random guess, meaning that synthetic data leaks information about the training data. Further, we find that canaries crafted to maximize vulnerability to model-based MIAs are sub-optimal for privacy auditing when only synthetic data is released. Such out-of-distribution canaries have limited influence on the model's output when prompted to generate useful, in-distribution synthetic data, which drastically reduces their vulnerability. To tackle this problem, we leverage the mechanics of auto-regressive models to design canaries with an in-distribution prefix and a high-perplexity suffix that leave detectable traces in synthetic data. This enhances the power of data-based MIAs and provides a better assessment of the privacy risks of releasing synthetic data generated by LLMs.
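To make the setting concrete, below is a minimal sketch of the kind of data-based MIA the abstract describes: a canary is built from an in-distribution prefix plus a high-perplexity suffix, and membership is scored using only the released synthetic corpus. The random-word suffix, the n-gram overlap statistic, and all function names are illustrative assumptions rather than the paper's actual canary design or attack statistic.

```python
# Illustrative sketch only -- NOT the paper's implementation. The suffix
# construction and the n-gram overlap score are simplified stand-ins for the
# canary design and membership signal described in the abstract.
import random


def make_canary(prefix: str, vocab: list[str], suffix_len: int = 8, seed: int = 0) -> str:
    """In-distribution prefix followed by a pseudo-random (high-perplexity) suffix."""
    rng = random.Random(seed)
    suffix = " ".join(rng.choice(vocab) for _ in range(suffix_len))
    return f"{prefix} {suffix}"


def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """All length-n token windows of a document."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def membership_score(canary: str, synthetic_corpus: list[str], n: int = 2) -> float:
    """Fraction of the canary's n-grams echoed anywhere in the synthetic corpus.

    Canaries present in the fine-tuning data should score higher on average
    than held-out canaries; the gap over many canaries quantifies leakage.
    """
    canary_grams = ngrams(canary.split(), n)
    if not canary_grams:
        return 0.0
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in synthetic_corpus:
        corpus_grams |= ngrams(doc.split(), n)
    return sum(g in corpus_grams for g in canary_grams) / len(canary_grams)
```

In practice, such a score would be computed for many inserted and held-out canaries and summarized as an ROC curve (e.g., true-positive rate at a low false-positive rate) rather than read off a single value.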
Related papers
- Synthetic Data Can Mislead Evaluations: Membership Inference as Machine Text Detection [1.03590082373586]
Using synthetic data in membership evaluations may lead to false conclusions about model memorization and data leakage. This issue could affect other evaluations using model signals such as loss, where synthetic or machine-generated translated data substitutes for real-world samples.
arXiv Detail & Related papers (2025-01-20T23:19:15Z)
- SafeSynthDP: Leveraging Large Language Models for Privacy-Preserving Synthetic Data Generation Using Differential Privacy [0.0]
We investigate the capability of Large Language Models (LLMs) to generate synthetic datasets with Differential Privacy (DP) mechanisms. Our approach incorporates DP-based noise injection methods, including Laplace and Gaussian distributions, into the data generation process; a minimal sketch of Laplace-style noise injection appears after this list. We then evaluate the utility of these DP-enhanced synthetic datasets by comparing the performance of ML models trained on them against models trained on the original data.
arXiv Detail & Related papers (2024-12-30T01:10:10Z)
- Little Giants: Synthesizing High-Quality Embedding Data at Scale [71.352883755806]
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data.
SPEED uses less than 1/10 of the GPT API calls while outperforming the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z)
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Synthetic is all you need: removing the auxiliary data assumption for membership inference attacks against synthetic data [9.061271587514215]
We show how this assumption can be removed, allowing for MIAs to be performed using only the synthetic data.
Our results show that MIAs are still successful, across two real-world datasets and two synthetic data generators.
arXiv Detail & Related papers (2023-07-04T13:16:03Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
- Membership Inference Attacks against Synthetic Data through Overfitting Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution.
We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model.
arXiv Detail & Related papers (2023-02-24T11:27:39Z)
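As a rough illustration of the density-based idea in the DOMIAS entry above, the sketch below scores membership as the ratio between a density estimate fitted on the synthetic data and one fitted on reference data, so records lying in locally overfitted regions of the generator score high. The Gaussian KDE choice and every name here are assumptions for illustration, not the authors' implementation.

```python
# Illustrative density-ratio membership score in the spirit of DOMIAS (not the
# authors' code): a target record is suspicious when the synthetic-data density
# around it exceeds the reference (population) density, i.e. local overfitting.
import numpy as np
from scipy.stats import gaussian_kde


def density_ratio_scores(targets: np.ndarray,
                         synthetic: np.ndarray,
                         reference: np.ndarray) -> np.ndarray:
    """Return p_synthetic(x) / p_reference(x) for each row of `targets`.

    All arrays have shape (n_samples, n_features); higher scores suggest membership.
    """
    p_syn = gaussian_kde(synthetic.T)   # density fitted on synthetic records
    p_ref = gaussian_kde(reference.T)   # density fitted on reference records
    eps = 1e-12                         # avoid division by zero in sparse regions
    return p_syn(targets.T) / (p_ref(targets.T) + eps)


# Usage: rank candidate records by score, or threshold (e.g. at 1.0); the
# top-ranked records are predicted members of the generator's training set.
```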
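Similarly, the SafeSynthDP entry above mentions Laplace and Gaussian noise injection; the snippet below is a minimal sketch of the standard Laplace mechanism applied to a numeric statistic before it is used in a generation step. The sensitivity, epsilon, and the point at which noise is injected are illustrative assumptions, not that paper's pipeline.

```python
# Minimal sketch of the Laplace mechanism for epsilon-DP, applied to a numeric
# statistic before it feeds a synthetic-data generation step. Illustrative only:
# the sensitivity, epsilon, and injection point are assumptions, not SafeSynthDP.
from typing import Optional

import numpy as np


def laplace_mechanism(value: float, sensitivity: float, epsilon: float,
                      rng: Optional[np.random.Generator] = None) -> float:
    """Add Laplace(0, sensitivity / epsilon) noise to `value`."""
    rng = rng or np.random.default_rng()
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)


# Example: privatize a record count (sensitivity 1) with epsilon = 1.0 before
# exposing it to the prompt or sampling budget of the generator.
noisy_count = laplace_mechanism(120.0, sensitivity=1.0, epsilon=1.0)
```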