Synthetic Data Can Mislead Evaluations: Membership Inference as Machine Text Detection
- URL: http://arxiv.org/abs/2501.11786v1
- Date: Mon, 20 Jan 2025 23:19:15 GMT
- Title: Synthetic Data Can Mislead Evaluations: Membership Inference as Machine Text Detection
- Authors: Ali Naseh, Niloofar Mireshghallah,
- Abstract summary: Using synthetic data in membership evaluations may lead to false conclusions about model memorization and data leakage.
This issue could affect other evaluations using model signals such as loss where synthetic or machine-generated translated data substitutes for real-world samples.
- Score: 1.03590082373586
- License:
- Abstract: Recent work shows membership inference attacks (MIAs) on large language models (LLMs) produce inconclusive results, partly due to difficulties in creating non-member datasets without temporal shifts. While researchers have turned to synthetic data as an alternative, we show this approach can be fundamentally misleading. Our experiments indicate that MIAs function as machine-generated text detectors, incorrectly identifying synthetic data as training samples regardless of the data source. This behavior persists across different model architectures and sizes, from open-source models to commercial ones such as GPT-3.5. Even synthetic text generated by different, potentially larger models is classified as training data by the target model. Our findings highlight a serious concern: using synthetic data in membership evaluations may lead to false conclusions about model memorization and data leakage. We caution that this issue could affect other evaluations using model signals such as loss where synthetic or machine-generated translated data substitutes for real-world samples.
Related papers
- How to Synthesize Text Data without Model Collapse? [37.219627817995054]
Model collapse in synthetic data indicates that iterative training on self-generated data leads to a gradual decline in performance.
We propose token editing on human-produced data to obtain semi-synthetic data.
arXiv Detail & Related papers (2024-12-19T09:43:39Z) - Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World [19.266191284270793]
generative machine learning models are pretrained on web-scale datasets containing data generated by earlier models.
Some prior work warns of "model collapse" as the web is overwhelmed by synthetic data.
We report experiments on three ways of using data (training-workflows) across three generative model task-settings.
arXiv Detail & Related papers (2024-10-22T05:49:24Z) - Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs)
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z) - Beyond Model Collapse: Scaling Up with Synthesized Data Requires Verification [11.6055501181235]
We investigate the use of verification on synthesized data to prevent model collapse.
We show that verifiers, even imperfect ones, can indeed be harnessed to prevent model collapse.
arXiv Detail & Related papers (2024-06-11T17:46:16Z) - Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z) - Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large
Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Membership Inference Attacks against Synthetic Data through Overfitting
Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution.
We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model.
arXiv Detail & Related papers (2023-02-24T11:27:39Z) - A Kernelised Stein Statistic for Assessing Implicit Generative Models [10.616967871198689]
We propose a principled procedure to assess the quality of a synthetic data generator.
The sample size from the synthetic data generator can be as large as desired, while the size of the observed data, which the generator aims to emulate is fixed.
arXiv Detail & Related papers (2022-05-31T23:40:21Z) - Advancing Semi-Supervised Learning for Automatic Post-Editing: Data-Synthesis by Mask-Infilling with Erroneous Terms [5.366354612549173]
We focus on data-synthesis methods to create high-quality synthetic data.
We present a data-synthesis method by which the resulting synthetic data mimic the translation errors found in actual data.
Experimental results show that using the synthetic data created by our approach results in significantly better APE performance than other synthetic data created by existing methods.
arXiv Detail & Related papers (2022-04-08T07:48:57Z) - Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion, where the data diversity is explicitly modeled as an optimizable objective.
Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination.
Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
arXiv Detail & Related papers (2021-05-18T15:13:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.