Synthetic data, real errors: how (not) to publish and use synthetic data
- URL: http://arxiv.org/abs/2305.09235v2
- Date: Sat, 8 Jul 2023 09:15:05 GMT
- Title: Synthetic data, real errors: how (not) to publish and use synthetic data
- Authors: Boris van Breugel, Zhaozhi Qian, Mihaela van der Schaar
- Abstract summary: We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
- Score: 86.65594304109567
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating synthetic data through generative models is gaining interest in
the ML community and beyond, promising a future where datasets can be tailored
to individual needs. Unfortunately, synthetic data is usually not perfect,
resulting in potential errors in downstream tasks. In this work we explore how
the generative process affects the downstream ML task. We show that the naive
synthetic data approach -- using synthetic data as if it is real -- leads to
downstream models and analyses that do not generalize well to real data. As a
first step towards better ML in the synthetic data regime, we introduce Deep
Generative Ensemble (DGE) -- a framework inspired by Deep Ensembles that aims
to implicitly approximate the posterior distribution over the generative
process model parameters. DGE improves downstream model training, evaluation,
and uncertainty quantification, vastly outperforming the naive approach on
average. The largest improvements are achieved for minority classes and
low-density regions of the original data, for which the generative uncertainty
is largest.
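To make the DGE recipe concrete, here is a minimal sketch under simplifying assumptions: K generative models trained with different random seeds stand in for samples from the posterior over generative-process parameters; each member produces one synthetic dataset and one downstream model, and predictions are aggregated across members. GaussianMixture and LogisticRegression are placeholder models of our choosing (the paper targets deep generative models), and binary 0/1 labels are assumed.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

def dge_predict(X_real, y_real, X_test, K=5, n_syn=2000):
    """Deep-ensemble-style aggregation over K generative models."""
    probs = []
    for seed in range(K):
        # 1) Fit one generative model per ensemble member (different seed).
        gen = GaussianMixture(n_components=10, random_state=seed)
        gen.fit(np.column_stack([X_real, y_real]))
        # 2) Sample a synthetic dataset from this member.
        syn, _ = gen.sample(n_syn)
        X_syn = syn[:, :-1]
        y_syn = (syn[:, -1] > 0.5).astype(int)  # assumes 0/1 labels
        # 3) Train the downstream model on synthetic data only.
        clf = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)
        probs.append(clf.predict_proba(X_test)[:, 1])
    probs = np.stack(probs)
    # 4) Mean prediction; the spread across members reflects generative
    #    uncertainty, largest in low-density regions and minority classes.
    return probs.mean(axis=0), probs.std(axis=0)
```

Training each downstream model on a different member's synthetic dataset, rather than ensembling on a single synthetic dataset, is what lets the spread capture uncertainty in the generative process itself.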
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but they do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
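The summary above leaves the three improvements unspecified, so the following is only a generic sketch of class-conditioned LLM tabular generation: serialize labeled rows as CSV, prompt for more rows of the same class (one simple way to preserve the feature-class correlation the summary says naive generation loses), and parse defensively. `llm_complete` is a hypothetical text-completion callable, not a real API.

```python
import csv, io

def make_prompt(rows, columns, target_class):
    header = ",".join(columns)
    examples = "\n".join(",".join(map(str, r)) for r in rows)
    return (
        f"Below are CSV rows ({header}) with label={target_class}.\n"
        f"{examples}\n"
        "Generate 5 more realistic rows with the same label:\n"
    )

def generate_rows(llm_complete, rows, columns, target_class):
    # llm_complete is a hypothetical stand-in for any completion API.
    text = llm_complete(make_prompt(rows, columns, target_class))
    # Parse whatever the model returned as CSV; drop malformed rows.
    parsed = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) == len(columns):
            parsed.append(row)
    return parsed
```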
- Little Giants: Synthesizing High-Quality Embedding Data at Scale [71.352883755806]
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data.
SPEED uses less than 1/10 of the GPT API calls, yet outperforms the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z)
- Self-Improving Diffusion Models with Synthetic Data [12.597035060380001]
Self-IMproving diffusion models with Synthetic data (SIMS) is a new training concept for diffusion models.
SIMS uses self-synthesized data to provide negative guidance during the generation process.
It is the first prophylactic generative AI algorithm that can be iteratively trained on self-generated synthetic data without going MAD (model autophagy disorder).
arXiv Detail & Related papers (2024-08-29T08:12:18Z)
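One plausible reading of the negative-guidance idea above, sketched below: combine the base model's noise prediction with that of an auxiliary model trained on self-synthesized data, pushing the sampler away from the latter. The guidance rule is our reconstruction from the summary, not the paper's verbatim formula.

```python
def guided_noise(eps_base, eps_synth, w=1.5):
    """Push the denoising direction away from the prediction of a model
    trained on self-generated data (negative guidance, our reading)."""
    return eps_base + w * (eps_base - eps_synth)

# At each denoising step t (schematic, hypothetical model handles):
#   eps_base  = base_model(x_t, t)    # trained on real data
#   eps_synth = synth_model(x_t, t)   # trained on self-synthesized data
#   x_prev    = ddpm_step(x_t, guided_noise(eps_base, eps_synth))
```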
- Improving Grammatical Error Correction via Contextual Data Augmentation [49.746484518527716]
We propose a synthetic data construction method based on contextual augmentation.
Specifically, we combine rule-based substitution with model-based generation.
We also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data.
arXiv Detail & Related papers (2024-06-25T10:49:56Z)
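A toy illustration of the rule-based substitution half of the recipe above: corrupt clean sentences with confusion-pair rules so that (erroneous input, clean target) pairs can supervise a GEC model. The rules and probability here are illustrative only, not the paper's.

```python
import random

# Illustrative confusion pairs; real rule sets are far larger.
RULES = [
    ("their", "there"),
    ("its", "it's"),
    ("affect", "effect"),
]

def inject_errors(sentence, p=0.5, rng=random):
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        for correct, wrong in RULES:
            if tok == correct and rng.random() < p:
                tokens[i] = wrong  # corrupt the source token
    # (erroneous input, clean label) is a standard GEC training pair.
    return " ".join(tokens), sentence
```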
- Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences [20.629333587044012]
We study the impact of data curation on iterated retraining of generative models.
We prove that, if the data is curated according to a reward model, the expected reward of the iterative retraining procedure is maximized.
arXiv Detail & Related papers (2024-06-12T21:28:28Z)
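The curation loop this result describes can be sketched in a few lines: at each round, keep only the generated samples a reward model scores highest, then retrain on them. `generator` and `reward_model` are hypothetical stand-ins, not the paper's code.

```python
def curated_retraining(generator, reward_model, rounds=3, n=10_000, top_frac=0.2):
    for _ in range(rounds):
        samples = generator.sample(n)
        # Curate: rank self-generated samples by the reward model.
        scored = sorted(samples, key=reward_model.score, reverse=True)
        curated = scored[: int(top_frac * n)]
        # The next generation is trained only on curated data, which is
        # what drives the expected reward up across iterations.
        generator.train(curated)
    return generator
```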
- Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks the distribution gap between synthetic and real data.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
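A rough outline of an S3-style loop, as we read the summary above: seed the dataset with LLM synthesis, train the small model, then synthesize new examples that extrapolate from the small model's errors on a small real validation set. `llm_synthesize` and `SmallModel` are hypothetical stand-ins.

```python
def s3_loop(llm_synthesize, SmallModel, val_set, rounds=3):
    # Round 0: seed data synthesized by the large model.
    data = llm_synthesize(topic="seed examples")
    model = SmallModel().fit(data)
    for _ in range(rounds):
        # Collect the small model's mistakes on a tiny real set.
        errors = [(x, y) for x, y in val_set if model.predict(x) != y]
        # Extrapolate from errors: ask the LLM for similar, harder cases,
        # shrinking the synthetic-to-real distribution gap round by round.
        data += llm_synthesize(examples=errors)
        model = SmallModel().fit(data)
    return model
```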
- Regularizing Neural Networks with Meta-Learning Generative Models [40.45689466486025]
We present a novel strategy for generative data augmentation called meta generative regularization (MGR).
MGR utilizes synthetic samples in a regularization term for feature extractors instead of in the task loss (e.g., cross-entropy).
Experiments on six datasets showed that MGR is particularly effective when datasets are small and that it stably outperforms baselines.
arXiv Detail & Related papers (2023-07-26T01:47:49Z)
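A hedged PyTorch sketch of the split MGR makes: real data drives the cross-entropy task loss, while synthetic samples enter only a feature-level regularization term. The consistency-style regularizer below is a placeholder of ours, not the paper's exact term.

```python
import torch
import torch.nn.functional as F

def mgr_style_loss(feature_extractor, head, x_real, y_real, x_syn, augment, lam=0.1):
    # Task loss uses real data only.
    task = F.cross_entropy(head(feature_extractor(x_real)), y_real)
    # Synthetic data regularizes features, never the classifier head.
    # Placeholder regularizer: features of two stochastic augmentations
    # of a synthetic sample should agree (assumes `augment` is random).
    z1 = feature_extractor(augment(x_syn))
    z2 = feature_extractor(augment(x_syn))
    reg = F.mse_loss(z1, z2)
    return task + lam * reg
```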
- Copula Flows for Synthetic Data Generation [0.5801044612920815]
We propose to use a probabilistic model as a synthetic data generator.
We benchmark our method on both simulated and real datasets in terms of density estimation.
arXiv Detail & Related papers (2021-01-03T10:06:23Z)
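The paper learns the copula with normalizing flows; as a simplified stand-in, the sketch below uses a Gaussian copula with empirical marginals to illustrate the underlying split into marginals plus dependence structure.

```python
import numpy as np
from scipy import stats

def fit_sample_gaussian_copula(X, n_samples, rng=None):
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    # 1) Map each column to uniforms via its empirical CDF (the marginals).
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)
    U = (ranks + 1) / (n + 1)
    # 2) Fit the dependence structure as a Gaussian copula correlation.
    Z = stats.norm.ppf(U)
    corr = np.corrcoef(Z, rowvar=False)
    # 3) Sample: correlated Gaussians -> uniforms -> empirical quantiles.
    Zs = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    Us = stats.norm.cdf(Zs)
    X_sorted = np.sort(X, axis=0)
    idx = np.clip((Us * n).astype(int), 0, n - 1)
    return np.take_along_axis(X_sorted, idx, axis=0)
```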