Conditional Synthetic Data Generation for Robust Machine Learning
Applications with Limited Pandemic Data
- URL: http://arxiv.org/abs/2109.06486v1
- Date: Tue, 14 Sep 2021 07:30:54 GMT
- Authors: Hari Prasanna Das, Ryan Tran, Japjot Singh, Xiangyu Yue, Geoff Tison,
Alberto Sangiovanni-Vincentelli, Costas J. Spanos
- Abstract summary: We present a hybrid model consisting of a conditional generative flow and a classifier for conditional synthetic data generation.
We generate synthetic data by manipulating the local noise with fixed conditional feature representation.
We show that our method significantly outperforms existing models in both qualitative and quantitative performance.
- Score: 11.535196994689501
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: $\textbf{Background:}$ At the onset of a pandemic, such as COVID-19, data
with proper labeling/attributes corresponding to the new disease might be
unavailable or sparse. Machine Learning (ML) models trained with the available
data, which is limited in quantity and poor in diversity, will often be biased
and inaccurate. At the same time, ML algorithms designed to fight pandemics
must have good performance and be developed in a time-sensitive manner. To
tackle the challenges of limited data and label scarcity in the available
data, we propose generating conditional synthetic data, to be used alongside
real data for developing robust ML models. $\textbf{Methods:}$ We present a
hybrid model consisting of a conditional generative flow and a classifier for
conditional synthetic data generation. The classifier decouples the feature
representation for the condition, which is fed to the flow to extract the local
noise. We generate synthetic data by manipulating the local noise with fixed
conditional feature representation. We also propose a semi-supervised approach
to generate synthetic samples in the absence of labels for a majority of the
available data. $\textbf{Results:}$ We performed conditional synthetic
generation for chest computed tomography (CT) scans corresponding to normal,
COVID-19, and pneumonia-afflicted patients. We show that our method
significantly outperforms existing models in both qualitative and quantitative
performance, and our semi-supervised approach can efficiently synthesize
conditional samples under label scarcity. As an example of downstream use of
synthetic data, we show improvement in COVID-19 detection from CT scans with
conditional synthetic data augmentation.
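The generation recipe above (a classifier supplies a fixed conditional feature representation; an invertible flow maps a sample to local noise; new samples come from perturbing the noise while holding the condition fixed) can be sketched in miniature. Everything below is illustrative: an orthogonal linear map stands in for the trained flow, and the per-class feature vectors stand in for the classifier's learned representation; none of it is the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy invertible "flow": an orthogonal linear map, so the inverse is exact.
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

def flow_forward(x, cond_feat):
    # Map a sample, given its conditional feature, to local noise.
    return Q @ (x - cond_feat)

def flow_inverse(z, cond_feat):
    # Reconstruct a sample from local noise plus the conditional feature.
    return Q.T @ z + cond_feat

# Hypothetical per-class conditional features (in the paper these come from
# the trained classifier, not hand-set vectors).
cond_covid = np.ones(8)
cond_normal = -np.ones(8)

x_real = cond_covid + 0.1 * rng.normal(size=8)   # a "COVID" sample
z_local = flow_forward(x_real, cond_covid)       # extract its local noise

# Conditional synthesis: perturb the local noise, keep the condition fixed,
# then invert the flow to get a new sample of the same class.
z_new = z_local + 0.05 * rng.normal(size=8)
x_synth = flow_inverse(z_new, cond_covid)
```

Because the map is exactly invertible, reconstruction is lossless, and the synthesized sample stays near its conditioning class rather than drifting toward the other class.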
Related papers
- Maximizing the Potential of Synthetic Data: Insights from Random Matrix Theory [8.713796223707398]
We use random matrix theory to derive the performance of a binary classifier trained on a mix of real and synthetic data.
Our findings identify conditions where synthetic data could improve performance, focusing on the quality of the generative model and verification strategy.
arXiv Detail & Related papers (2024-10-11T16:09:27Z) - Improving Grammatical Error Correction via Contextual Data Augmentation [49.746484518527716]
We propose a synthetic data construction method based on contextual augmentation.
Specifically, we combine rule-based substitution with model-based generation.
We also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data.
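The rule-based substitution half of this construction can be illustrated with a minimal sketch: a correct sentence is corrupted via a confusion table to yield an (erroneous source, correct target) training pair. The rules and function names here are hypothetical stand-ins, not the paper's actual rule set.

```python
import random

# Hypothetical confusion rules: correct word -> plausible erroneous substitutes.
CONFUSION_RULES = {
    "their": ["there", "they're"],
    "affect": ["effect"],
    "has": ["have"],
}

def corrupt(sentence, rules=CONFUSION_RULES, p=1.0, seed=0):
    """Apply rule-based substitutions to a correct sentence, producing a
    (source_with_errors, target_correct) pair for GEC training."""
    rng = random.Random(seed)
    out = []
    for tok in sentence.split():
        if tok in rules and rng.random() < p:
            out.append(rng.choice(rules[tok]))
        else:
            out.append(tok)
    return " ".join(out), sentence

src, tgt = corrupt("their proposal has merit")
```

Model-based generation and relabeling-based cleaning would then operate on pairs like `(src, tgt)`; they are beyond this sketch.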
arXiv Detail & Related papers (2024-06-25T10:49:56Z) - Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z) - Beyond Model Collapse: Scaling Up with Synthesized Data Requires Verification [11.6055501181235]
We investigate the use of verification on synthesized data to prevent model collapse.
We show that verifiers, even imperfect ones, can indeed be harnessed to prevent model collapse.
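The idea of filtering synthesized data through an imperfect verifier can be sketched as follows. The generator, the outlier fraction, and the threshold verifier below are all illustrative stand-ins, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

def generator(n):
    # Stand-in generator: mostly in-distribution samples near 0,
    # plus a fraction of bad outliers that would drive collapse.
    n_good = int(0.8 * n)
    good = rng.normal(0.0, 1.0, size=n_good)
    bad = rng.normal(6.0, 1.0, size=n - n_good)
    return np.concatenate([good, bad])

def verifier(x, threshold=3.0):
    # Deliberately imperfect verifier: accept anything whose magnitude
    # looks in-distribution; it can still pass rare bad samples.
    return np.abs(x) < threshold

samples = generator(1000)
kept = samples[verifier(samples)]
# Training on `kept` rather than `samples` keeps the synthetic mean near the
# real distribution's mean instead of drifting toward the outliers.
```

Even this crude thresholding pulls the retained samples' mean back toward the real distribution, which is the intuition behind using verifiers to prevent collapse.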
arXiv Detail & Related papers (2024-06-11T17:46:16Z) - Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large
Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Can segmentation models be trained with fully synthetically generated
data? [0.39577682622066246]
BrainSPADE is a model which combines a synthetic diffusion-based label generator with a semantic image generator.
Our model can produce fully synthetic brain labels on-demand, with or without pathology of interest, and then generate a corresponding MRI image of an arbitrary guided style.
Experiments show that brainSPADE synthetic data can be used to train segmentation models with performance comparable to that of models trained on real data.
arXiv Detail & Related papers (2022-09-17T05:24:04Z) - A Kernelised Stein Statistic for Assessing Implicit Generative Models [10.616967871198689]
We propose a principled procedure to assess the quality of a synthetic data generator.
The sample size from the synthetic data generator can be as large as desired, while the size of the observed data, which the generator aims to emulate, is fixed.
arXiv Detail & Related papers (2022-05-31T23:40:21Z) - Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion, where the data diversity is explicitly modeled as an optimizable objective.
Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination.
Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
arXiv Detail & Related papers (2021-05-18T15:13:00Z) - Autoregressive Score Matching [113.4502004812927]
We propose autoregressive conditional score models (AR-CSM) where we parameterize the joint distribution in terms of the derivatives of univariate log-conditionals (scores).
For AR-CSM models, this divergence between data and model distributions can be computed and optimized efficiently, requiring no expensive sampling or adversarial training.
We show with extensive experimental results that it can be applied to density estimation on synthetic data, image generation, image denoising, and training latent variable models with implicit encoders.
arXiv Detail & Related papers (2020-10-24T07:01:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.