Conditional Synthetic Data Generation for Robust Machine Learning
Applications with Limited Pandemic Data
- URL: http://arxiv.org/abs/2109.06486v1
- Date: Tue, 14 Sep 2021 07:30:54 GMT
- Authors: Hari Prasanna Das, Ryan Tran, Japjot Singh, Xiangyu Yue, Geoff Tison,
Alberto Sangiovanni-Vincentelli, Costas J. Spanos
- Abstract summary: We present a hybrid model consisting of a conditional generative flow and a classifier for conditional synthetic data generation.
We generate synthetic data by manipulating the local noise with fixed conditional feature representation.
We show that our method significantly outperforms existing models in both qualitative and quantitative performance.
- Score: 11.535196994689501
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: $\textbf{Background:}$ At the onset of a pandemic, such as COVID-19, data
with proper labeling/attributes corresponding to the new disease might be
unavailable or sparse. Machine Learning (ML) models trained with the available
data, which is limited in quantity and poor in diversity, will often be biased
and inaccurate. At the same time, ML algorithms designed to fight pandemics
must have good performance and be developed in a time-sensitive manner. To
tackle the challenges of limited data and label scarcity in the available
data, we propose generating conditional synthetic data, to be used alongside
real data for developing robust ML models. $\textbf{Methods:}$ We present a
hybrid model consisting of a conditional generative flow and a classifier for
conditional synthetic data generation. The classifier decouples the feature
representation for the condition, which is fed to the flow to extract the local
noise. We generate synthetic data by manipulating the local noise with fixed
conditional feature representation. We also propose a semi-supervised approach
to generate synthetic samples in the absence of labels for a majority of the
available data. $\textbf{Results:}$ We performed conditional synthetic
generation for chest computed tomography (CT) scans corresponding to normal,
COVID-19, and pneumonia-afflicted patients. We show that our method
significantly outperforms existing models in both qualitative and quantitative
performance, and our semi-supervised approach can efficiently synthesize
conditional samples under label scarcity. As an example of downstream use of
synthetic data, we show improvement in COVID-19 detection from CT scans with
conditional synthetic data augmentation.
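The generation recipe above (a classifier supplies a fixed conditional feature representation; an invertible flow maps a sample to local noise; new samples come from perturbing the noise while holding the condition fixed) can be sketched in miniature. Everything below is illustrative: an orthogonal linear map stands in for the trained flow, and the per-class feature vectors stand in for the classifier's learned representation; none of it is the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy invertible "flow": an orthogonal linear map, so the inverse is exact.
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

def flow_forward(x, cond_feat):
    # Map a sample, given its conditional feature, to local noise.
    return Q @ (x - cond_feat)

def flow_inverse(z, cond_feat):
    # Reconstruct a sample from local noise plus the conditional feature.
    return Q.T @ z + cond_feat

# Hypothetical per-class conditional features (in the paper these come from
# the trained classifier, not hand-set vectors).
cond_covid = np.ones(8)
cond_normal = -np.ones(8)

x_real = cond_covid + 0.1 * rng.normal(size=8)   # a "COVID" sample
z_local = flow_forward(x_real, cond_covid)       # extract its local noise

# Conditional synthesis: perturb the local noise, keep the condition fixed,
# then invert the flow to get a new sample of the same class.
z_new = z_local + 0.05 * rng.normal(size=8)
x_synth = flow_inverse(z_new, cond_covid)
```

Because the map is exactly invertible, reconstruction is lossless, and the synthesized sample stays near its conditioning class rather than drifting toward the other class.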
Related papers
- Maximizing the Potential of Synthetic Data: Insights from Random Matrix Theory [8.713796223707398]
We use random matrix theory to derive the performance of a binary classifier trained on a mix of real and synthetic data.
Our findings identify conditions where synthetic data could improve performance, focusing on the quality of the generative model and verification strategy.
arXiv Detail & Related papers (2024-10-11T16:09:27Z) - Improving Grammatical Error Correction via Contextual Data Augmentation [49.746484518527716]
We propose a synthetic data construction method based on contextual augmentation.
Specifically, we combine rule-based substitution with model-based generation.
We also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data.
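The rule-based substitution half of this construction can be illustrated with a minimal sketch: a correct sentence is corrupted via a confusion table to yield an (erroneous source, correct target) training pair. The rules and function names here are hypothetical stand-ins, not the paper's actual rule set.

```python
import random

# Hypothetical confusion rules: correct word -> plausible erroneous substitutes.
CONFUSION_RULES = {
    "their": ["there", "they're"],
    "affect": ["effect"],
    "has": ["have"],
}

def corrupt(sentence, rules=CONFUSION_RULES, p=1.0, seed=0):
    """Apply rule-based substitutions to a correct sentence, producing a
    (source_with_errors, target_correct) pair for GEC training."""
    rng = random.Random(seed)
    out = []
    for tok in sentence.split():
        if tok in rules and rng.random() < p:
            out.append(rng.choice(rules[tok]))
        else:
            out.append(tok)
    return " ".join(out), sentence

src, tgt = corrupt("their proposal has merit")
```

Model-based generation and relabeling-based cleaning would then operate on pairs like `(src, tgt)`; they are beyond this sketch.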
arXiv Detail & Related papers (2024-06-25T10:49:56Z) - Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z) - Beyond Model Collapse: Scaling Up with Synthesized Data Requires Verification [11.6055501181235]
We investigate the use of verification on synthesized data to prevent model collapse.
We show that verifiers, even imperfect ones, can indeed be harnessed to prevent model collapse.
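The idea of filtering synthesized data through an imperfect verifier can be sketched as follows. The generator, the outlier fraction, and the threshold verifier below are all illustrative stand-ins, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

def generator(n):
    # Stand-in generator: mostly in-distribution samples near 0,
    # plus a fraction of bad outliers that would drive collapse.
    n_good = int(0.8 * n)
    good = rng.normal(0.0, 1.0, size=n_good)
    bad = rng.normal(6.0, 1.0, size=n - n_good)
    return np.concatenate([good, bad])

def verifier(x, threshold=3.0):
    # Deliberately imperfect verifier: accept anything whose magnitude
    # looks in-distribution; it can still pass rare bad samples.
    return np.abs(x) < threshold

samples = generator(1000)
kept = samples[verifier(samples)]
# Training on `kept` rather than `samples` keeps the synthetic mean near the
# real distribution's mean instead of drifting toward the outliers.
```

Even this crude thresholding pulls the retained samples' mean back toward the real distribution, which is the intuition behind using verifiers to prevent collapse.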
arXiv Detail & Related papers (2024-06-11T17:46:16Z) - Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large
Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Can segmentation models be trained with fully synthetically generated
data? [0.39577682622066246]
BrainSPADE is a model which combines a synthetic diffusion-based label generator with a semantic image generator.
Our model can produce fully synthetic brain labels on-demand, with or without pathology of interest, and then generate a corresponding MRI image of an arbitrary guided style.
Experiments show that brainSPADE synthetic data can be used to train segmentation models with performance comparable to that of models trained on real data.
arXiv Detail & Related papers (2022-09-17T05:24:04Z) - A Kernelised Stein Statistic for Assessing Implicit Generative Models [10.616967871198689]
We propose a principled procedure to assess the quality of a synthetic data generator.
The sample size from the synthetic data generator can be as large as desired, while the size of the observed data, which the generator aims to emulate, is fixed.
arXiv Detail & Related papers (2022-05-31T23:40:21Z) - Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion, where the data diversity is explicitly modeled as an optimizable objective.
Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination.
Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
arXiv Detail & Related papers (2021-05-18T15:13:00Z) - Autoregressive Score Matching [113.4502004812927]
We propose autoregressive conditional score models (AR-CSM) where we parameterize the joint distribution in terms of the derivatives of univariate log-conditionals (scores).
For AR-CSM models, this divergence between data and model distributions can be computed and optimized efficiently, requiring no expensive sampling or adversarial training.
We show with extensive experimental results that it can be applied to density estimation on synthetic data, image generation, image denoising, and training latent variable models with implicit encoders.
arXiv Detail & Related papers (2020-10-24T07:01:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.