Synthetically Enhanced: Unveiling Synthetic Data's Potential in Medical Imaging Research
- URL: http://arxiv.org/abs/2311.09402v2
- Date: Mon, 8 Jul 2024 00:56:36 GMT
- Title: Synthetically Enhanced: Unveiling Synthetic Data's Potential in Medical Imaging Research
- Authors: Bardia Khosravi, Frank Li, Theo Dapamede, Pouria Rouzrokh, Cooper U. Gamble, Hari M. Trivedi, Cody C. Wyles, Andrew B. Sellergren, Saptarshi Purkayastha, Bradley J. Erickson, Judy W. Gichoya
- Abstract summary: Generative AI offers a promising approach to generating synthetic images, enhancing dataset diversity.
This study investigates the impact of synthetic data supplementation on the performance and generalizability of medical imaging research.
- Score: 4.475998415951477
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chest X-rays (CXR) are essential for diagnosing a variety of conditions, but when used on new populations, model generalizability issues limit their efficacy. Generative AI, particularly denoising diffusion probabilistic models (DDPMs), offers a promising approach to generating synthetic images, enhancing dataset diversity. This study investigates the impact of synthetic data supplementation on the performance and generalizability of medical imaging research. The study employed DDPMs to create synthetic CXRs conditioned on demographic and pathological characteristics from the CheXpert dataset. These synthetic images were used to supplement training datasets for pathology classifiers, with the aim of improving their performance. The evaluation involved three datasets (CheXpert, MIMIC-CXR, and Emory Chest X-ray) and various experiments, including supplementing real data with synthetic data, training with purely synthetic data, and mixing synthetic data with external datasets. Performance was assessed using the area under the receiver operating characteristic curve (AUROC). Adding synthetic data to real datasets resulted in a notable increase in AUROC values (up to 0.02 in internal and external test sets with 1000% supplementation, p-value less than 0.01 in all instances). When classifiers were trained exclusively on synthetic data, they achieved performance levels comparable to those trained on real data with 200%-300% data supplementation. The combination of real and synthetic data from different sources demonstrated enhanced model generalizability, increasing model AUROC from 0.76 to 0.80 on the internal test set (p-value less than 0.01). In conclusion, synthetic data supplementation significantly improves the performance and generalizability of pathology classifiers in medical imaging.
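The supplementation and evaluation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the helper names `supplement` and `auroc` are hypothetical, and it assumes that "N% supplementation" means adding synthetic samples equal to N% of the real training set size, which the abstract does not state explicitly. AUROC is computed here from its pairwise-ranking definition (the fraction of positive/negative pairs in which the positive case scores higher, with ties counted as half).

```python
import random


def supplement(real, synthetic, percent, seed=0):
    """Return the real samples plus a random draw of synthetic samples.

    Assumption: `percent` is relative to the size of the real set,
    so percent=200 adds twice as many synthetic samples as real ones.
    """
    n_extra = len(real) * percent // 100
    synthetic = list(synthetic)
    if n_extra > len(synthetic):
        raise ValueError("not enough synthetic samples for this percentage")
    rng = random.Random(seed)
    return list(real) + rng.sample(synthetic, n_extra)


def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    # Count pairwise wins; a tie contributes half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    train = supplement(range(10), range(100, 200), percent=200)
    print(len(train))  # 10 real + 20 synthetic samples
    print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise form is quadratic in the number of samples, so it is suited only to small illustrations; a rank-based implementation would be preferable for full test sets.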
Related papers
- Can Medical Vision-Language Pre-training Succeed with Purely Synthetic Data? [8.775988650381397]
Training medical vision-language pre-training models requires datasets with paired, high-quality image-text data.
Recent advancements in Large Language Models have made it possible to generate large-scale synthetic image-text pairs.
We propose an automated pipeline to build a diverse, high-quality synthetic dataset.
arXiv Detail & Related papers (2024-10-17T13:11:07Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Synthetic Data as Validation [9.506660694536649]
We illustrate the effectiveness of synthetic data for early cancer detection in computed tomography (CT) volumes.
We establish a new continual learning framework that continuously trains AI models on a stream of out-domain data with synthetic tumors.
The AI model trained and validated in dynamically expanding synthetic data can consistently outperform models trained and validated exclusively on real-world data.
arXiv Detail & Related papers (2023-10-24T17:59:55Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- Bridging the Gap: Enhancing the Utility of Synthetic Data via Post-Processing Techniques [7.967995669387532]
Generative models have emerged as a promising solution for generating synthetic datasets that can replace or augment real-world data.
We propose three novel post-processing techniques to improve the quality and diversity of the synthetic dataset.
Experiments show that Gap Filler (GaFi) effectively reduces the gap with real-accuracy scores to an error of 2.03%, 1.78%, and 3.99% on the Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets, respectively.
arXiv Detail & Related papers (2023-05-17T10:50:38Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Differentially Private Diffusion Models Generate Useful Synthetic Images [53.94025967603649]
Recent studies have found that, by default, the outputs of some diffusion models do not preserve training data privacy.
By privately fine-tuning ImageNet pre-trained diffusion models with more than 80M parameters, we obtain SOTA results on CIFAR-10 and Camelyon17.
Our results demonstrate that diffusion models fine-tuned with differential privacy can produce useful and provably private synthetic data.
arXiv Detail & Related papers (2023-02-27T15:02:04Z)
- Evaluation of the Synthetic Electronic Health Records [3.255030588361125]
This work outlines two metrics called Similarity and Uniqueness for sample-wise assessment of synthetic datasets.
We demonstrate the proposed notions with several state-of-the-art generative models to synthesise Cystic Fibrosis (CF) patients' electronic health records.
arXiv Detail & Related papers (2022-10-16T22:46:08Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- Deep learning-based COVID-19 pneumonia classification using chest CT images: model generalizability [54.86482395312936]
Deep learning (DL) classification models were trained to identify COVID-19-positive patients on 3D computed tomography (CT) datasets from different countries.
We trained nine identical DL-based classification models by using combinations of the datasets with a 72% train, 8% validation, and 20% test data split.
The models trained on multiple datasets and evaluated on a test set from one of the datasets used for training performed better.
arXiv Detail & Related papers (2021-02-18T21:14:52Z)
- Overcoming Barriers to Data Sharing with Medical Image Generation: A Comprehensive Evaluation [17.983449515155414]
We utilize Generative Adversarial Networks (GANs) to create derived medical imaging datasets consisting entirely of synthetic patient data.
The synthetic images ideally have, in aggregate, similar statistical properties to those of a source dataset but do not contain sensitive personal information.
We measure the synthetic image quality by the performance difference of predictive models trained on either the synthetic or the real dataset.
arXiv Detail & Related papers (2020-11-29T15:41:46Z)