Can Medical Vision-Language Pre-training Succeed with Purely Synthetic Data?
- URL: http://arxiv.org/abs/2410.13523v1
- Date: Thu, 17 Oct 2024 13:11:07 GMT
- Title: Can Medical Vision-Language Pre-training Succeed with Purely Synthetic Data?
- Authors: Che Liu, Zhongwei Wan, Haozhe Wang, Yinda Chen, Talha Qaiser, Chen Jin, Fariba Yousefi, Nikolay Burlutskiy, Rossella Arcucci
- Abstract summary: Training medical vision-language pre-training models requires datasets with paired, high-quality image-text data.
Recent advancements in Large Language Models have made it possible to generate large-scale synthetic image-text pairs.
We propose an automated pipeline to build a diverse, high-quality synthetic dataset.
- Score: 8.775988650381397
- Abstract: Medical Vision-Language Pre-training (MedVLP) has made significant progress in enabling zero-shot tasks for medical image understanding. However, training MedVLP models typically requires large-scale datasets with paired, high-quality image-text data, which are scarce in the medical domain. Recent advancements in Large Language Models (LLMs) and diffusion models have made it possible to generate large-scale synthetic image-text pairs. This raises the question: *Can MedVLP succeed using purely synthetic data?* To address this, we use off-the-shelf generative models to create synthetic radiology reports and paired Chest X-ray (CXR) images, and propose an automated pipeline to build a diverse, high-quality synthetic dataset, enabling a rigorous study that isolates model and training settings and focuses entirely on the data perspective. Our results show that MedVLP models trained *exclusively on synthetic data* outperform those trained on real data by **3.8%** in averaged AUC on zero-shot classification. Moreover, using a combination of synthetic and real data leads to a further improvement of **9.07%**. Additionally, MedVLP models trained on synthetic or mixed data consistently outperform those trained on real data in zero-shot grounding, as well as in fine-tuned classification and segmentation tasks. Our analysis suggests MedVLP trained on well-designed synthetic data can outperform models trained on real datasets, which may be limited by low-quality samples and long-tailed distributions.
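The zero-shot results above follow the standard CLIP-style recipe: a dual encoder maps images and text into a shared embedding space, and classification reduces to ranking one text prompt per class by cosine similarity with the image embedding. The sketch below illustrates that recipe only; it is not the authors' released code, and the encoder and tokenizer interfaces, the prompt template, and the class names are assumptions made for illustration.

```python
# Minimal sketch of CLIP-style zero-shot CXR classification.
# NOT the paper's code: the encoder/tokenizer interfaces, the prompt
# template, and the class names are illustrative assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_encoder, text_encoder, tokenizer, images, class_names):
    """Return per-class probabilities for a batch of chest X-rays by
    ranking one text prompt per class against each image embedding."""
    # One natural-language prompt per pathology (assumed template).
    prompts = [f"a chest x-ray showing {name}" for name in class_names]
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")

    # Embed both modalities and L2-normalize into the shared space.
    img_emb = F.normalize(image_encoder(images), dim=-1)   # (B, D)
    txt_emb = F.normalize(text_encoder(**tokens), dim=-1)  # (C, D)

    # Cosine-similarity logits; softmax gives class probabilities.
    return (img_emb @ txt_emb.T).softmax(dim=-1)           # (B, C)

# Example call (encoders and tokenizer come from whichever MedVLP model is used):
# probs = zero_shot_classify(img_enc, txt_enc, tok, batch, ["pneumonia", "edema"])
```

For multi-label findings, a common variant pairs each class with a positive and a negative prompt ("showing {finding}" vs. "with no {finding}") and computes the per-class AUC from the softmax over that pair, which is how averaged zero-shot AUC is typically reported.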
Related papers
- Data-Constrained Synthesis of Training Data for De-Identification [0.0]
We domain-adapt large language models (LLMs) to the clinical domain.
We generate synthetic clinical texts that are machine-annotated with tags for personally identifiable information.
The synthetic corpora are then used to train synthetic NER models.
arXiv Detail & Related papers (2025-02-20T16:09:27Z)
- License Plate Images Generation with Diffusion Models [4.151073288078749]
We propose to synthesize realistic license plates (LPs) using diffusion models inspired by video generation.
We have prepared a synthetic dataset consisting of 10,000 LP images, publicly available at https://zenodo.org/doi/10.5281/zenodo.13342102.
arXiv Detail & Related papers (2025-01-06T20:22:18Z)
- Embryo 2.0: Merging Synthetic and Real Data for Advanced AI Predictions [69.07284335967019]
We train two generative models on two datasets: one newly created and made publicly available, and one existing public dataset.
We generate synthetic embryo images at various cell stages, including 2-cell, 4-cell, 8-cell, morula, and blastocyst.
These were combined with real images to train classification models for embryo cell stage prediction.
arXiv Detail & Related papers (2024-12-02T08:24:49Z)
- Chatting Up Attachment: Using LLMs to Predict Adult Bonds [0.0]
We use GPT-4 and Claude 3 Opus to create agents that simulate adults with varying profiles, childhood memories, and attachment styles.
We evaluate our models using a transcript dataset from 9 humans who underwent the same interview protocol, analyzed and labeled by mental health professionals.
Our findings indicate that training the models using only synthetic data achieves performance comparable to training the models on human data.
arXiv Detail & Related papers (2024-08-31T04:29:19Z)
- Learning from Synthetic Data for Visual Grounding [55.21937116752679]
We show that SynGround can improve the localization capabilities of off-the-shelf vision-and-language models.
Data generated with SynGround improves the pointing game accuracy of pretrained ALBEF and BLIP models by 4.81% and 17.11% absolute percentage points, respectively.
arXiv Detail & Related papers (2024-03-20T17:59:43Z)
- Synthetically Enhanced: Unveiling Synthetic Data's Potential in Medical Imaging Research [4.475998415951477]
Generative AI offers a promising approach to generating synthetic images, enhancing dataset diversity.
This study investigates the impact of synthetic data supplementation on the performance and generalizability of models in medical imaging research.
arXiv Detail & Related papers (2023-11-15T21:58:01Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- How Good Are Synthetic Medical Images? An Empirical Study with Lung Ultrasound [0.3312417881789094]
Adding synthetic training data using generative models offers a low-cost method to deal with the data scarcity challenge.
We show that training with both synthetic and real data outperforms training with real data alone; a minimal data-mixing sketch appears after this list.
arXiv Detail & Related papers (2023-10-05T15:42:53Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
- PromDA: Prompt-based Data Augmentation for Low-Resource NLU Tasks [61.51515750218049]
This paper focuses on data augmentation for low-resource Natural Language Understanding (NLU) tasks.
We propose the Prompt-based Data Augmentation model (PromDA), which trains only small-scale soft prompts.
PromDA generates synthetic data via two different views and filters out the low-quality data using NLU models.
arXiv Detail & Related papers (2022-02-25T05:09:27Z)
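The main paper's 9.07% mixed-data gain and the lung-ultrasound study above both rest on combining real and synthetic training pools. As referenced in that entry, here is a minimal data-mixing sketch in PyTorch; the TensorDataset stand-ins, tensor shapes, and 1:1 ratio are illustrative assumptions rather than any listed paper's protocol.

```python
# Minimal sketch of mixing real and synthetic training data in PyTorch.
# The TensorDataset stand-ins and the 1:1 ratio are illustrative
# assumptions, not the protocol of any paper listed above.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Stand-ins for (image, text-embedding) pairs from each source.
real = TensorDataset(torch.randn(1000, 3, 224, 224), torch.randn(1000, 512))
synth = TensorDataset(torch.randn(1000, 3, 224, 224), torch.randn(1000, 512))

# Concatenate and shuffle so every batch draws from both pools.
mixed = ConcatDataset([real, synth])
loader = DataLoader(mixed, batch_size=128, shuffle=True)

for images, text_embeddings in loader:
    pass  # one contrastive pretraining step would run here
```

With shuffle=True over the concatenation, the effective mixing ratio simply equals the ratio of dataset sizes; torch.utils.data.WeightedRandomSampler can realize other ratios without duplicating samples.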