Prompt to Polyp: Medical Text-Conditioned Image Synthesis with Diffusion Models
- URL: http://arxiv.org/abs/2505.05573v2
- Date: Mon, 12 May 2025 17:59:28 GMT
- Title: Prompt to Polyp: Medical Text-Conditioned Image Synthesis with Diffusion Models
- Authors: Mikhail Chaichuk, Sushant Gautam, Steven Hicks, Elena Tutubalina,
- Abstract summary: The generation of realistic medical images from text descriptions has significant potential to address data scarcity challenges in healthcare AI.<n>This paper presents a comprehensive study of text-to-image synthesis in the medical domain, comparing two distinct approaches.<n>We introduce a novel model named MSDM, an optimized architecture that integrates a clinical text encoder, variational autoencoder, and cross-attention mechanisms.
- Score: 5.12801085802078
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The generation of realistic medical images from text descriptions has significant potential to address data scarcity challenges in healthcare AI while preserving patient privacy. This paper presents a comprehensive study of text-to-image synthesis in the medical domain, comparing two distinct approaches: (1) fine-tuning large pre-trained latent diffusion models and (2) training small, domain-specific models. We introduce a novel model named MSDM, an optimized architecture based on Stable Diffusion that integrates a clinical text encoder, variational autoencoder, and cross-attention mechanisms to better align medical text prompts with generated images. Our study compares two approaches: fine-tuning large pre-trained models (FLUX, Kandinsky) versus training compact domain-specific models (MSDM). Evaluation across colonoscopy (MedVQA-GI) and radiology (ROCOv2) datasets reveals that while large models achieve higher fidelity, our optimized MSDM delivers comparable quality with lower computational costs. Quantitative metrics and qualitative evaluations by medical experts reveal strengths and limitations of each approach.
Related papers
- VICCA: Visual Interpretation and Comprehension of Chest X-ray Anomalies in Generated Report Without Human Feedback [1.5839621757142595]
We propose a novel framework designed to enhance the semantic alignment and localization accuracy of AI-generated medical reports.<n>By comparing features between the original and generated images, we introduce a dual-scoring system.<n>This approach significantly outperforms existing methods, achieving state-of-the-art results in pathology localization and text-to-image alignment.
arXiv Detail & Related papers (2025-01-29T16:02:16Z) - Latent Drifting in Diffusion Models for Counterfactual Medical Image Synthesis [55.959002385347645]
Latent Drifting enables diffusion models to be conditioned for medical images fitted for the complex task of counterfactual image generation.<n>We evaluate our method on three public longitudinal benchmark datasets of brain MRI and chest X-rays for counterfactual image generation.
arXiv Detail & Related papers (2024-12-30T01:59:34Z) - Cross-conditioned Diffusion Model for Medical Image to Image Translation [22.020931436223204]
We introduce a Cross-conditioned Diffusion Model (CDM) for medical image-to-image translation.
First, we propose a Modality-specific Representation Model (MRM) to model the distribution of target modalities.
Then, we design a Modality-decoupled Diffusion Network (MDN) to efficiently and effectively learn the distribution from MRM.
arXiv Detail & Related papers (2024-09-13T02:48:56Z) - MediSyn: A Generalist Text-Guided Latent Diffusion Model For Diverse Medical Image Synthesis [4.541407789437896]
MediSyn is a text-guided latent diffusion model capable of generating synthetic images from 6 medical specialties and 10 image types.<n>A direct comparison of the synthetic images against the real images confirms that our model synthesizes novel images and, crucially, may preserve patient privacy.<n>Our findings highlight the immense potential for generalist image generative models to accelerate algorithmic research and development in medicine.
arXiv Detail & Related papers (2024-05-16T04:28:44Z) - Improving Medical Report Generation with Adapter Tuning and Knowledge
Enhancement in Vision-Language Foundation Models [26.146579369491718]
This study builds upon the state-of-the-art vision-language pre-training and fine-tuning approach, BLIP-2, to customize general large-scale foundation models.
Validation on the dataset of ImageCLEFmedical 2023 demonstrates our model's prowess, achieving the best-averaged results against several state-of-the-art methods.
arXiv Detail & Related papers (2023-12-07T01:01:45Z) - DiffBoost: Enhancing Medical Image Segmentation via Text-Guided Diffusion Model [3.890243179348094]
Large-scale, big-variant, high-quality data are crucial for developing robust and successful deep-learning models for medical applications.<n>This paper proposes a novel approach by developing controllable diffusion models for medical image synthesis, called DiffBoost.<n>We leverage recent diffusion probabilistic models to generate realistic and diverse synthetic medical image data.
arXiv Detail & Related papers (2023-10-19T16:18:02Z) - PathLDM: Text conditioned Latent Diffusion Model for Histopathology [62.970593674481414]
We introduce PathLDM, the first text-conditioned Latent Diffusion Model tailored for generating high-quality histopathology images.
Our approach fuses image and textual data to enhance the generation process.
We achieved a SoTA FID score of 7.64 for text-to-image generation on the TCGA-BRCA dataset, significantly outperforming the closest text-conditioned competitor with FID 30.1.
arXiv Detail & Related papers (2023-09-01T22:08:32Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical
Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z) - Customizing General-Purpose Foundation Models for Medical Report
Generation [64.31265734687182]
The scarcity of labelled medical image-report pairs presents great challenges in the development of deep and large-scale neural networks.
We propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs) in computer vision and natural language processing.
arXiv Detail & Related papers (2023-06-09T03:02:36Z) - RoentGen: Vision-Language Foundation Model for Chest X-ray Generation [7.618389245539657]
We develop a strategy to overcome the large natural-medical distributional shift by adapting a pre-trained latent diffusion model on a corpus of publicly available chest x-rays.
We investigate the model's ability to generate high-fidelity, diverse synthetic CXR conditioned on text prompts.
We present evidence that the resulting model (RoentGen) is able to create visually convincing, diverse synthetic CXR images.
arXiv Detail & Related papers (2022-11-23T06:58:09Z) - Adapting Pretrained Vision-Language Foundational Models to Medical
Imaging Domains [3.8137985834223502]
Building generative models for medical images that faithfully depict clinical context may help alleviate the paucity of healthcare datasets.
We explore the sub-components of the Stable Diffusion pipeline to fine-tune the model to generate medical images.
Our best-performing model improves upon the stable diffusion baseline and can be conditioned to insert a realistic-looking abnormality on a synthetic radiology image.
arXiv Detail & Related papers (2022-10-09T01:43:08Z) - Many-to-One Distribution Learning and K-Nearest Neighbor Smoothing for
Thoracic Disease Identification [83.6017225363714]
deep learning has become the most powerful computer-aided diagnosis technology for improving disease identification performance.
For chest X-ray imaging, annotating large-scale data requires professional domain knowledge and is time-consuming.
In this paper, we propose many-to-one distribution learning (MODL) and K-nearest neighbor smoothing (KNNS) methods to improve a single model's disease identification performance.
arXiv Detail & Related papers (2021-02-26T02:29:30Z) - Predicting Clinical Diagnosis from Patients Electronic Health Records
Using BERT-based Neural Networks [62.9447303059342]
We show the importance of this problem in medical community.
We present a modification of Bidirectional Representations from Transformers (BERT) model for classification sequence.
We use a large-scale Russian EHR dataset consisting of about 4 million unique patient visits.
arXiv Detail & Related papers (2020-07-15T09:22:55Z) - Semi-supervised Medical Image Classification with Relation-driven
Self-ensembling Model [71.80319052891817]
We present a relation-driven semi-supervised framework for medical image classification.
It exploits the unlabeled data by encouraging the prediction consistency of given input under perturbations.
Our method outperforms many state-of-the-art semi-supervised learning methods on both single-label and multi-label image classification scenarios.
arXiv Detail & Related papers (2020-05-15T06:57:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.