RL4Med-DDPO: Reinforcement Learning for Controlled Guidance Towards Diverse Medical Image Generation using Vision-Language Foundation Models
- URL: http://arxiv.org/abs/2503.15784v1
- Date: Thu, 20 Mar 2025 01:51:05 GMT
- Title: RL4Med-DDPO: Reinforcement Learning for Controlled Guidance Towards Diverse Medical Image Generation using Vision-Language Foundation Models
- Authors: Parham Saremi, Amar Kumar, Mohammed Mohammed, Zahra TehraniNasab, Tal Arbel
- Abstract summary: Vision-Language Foundation Models (VLFM) have shown a tremendous increase in performance in generating high-resolution, photorealistic natural images. We propose a multi-stage architecture where a pre-trained VLFM provides a cursory semantic understanding, while a reinforcement learning algorithm refines the alignment through an iterative process. We demonstrate the effectiveness of our method on a medical imaging skin dataset, where the generated images exhibit improved generation quality and alignment with the prompt over fine-tuned Stable Diffusion.
- Score: 0.7165255458140439
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Foundation Models (VLFM) have shown a tremendous increase in performance in generating high-resolution, photorealistic natural images. While VLFMs show a rich understanding of semantic content across modalities, they often struggle with fine-grained alignment tasks that require precise correspondence between image regions and textual descriptions, a limitation in medical imaging, where accurate localization and detection of clinical features are essential for diagnosis and analysis. To address this issue, we propose a multi-stage architecture where a pre-trained VLFM provides a cursory semantic understanding, while a reinforcement learning (RL) algorithm refines the alignment through an iterative process that optimizes for understanding semantic context. The reward signal is designed to align the semantic information of the text with synthesized images. We demonstrate the effectiveness of our method on a medical imaging skin dataset, where the generated images exhibit improved generation quality and prompt alignment over fine-tuned Stable Diffusion. We also show that the synthesized samples could be used to improve disease classifier performance for underrepresented subgroups through augmentation.
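The abstract pairs a pre-trained text-to-image model with an RL fine-tuning loop (DDPO, per the title) whose reward scores how well a synthesized image matches its prompt. The sketch below is one plausible reading of that loop, not the authors' implementation: it uses an off-the-shelf CLIP similarity as the alignment reward, and the model checkpoints, prompt, and reward definition are illustrative assumptions; the full DDPO trajectory bookkeeping is only indicated in a comment.

```python
# Minimal sketch (not the authors' code): a CLIP-style prompt-alignment reward
# that could drive a DDPO-like policy-gradient update on Stable Diffusion.
# Model names, the prompt, and the reward definition are illustrative assumptions.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def alignment_reward(images, prompts):
    """Cosine similarity between CLIP image and text embeddings, one score per sample."""
    inputs = clip_proc(text=prompts, images=images, return_tensors="pt", padding=True).to(device)
    out = clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1)  # shape: (batch,)


prompts = ["dermoscopy image of melanoma with irregular pigment network"] * 4
images = pipe(prompts, num_inference_steps=30).images   # sample from the current policy
rewards = alignment_reward(images, prompts)              # reward per generated image
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
# In DDPO, `advantages` would weight the log-probability of each denoising
# trajectory in a clipped, PPO-style policy-gradient objective; that trajectory
# bookkeeping is omitted here.
```

In the paper's setting the reward would presumably be tailored to the skin-imaging prompts and attributes being controlled, rather than a generic CLIP checkpoint.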
Related papers
- CLIP-IT: CLIP-based Pairing for Histology Images Classification [6.855390956571216]
We introduce CLIP-IT to train a vision backbone model to classify histology images by pairing them with privileged textual information from an external source.
At first, the modality pairing step relies on a CLIP-based model to match histology images with semantically relevant textual report data from external sources, creating an augmented multimodal dataset.
A parameter-efficient fine-tuning method is used to efficiently address the misalignment between the main (image) and paired (text) modalities.
arXiv Detail & Related papers (2025-04-22T18:14:43Z)
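The CLIP-IT summary above hinges on a modality-pairing step that matches each histology image to the most semantically relevant external report. Below is a minimal sketch of what such a CLIP-based matching step could look like; it is not the authors' code, and the checkpoint and similarity threshold are assumptions.

```python
# Minimal sketch of a CLIP-based image/report pairing step in the spirit of CLIP-IT.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def pair_images_with_reports(images, reports, min_sim=0.25):
    """For each image, return the index of the most similar external report (or None)."""
    inputs = proc(text=reports, images=images, return_tensors="pt", padding=True, truncation=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sims = img @ txt.T                          # (num_images, num_reports)
    best_sim, best_idx = sims.max(dim=1)        # best-matching report per image
    return [i.item() if s >= min_sim else None for s, i in zip(best_sim, best_idx)]
```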
- Leveraging Vision-Language Embeddings for Zero-Shot Learning in Histopathology Images [7.048241543461529]
We propose a novel framework called Multi-Resolution Prompt-guided Hybrid Embedding (MR-PHE) to address these challenges in zero-shot histopathology image classification. We introduce a hybrid embedding strategy that integrates global image embeddings with weighted patch embeddings. A similarity-based patch weighting mechanism assigns attention-like weights to patches based on their relevance to class embeddings.
arXiv Detail & Related papers (2025-03-13T12:18:37Z)
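The MR-PHE entry above describes combining a global image embedding with patch embeddings weighted by their similarity to class (prompt) embeddings. The helper below is a small, self-contained sketch of that weighting idea under generic tensor inputs, not the authors' multi-resolution pipeline; the mixing weight and temperature are assumptions.

```python
# Illustrative sketch of a hybrid global+patch embedding with similarity-based
# patch weighting, loosely following the MR-PHE description above.
import torch
import torch.nn.functional as F


def hybrid_embedding(global_emb, patch_embs, class_embs, alpha=0.5, temperature=0.07):
    """
    global_emb: (d,)   image-level embedding
    patch_embs: (p, d) patch-level embeddings
    class_embs: (c, d) text embeddings of candidate class prompts
    Returns a single (d,) embedding mixing the global view with relevance-weighted patches.
    """
    patch = F.normalize(patch_embs, dim=-1)
    cls = F.normalize(class_embs, dim=-1)
    relevance = (patch @ cls.T).max(dim=1).values             # best class match per patch
    weights = torch.softmax(relevance / temperature, dim=0)   # attention-like patch weights
    weighted_patch = (weights[:, None] * patch_embs).sum(dim=0)
    return alpha * global_emb + (1 - alpha) * weighted_patch
```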
- MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations [13.991376926757036]
We propose MedUnifier, a unified Vision-Language Pre-Training framework tailored for medical data.
MedUnifier seamlessly integrates text-grounded image generation capabilities with multi-modal learning strategies.
Our approach employs visual vector quantization, which not only facilitates a more cohesive learning strategy for cross-modal understanding but also enhances multi-modal generation quality.
arXiv Detail & Related papers (2025-03-02T21:09:32Z)
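The MedUnifier summary attributes its text-grounded generation to visual vector quantization, i.e. mapping continuous visual features onto a discrete codebook. The snippet below is a generic VQ layer with a straight-through gradient estimator, shown only to make the term concrete; the codebook size and dimensions are arbitrary assumptions, and codebook/commitment losses are omitted.

```python
# Generic vector-quantization layer (straight-through estimator), illustrating
# the discrete visual representations mentioned in the MedUnifier summary.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        # z: (batch, tokens, dim) continuous visual features
        dists = ((z.unsqueeze(-2) - self.codebook.weight) ** 2).sum(-1)  # distance to every code
        idx = dists.argmin(dim=-1)                                        # nearest discrete code per token
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()                                      # straight-through gradients
        return z_q, idx
```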
- GCS-M3VLT: Guided Context Self-Attention based Multi-modal Medical Vision Language Transformer for Retinal Image Captioning [3.5948668755510136]
We propose a novel vision-language model for retinal image captioning that combines visual and textual features through a guided context self-attention mechanism.
Experiments on the DeepEyeNet dataset demonstrate a 0.023 BLEU@4 improvement, along with significant qualitative advancements.
arXiv Detail & Related papers (2024-12-23T03:49:29Z)
- MRGen: Segmentation Data Engine For Underrepresented MRI Modalities [59.61465292965639]
Training medical image segmentation models for rare yet clinically significant imaging modalities is challenging due to the scarcity of annotated data. This paper investigates leveraging generative models to synthesize training data for segmentation models of underrepresented modalities.
arXiv Detail & Related papers (2024-12-04T16:34:22Z)
- A Multimodal Approach Combining Structural and Cross-domain Textual Guidance for Weakly Supervised OCT Segmentation [12.948027961485536]
We propose a novel Weakly Supervised Semantic Segmentation (WSSS) approach that integrates structural guidance with text-driven strategies to generate high-quality pseudo labels.
Our method achieves state-of-the-art performance, highlighting its potential to improve diagnostic accuracy and efficiency in medical imaging.
arXiv Detail & Related papers (2024-11-19T16:20:27Z)
- OTCXR: Rethinking Self-supervised Alignment using Optimal Transport for Chest X-ray Analysis [6.4136876268620115]
Self-supervised learning (SSL) has emerged as a promising technique for analyzing medical modalities such as X-rays. We propose OTCXR, a novel SSL framework that leverages optimal transport (OT) to learn dense semantic invariance. We validate OTCXR's efficacy through comprehensive experiments on three publicly available chest X-ray datasets.
arXiv Detail & Related papers (2024-04-18T02:59:48Z)
- Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning [64.1316997189396]
We present a novel language-tied self-supervised learning framework, Hierarchical Language-tied Self-Supervision (HLSS) for histopathology images.
Our resulting model achieves state-of-the-art performance on two medical imaging benchmarks, OpenSRH and TCGA datasets.
arXiv Detail & Related papers (2024-03-21T17:58:56Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
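The low-data VLM entry above evaluates methods with text-to-image retrieval over paired chest X-rays and radiological reports. A common way to score such a benchmark is Recall@K in the shared embedding space; the helper below is a generic sketch of that metric, not the paper's exact protocol, and the choice of K is an assumption.

```python
# Generic Recall@K for text-to-image retrieval in a shared embedding space.
import torch


def recall_at_k(text_embs, image_embs, k=5):
    """
    text_embs:  (n, d) report embeddings; image_embs: (n, d) paired X-ray embeddings.
    Row i of each tensor corresponds to the same report/image pair.
    Returns the fraction of reports whose paired image is among the top-k retrieved.
    """
    text = torch.nn.functional.normalize(text_embs, dim=-1)
    image = torch.nn.functional.normalize(image_embs, dim=-1)
    sims = text @ image.T                               # (n, n) text-to-image similarities
    topk = sims.topk(k, dim=-1).indices                 # top-k image indices per report
    targets = torch.arange(text.size(0)).unsqueeze(-1)  # ground-truth index per report
    return (topk == targets).any(dim=-1).float().mean().item()
```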
- Joint fMRI Decoding and Encoding with Latent Embedding Alignment [77.66508125297754]
We introduce a unified framework that addresses both fMRI decoding and encoding.
Our model concurrently recovers visual stimuli from fMRI signals and predicts brain activity from images within a unified framework.
arXiv Detail & Related papers (2023-03-26T14:14:58Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Semantic segmentation of multispectral photoacoustic images using deep learning [53.65837038435433]
Photoacoustic imaging has the potential to revolutionise healthcare.
Clinical translation of the technology requires conversion of the high-dimensional acquired data into clinically relevant and interpretable information.
We present a deep learning-based approach to semantic segmentation of multispectral photoacoustic images.
arXiv Detail & Related papers (2021-05-20T09:33:55Z)
- Pathological Retinal Region Segmentation From OCT Images Using Geometric Relation Based Augmentation [84.7571086566595]
We propose improvements over previous GAN-based medical image synthesis methods by jointly encoding the intrinsic relationship of geometry and shape.
The proposed method outperforms state-of-the-art segmentation methods on the public RETOUCH dataset having images captured from different acquisition procedures.
arXiv Detail & Related papers (2020-03-31T11:50:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.