Related papers: Any-to-Any Vision-Language Model for Multimodal X-ray Imaging and Radiological Report Generation

URL: http://arxiv.org/abs/2505.01091v1
Date: Fri, 02 May 2025 08:07:24 GMT
Title: Any-to-Any Vision-Language Model for Multimodal X-ray Imaging and Radiological Report Generation
Authors: Daniele Molino, Francesco di Feola, Linlin Shen, Paolo Soda, Valerio Guarrasi
Abstract summary: We introduce a framework specifically designed for multimodal medical data generation. By enabling the generation of multi-view chest X-rays and their associated clinical report, it bridges the gap between general-purpose vision-language models and the specialized requirements of healthcare. Our framework achieves comparable or even superior performance compared to real data on downstream disease classification tasks.
Score: 26.589728923739596
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Generative models have revolutionized Artificial Intelligence (AI), particularly in multimodal applications. However, adapting these models to the medical domain poses unique challenges due to the complexity of medical data and the stringent need for clinical accuracy. In this work, we introduce a framework specifically designed for multimodal medical data generation. By enabling the generation of multi-view chest X-rays and their associated clinical report, it bridges the gap between general-purpose vision-language models and the specialized requirements of healthcare. Leveraging the MIMIC-CXR dataset, the proposed framework shows superior performance in generating high-fidelity images and semantically coherent reports. Our quantitative evaluation reveals significant results in terms of FID and BLEU scores, showcasing the quality of the generated data. Notably, our framework achieves comparable or even superior performance compared to real data on downstream disease classification tasks, underlining its potential as a tool for medical research and diagnostics. This study highlights the importance of domain-specific adaptations in enhancing the relevance and utility of generative models for clinical applications, paving the way for future advancements in synthetic multimodal medical data generation.

Related papers

Causal Disentanglement for Robust Long-tail Medical Image Generation [80.15257897500578]
We propose a novel medical image generation framework, which generates independent pathological and structural features. We leverage a diffusion model guided by pathological findings to model pathological features, enabling the generation of diverse counterfactual images.
arXiv Detail & Related papers (2025-04-20T01:54:18Z)
XGeM: A Multi-Prompt Foundation Model for Multimodal Medical Data Generation [22.908801443059758]
XGeM is a multimodal generative model designed to support flexible, any-to-any synthesis between medical data modalities. XGeM constructs a shared latent space via contrastive learning and introduces a novel Multi-Prompt Training strategy. We show how XGeM can support key medical data challenges such as anonymization, class imbalance, and data scarcity.
arXiv Detail & Related papers (2025-01-08T16:53:56Z)
Large Language Model Benchmarks in Medical Tasks [14.739357670600103]
This paper presents a survey of various benchmark datasets employed in medical large language model (LLM) tasks. The survey categorizes the datasets by modality, discussing their significance, data structure, and impact on the development of LLMs. The paper emphasizes the need for datasets with a greater degree of language diversity, structured omics data, and innovative approaches to synthesis.
arXiv Detail & Related papers (2024-10-28T11:07:33Z)
PathInsight: Instruction Tuning of Multimodal Datasets and Models for Intelligence Assisted Diagnosis in Histopathology [7.87900104748629]
We have meticulously compiled a dataset of approximately 45,000 cases, covering over 6 different tasks. We have fine-tuned multimodal large models, specifically LLaVA, Qwen-VL, InternLM, with this dataset to enhance instruction-based performance.
arXiv Detail & Related papers (2024-08-13T17:05:06Z)
Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports [51.45762396192655]
Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecting the medical field. Notably, Gemini-Vision-series (Gemini) and GPT-4-series (GPT-4) models have epitomized a paradigm shift in Artificial General Intelligence for computer vision. This study evaluated Gemini, GPT-4, and 4 other popular large models across 14 medical imaging datasets.
arXiv Detail & Related papers (2024-07-08T09:08:42Z)
EMERGE: Enhancing Multimodal Electronic Health Records Predictive Modeling with Retrieval-Augmented Generation [22.94521527609479]
EMERGE is a Retrieval-Augmented Generation (RAG) driven framework to enhance multimodal EHR predictive modeling. We extract entities from time-series data and clinical notes by prompting Large Language Models (LLMs) and align them with professional PrimeKG. The extracted knowledge is then used to generate task-relevant summaries of patients' health statuses.
arXiv Detail & Related papers (2024-05-27T10:53:15Z)
Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed. In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset. We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning [65.54680361074882]
The Eye-gaze Guided Multi-modal Alignment (EGMA) framework harnesses eye-gaze data for better alignment of medical visual and textual features. We evaluate it on downstream image classification and image-text retrieval tasks across four medical datasets.
arXiv Detail & Related papers (2024-03-19T03:59:14Z)
Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
We train open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology. For training, we assemble a large dataset of over 697 thousand radiology image-text pairs. For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation. Inference with LLaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z)
REALM: RAG-Driven Enhancement of Multimodal Electronic Health Records Analysis via Large Language Models [19.62552013839689]
Existing models often lack the medical context relevant to clinical tasks, prompting the incorporation of external knowledge. We propose REALM, a Retrieval-Augmented Generation (RAG) driven framework to enhance multimodal EHR representations. Our experiments on MIMIC-III mortality and readmission tasks showcase the superior performance of our REALM framework over baselines.
arXiv Detail & Related papers (2024-02-10T18:27:28Z)
Beyond Images: An Integrative Multi-modal Approach to Chest X-Ray Report Generation [47.250147322130545]
Image-to-text radiology report generation aims to automatically produce radiology reports that describe the findings in medical images. Most existing methods focus solely on the image data, disregarding the other patient information accessible to radiologists. We present a novel multi-modal deep neural network framework for generating chest X-ray reports by integrating structured patient data, such as vital signs and symptoms, alongside unstructured clinical notes.
arXiv Detail & Related papers (2023-11-18T14:37:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.