MedGEN-Bench: Contextually entangled benchmark for open-ended multimodal medical generation
- URL: http://arxiv.org/abs/2511.13135v2
- Date: Tue, 18 Nov 2025 09:19:07 GMT
- Title: MedGEN-Bench: Contextually entangled benchmark for open-ended multimodal medical generation
- Authors: Junjie Yang, Yuhao Yan, Gang Wu, Yuxuan Wang, Ruoyu Liang, Xinjie Jiang, Xiang Wan, Fenglei Fan, Yongquan Zhang, Feiwei Qin, Changmiao Wang
- Abstract summary: We introduce MedGEN-Bench, a comprehensive multimodal benchmark designed to advance medical AI research. MedGEN-Bench comprises 6,422 expert-validated image-text pairs spanning six imaging modalities, 16 clinical tasks, and 28 subtasks. We employ a novel three-tier assessment framework that integrates pixel-level metrics, semantic text analysis, and expert-guided clinical relevance scoring.
- Score: 40.05080887875327
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As Vision-Language Models (VLMs) increasingly gain traction in medical applications, clinicians are progressively expecting AI systems not only to generate textual diagnoses but also to produce corresponding medical images that integrate seamlessly into authentic clinical workflows. Despite the growing interest, existing medical visual benchmarks present notable limitations. They often rely on ambiguous queries that lack sufficient relevance to image content, oversimplify complex diagnostic reasoning into closed-ended shortcuts, and adopt a text-centric evaluation paradigm that overlooks the importance of image generation capabilities. To address these challenges, we introduce MedGEN-Bench, a comprehensive multimodal benchmark designed to advance medical AI research. MedGEN-Bench comprises 6,422 expert-validated image-text pairs spanning six imaging modalities, 16 clinical tasks, and 28 subtasks. It is structured into three distinct formats: Visual Question Answering, Image Editing, and Contextual Multimodal Generation. What sets MedGEN-Bench apart is its focus on contextually intertwined instructions that necessitate sophisticated cross-modal reasoning and open-ended generative outputs, moving beyond the constraints of multiple-choice formats. To evaluate the performance of existing systems, we employ a novel three-tier assessment framework that integrates pixel-level metrics, semantic text analysis, and expert-guided clinical relevance scoring. Using this framework, we systematically assess 10 compositional frameworks, 3 unified models, and 5 VLMs.
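A minimal sketch of how such a three-tier score could be wired together is shown below. The concrete choices are illustrative assumptions, not the paper's protocol: SSIM as the pixel-level metric, token-level F1 as a stand-in for semantic text analysis, a 1-5 Likert scale for expert ratings, and placeholder aggregation weights.

```python
# Illustrative three-tier scoring pipeline in the spirit of MedGEN-Bench.
# All metric choices, the 1-5 expert rating scale, and the weights are
# assumptions for this sketch, not the benchmark's actual specification.
import numpy as np
from skimage.metrics import structural_similarity


def pixel_tier(pred_img: np.ndarray, ref_img: np.ndarray) -> float:
    """Tier 1: pixel-level fidelity (SSIM used here as a stand-in)."""
    data_range = float(ref_img.max() - ref_img.min()) or 1.0
    return structural_similarity(pred_img, ref_img, data_range=data_range)


def semantic_tier(pred_text: str, ref_text: str) -> float:
    """Tier 2: semantic text analysis, approximated by token-level F1."""
    pred, ref = set(pred_text.lower().split()), set(ref_text.lower().split())
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def clinical_tier(expert_scores: list[float]) -> float:
    """Tier 3: expert-guided clinical relevance; averages rater scores
    and rescales an assumed 1-5 Likert scale to [0, 1]."""
    return (float(np.mean(expert_scores)) - 1.0) / 4.0


def aggregate(pixel: float, semantic: float, clinical: float,
              weights: tuple = (0.3, 0.3, 0.4)) -> float:
    """Weighted combination of the three tiers; weights are placeholders."""
    return weights[0] * pixel + weights[1] * semantic + weights[2] * clinical
```

In a real evaluation the clinical tier would come from per-case clinician annotations rather than a function call; the stub above simply averages rater scores so the pipeline is runnable end to end.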
Related papers
- TGC-Net: A Structure-Aware and Semantically-Aligned Framework for Text-Guided Medical Image Segmentation [56.09179939570486]
We propose TGC-Net, a CLIP-based framework focusing on parameter-efficient, task-specific adaptations. TGC-Net achieves state-of-the-art performance with substantially fewer trainable parameters, including notable Dice gains on challenging benchmarks.
arXiv Detail & Related papers (2025-12-24T12:06:26Z) - UniMedVL: Unifying Medical Multimodal Understanding And Generation Through Observation-Knowledge-Analysis [41.864457631668806]
We introduce UniMedVL, the first unified medical multimodal model that handles image understanding and generation tasks simultaneously. UniMedVL achieves superior performance on five medical image understanding benchmarks while matching specialized models in generation quality across eight medical imaging modalities.
arXiv Detail & Related papers (2025-10-17T14:54:58Z) - TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models [54.48710348910535]
Existing medical reasoning benchmarks primarily focus on analyzing a patient's condition based on an image from a single visit. We introduce TemMed-Bench, the first benchmark designed for analyzing changes in patients' conditions between different clinical visits.
arXiv Detail & Related papers (2025-09-29T17:51:26Z) - MedAtlas: Evaluating LLMs for Multi-Round, Multi-Task Medical Reasoning Across Diverse Imaging Modalities and Clinical Text [25.102399692530245]
We introduce MedAtlas, a novel benchmark framework to evaluate large language models on realistic medical reasoning tasks. MedAtlas is characterized by four key features: multi-turn dialogue, multi-modal medical image interaction, multi-task integration, and high clinical fidelity. Each case is derived from real diagnostic cases and incorporates temporal interactions between textual medical histories and multiple imaging modalities, including CT, MRI, PET, ultrasound, and X-ray.
arXiv Detail & Related papers (2025-08-13T17:32:17Z) - MedSeg-R: Reasoning Segmentation in Medical Images with Multimodal Large Language Models [48.24824129683951]
We introduce medical image reasoning segmentation, a novel task that aims to generate segmentation masks based on complex and implicit medical instructions. To address this, we propose MedSeg-R, an end-to-end framework that leverages the reasoning abilities of MLLMs to interpret clinical questions. It is built on two core components: 1) a global context understanding module that interprets images and comprehends complex medical instructions to generate multi-modal intermediate tokens, and 2) a pixel-level grounding module that decodes these tokens to produce precise segmentation masks.
arXiv Detail & Related papers (2025-06-12T08:13:38Z) - RadIR: A Scalable Framework for Multi-Grained Medical Image Retrieval via Radiology Report Mining [64.66825253356869]
We propose a novel methodology that leverages dense radiology reports to define image-wise similarity ordering at multiple granularities. We construct two comprehensive medical imaging retrieval datasets: MIMIC-IR for chest X-rays and CTRATE-IR for CT scans. We develop two retrieval systems, RadIR-CXR and RadIR-ChestCT, which demonstrate superior performance in traditional image-image and image-report retrieval tasks.
arXiv Detail & Related papers (2025-03-06T17:43:03Z) - Text-Promptable Propagation for Referring Medical Image Sequence Segmentation [20.724643106195852]
Ref-MISS aims to segment anatomical structures in medical image sequences based on natural language descriptions. Existing 2D and 3D segmentation models struggle to explicitly track objects of interest across medical image sequences. We propose Text-Promptable Propagation (TPP), a model designed for referring medical image sequence segmentation.
arXiv Detail & Related papers (2025-02-16T12:13:11Z) - A Generative Framework for Bidirectional Image-Report Understanding in Chest Radiography [1.2289361708127877]
Multi-Stage Adaptive Vision-Language Tuning (MAViLT) is a novel framework designed to enhance multimodal reasoning and generation for vision-based understanding. MAViLT incorporates a clinical gradient-weighted tokenization process and a hierarchical fine-tuning strategy, enabling it to generate accurate radiology reports, synthesize realistic CXRs from text, and answer vision-based clinical questions. We evaluate MAViLT on two benchmark datasets, MIMIC-CXR and Indiana University CXR, achieving state-of-the-art results across all tasks.
arXiv Detail & Related papers (2025-02-09T15:02:57Z) - M3T: Multi-Modal Medical Transformer to bridge Clinical Context with Visual Insights for Retinal Image Medical Description Generation [1.8789068567093286]
Multi-Modal Medical Transformer (M3T) is a novel deep learning architecture that integrates visual representations with diagnostic keywords.
Experimental studies on the DeepEyeNet dataset validate the success of M3T in meeting ophthalmologists' standards.
arXiv Detail & Related papers (2024-06-19T00:46:48Z) - QUBIQ: Uncertainty Quantification for Biomedical Image Segmentation Challenge [93.61262892578067]
Uncertainty in medical image segmentation tasks, especially inter-rater variability, presents a significant challenge.
This variability directly impacts the development and evaluation of automated segmentation algorithms.
We report the set-up and summarize the benchmark results of the Quantification of Uncertainties in Biomedical Image Quantification Challenge (QUBIQ).
arXiv Detail & Related papers (2024-03-19T17:57:24Z) - Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning [65.54680361074882]
The Eye-gaze Guided Multi-modal Alignment (EGMA) framework harnesses eye-gaze data for better alignment of medical visual and textual features.
We evaluate on downstream tasks of image classification and image-text retrieval across four medical datasets.
arXiv Detail & Related papers (2024-03-19T03:59:14Z)