A Multimodal LLM Approach for Visual Question Answering on Multiparametric 3D Brain MRI
- URL: http://arxiv.org/abs/2509.25889v2
- Date: Wed, 01 Oct 2025 03:37:48 GMT
- Title: A Multimodal LLM Approach for Visual Question Answering on Multiparametric 3D Brain MRI
- Authors: Arvind Murari Vepa, Yannan Yu, Jingru Gan, Anthony Cuturrufo, Weikai Li, Wei Wang, Fabien Scalzo, Yizhou Sun
- Abstract summary: mpLLM is a prompt-conditioned hierarchical mixture-of-experts architecture for visual question answering over 3D brain MRI. mpLLM routes across modality-level and token-level projection experts to fuse multiple interrelated 3D modalities. mpLLM outperforms strong medical VLM baselines by 5.3% on average across multiple mpMRI datasets.
- Score: 31.111739327390925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce mpLLM, a prompt-conditioned hierarchical mixture-of-experts (MoE) architecture for visual question answering over multi-parametric 3D brain MRI (mpMRI). mpLLM routes across modality-level and token-level projection experts to fuse multiple interrelated 3D modalities, enabling efficient training without image-report pretraining. To address limited image-text paired supervision, mpLLM integrates a synthetic visual question answering (VQA) protocol that generates medically relevant VQA from segmentation annotations, and we collaborate with medical experts for clinical validation. mpLLM outperforms strong medical VLM baselines by 5.3% on average across multiple mpMRI datasets. Our study features three main contributions: (1) the first clinically validated VQA dataset for 3D brain mpMRI, (2) a novel multimodal LLM that handles multiple interrelated 3D modalities, and (3) strong empirical results that demonstrate the medical utility of our methodology. Ablations highlight the importance of modality-level and token-level experts and prompt-conditioned routing.
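The two-level routing described in the abstract can be pictured with a short sketch. Below is a minimal PyTorch illustration under stated assumptions: the module name (PromptConditionedMoE), expert counts, toy dimensions, and dense softmax gating are hypothetical simplifications, not the paper's released code, and the actual expert design, fusion order, and gating inputs may differ.

```python
# Hedged sketch of prompt-conditioned hierarchical MoE routing, loosely
# following the abstract. All names and dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptConditionedMoE(nn.Module):
    def __init__(self, dim=256, n_modalities=4, n_token_experts=4):
        super().__init__()
        # One projection expert per MRI modality (e.g., T1, T1c, T2, FLAIR).
        self.modality_experts = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(n_modalities)]
        )
        # Shared pool of token-level projection experts.
        self.token_experts = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(n_token_experts)]
        )
        # Both gates are conditioned on the question (prompt) embedding.
        self.modality_gate = nn.Linear(dim, n_modalities)
        self.token_gate = nn.Linear(2 * dim, n_token_experts)

    def forward(self, modality_tokens, prompt_emb):
        # modality_tokens: list of (B, N, D) token tensors, one per 3D modality
        # prompt_emb: (B, D) pooled embedding of the question text
        m_weights = F.softmax(self.modality_gate(prompt_emb), dim=-1)  # (B, M)
        fused = 0.0
        for i, (tokens, expert) in enumerate(
            zip(modality_tokens, self.modality_experts)
        ):
            # Modality-level routing: weight each modality's projected tokens.
            fused = fused + m_weights[:, i, None, None] * expert(tokens)
        # Token-level routing: each fused token mixes token experts,
        # again conditioned on the prompt.
        B, N, D = fused.shape
        gate_in = torch.cat(
            [fused, prompt_emb[:, None, :].expand(B, N, D)], dim=-1
        )
        t_weights = F.softmax(self.token_gate(gate_in), dim=-1)  # (B, N, E)
        out = sum(
            t_weights[..., e, None] * expert(fused)
            for e, expert in enumerate(self.token_experts)
        )
        return out  # (B, N, D) visual tokens handed to the LLM

# Toy usage: four modalities, 32 patch tokens each.
moe = PromptConditionedMoE()
mods = [torch.randn(2, 32, 256) for _ in range(4)]
prompt = torch.randn(2, 256)
print(moe(mods, prompt).shape)  # torch.Size([2, 32, 256])
```

Conditioning both gates on the question embedding is what lets a shared expert pool specialize per prompt (e.g., up-weighting FLAIR-derived tokens for an edema question), consistent with the abstract's note that ablations highlight prompt-conditioned routing.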
Related papers
- 3D Modality-Aware Pre-training for Vision-Language Model in MRI Multi-organ Abnormality Detection [0.31351527202068447]
We propose MedMAP, a framework that enhances vision-language representation learning in 3D MRI. MedMAP comprises a modality-aware vision-language alignment stage and a fine-tuning stage for multi-organ abnormality detection. Our experiments on MedMoM-MRI3D demonstrate that MedMAP significantly outperforms existing VLMs in 3D MRI-based multi-organ abnormality detection.
arXiv Detail & Related papers (2026-02-27T03:37:55Z)
- Multimodal Visual Surrogate Compression for Alzheimer's Disease Classification [69.87877580725768]
Multimodal Visual Surrogate Compression (MVSC) learns to compress and adapt large 3D sMRI volumes into compact 2D features. MVSC has two key components: a Volume Context module that captures global cross-slice context under textual guidance, and an Adaptive Slice Fusion module that aggregates slice-level information in a text-enhanced, patch-wise manner.
arXiv Detail & Related papers (2026-01-29T13:05:46Z)
- MedVL-SAM2: A unified 3D medical vision-language model for multimodal reasoning and prompt-driven segmentation [11.762545584252052]
We propose a unified 3D medical multimodal model that supports report generation, VQA, and multi-paradigm segmentation. MedVL-SAM2 integrates image-level reasoning and pixel-level perception through a cohesive architecture tailored for 3D medical imaging. Our unified architecture delivers state-of-the-art performance across report generation, VQA, and multiple 3D segmentation tasks.
arXiv Detail & Related papers (2026-01-14T21:21:00Z)
- CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering [16.115735955158428]
Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent self-attention based methods struggle to handle cross-modal semantic alignment between vision and language. We introduce a Cross-Mamba Interaction based Multi-Task Learning framework that learns cross-modal feature representations from images and texts.
arXiv Detail & Related papers (2025-11-03T09:05:16Z)
- MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning [52.064286116035134]
We develop MedAlign, a framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA). We first propose a multimodal Direct Preference Optimization (mDPO) objective to align preference learning with visual context. We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that uses image-text similarity to route queries to a specialized, context-augmented LVLM (a minimal routing sketch appears after this list).
arXiv Detail & Related papers (2025-10-24T02:11:05Z)
- Triplet-Structured Knowledge Integration for Multi-Turn Medical Reasoning [21.44813166265882]
Large Language Models (LLMs) have shown strong performance on static medical Question Answering (QA) tasks. This paper introduces TriMediQ, a triplet-structured approach that enhances the reasoning reliability of LLMs. Experiments on two interactive medical QA benchmarks show that TriMediQ achieves up to 10.4% improvement in accuracy over five existing baselines.
arXiv Detail & Related papers (2025-10-03T22:11:17Z)
- M3Ret: Unleashing Zero-shot Multimodal Medical Image Retrieval via Self-Supervision [24.846428105192405]
We train M3Ret, a unified visual encoder, without any modality-specific customization. It successfully learns transferable representations using both generative (MAE) and contrastive (SimDINO) self-supervised learning (SSL) paradigms. Our approach sets a new state-of-the-art in zero-shot image-to-image retrieval across all individual modalities, surpassing strong baselines such as DINOv3 and the text-supervised BMC-CLIP.
arXiv Detail & Related papers (2025-09-01T10:59:39Z)
- Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning [57.873833577058]
We build a multimodal dataset enriched with extensive medical knowledge. We then introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities.
arXiv Detail & Related papers (2025-06-08T08:47:30Z)
- MEDMKG: Benchmarking Medical Knowledge Exploitation with Multimodal Knowledge Graph [28.79000907242469]
We propose MEDMKG, a Medical Multimodal Knowledge Graph that unifies visual and textual medical information through a multi-stage construction pipeline. We evaluate MEDMKG across three tasks under two experimental settings, benchmarking twenty-four baseline methods and four state-of-the-art vision-language backbones on six datasets. Results show that MEDMKG not only improves performance in downstream medical tasks but also offers a strong foundation for developing adaptive and robust strategies for multimodal knowledge integration in medical artificial intelligence.
arXiv Detail & Related papers (2025-05-22T18:41:46Z)
- Zeus: Zero-shot LLM Instruction for Union Segmentation in Multimodal Medical Imaging [4.341503087761129]
Multimodal learning over visual and text modalities has been shown to help, but collecting paired vision-language datasets is expensive and time-consuming. Inspired by the superior ability of Large Language Models (LLMs) in numerous cross-modal tasks, we propose a novel Vision-LLM union framework to address these issues.
arXiv Detail & Related papers (2025-04-09T23:33:35Z)
- Towards General Text-guided Image Synthesis for Customized Multimodal Brain MRI Generation [51.28453192441364]
Multimodal brain magnetic resonance (MR) imaging is indispensable in neuroscience and neurology.
Current MR image synthesis approaches are typically trained on independent datasets for specific tasks.
We present TUMSyn, a Text-guided Universal MR image Synthesis model, which can flexibly generate brain MR images.
arXiv Detail & Related papers (2024-09-25T11:14:47Z)
- Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models [17.643421997037514]
We propose a novel framework that tackles both discriminative and generative multimodal medical tasks.
The learning of Med-MoE consists of three steps: multimodal medical alignment, instruction tuning and routing, and domain-specific MoE tuning.
Our model can achieve performance superior to or on par with state-of-the-art baselines.
arXiv Detail & Related papers (2024-04-16T02:35:17Z)
- SDR-Former: A Siamese Dual-Resolution Transformer for Liver Lesion Classification Using 3D Multi-Phase Imaging [59.78761085714715]
This study proposes a novel Siamese Dual-Resolution Transformer (SDR-Former) framework for liver lesion classification.
The proposed framework has been validated through comprehensive experiments on two clinical datasets.
To support the scientific community, we are releasing our extensive multi-phase MR dataset for liver lesion analysis to the public.
arXiv Detail & Related papers (2024-02-27T06:32:56Z)
- Source-Free Collaborative Domain Adaptation via Multi-Perspective Feature Enrichment for Functional MRI Analysis [55.03872260158717]
Resting-state functional MRI (rs-fMRI) is increasingly employed in multi-site research to aid neurological disorder analysis.
Many methods have been proposed to reduce fMRI heterogeneity between source and target domains.
But acquiring source data is challenging due to privacy concerns and/or data storage burdens in multi-site studies.
We design a source-free collaborative domain adaptation framework for fMRI analysis, where only a pretrained source model and unlabeled target data are accessible.
arXiv Detail & Related papers (2023-08-24T01:30:18Z)
- Video4MRI: An Empirical Study on Brain Magnetic Resonance Image Analytics with CNN-based Video Classification Frameworks [60.42012344842292]
3D CNN-based models dominate the field of magnetic resonance image (MRI) analytics.
In this paper, four datasets for Alzheimer's and Parkinson's disease recognition are used in the experiments.
In terms of efficiency, the video frameworks outperform 3D-CNN models by 5%-11% while using 50%-66% fewer trainable parameters.
arXiv Detail & Related papers (2023-02-24T15:26:31Z)
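As referenced in the MedAlign entry above, similarity-based expert routing in the spirit of RA-MoE can be sketched as follows. The prototype embeddings, cosine-similarity rule, and argmax dispatch here are illustrative assumptions, not MedAlign's actual implementation.

```python
# Hedged sketch of retrieval-aware routing: pick a specialized expert model
# by cosine similarity between the query's image/text embedding and each
# expert's prototype embedding. All names are illustrative.
import torch
import torch.nn.functional as F

def route_query(image_emb, text_emb, expert_prototypes):
    """image_emb, text_emb: (dim,); expert_prototypes: (n_experts, dim)."""
    query = F.normalize(image_emb + text_emb, dim=-1)
    protos = F.normalize(expert_prototypes, dim=-1)
    scores = protos @ query            # cosine similarity per expert
    return int(torch.argmax(scores))   # index of the expert LVLM to invoke

# Toy usage: three specialized experts, 128-dim embeddings.
img, txt = torch.randn(128), torch.randn(128)
protos = torch.randn(3, 128)
print(route_query(img, txt, protos))
```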