Why Text Prevails: Vision May Undermine Multimodal Medical Decision Making
- URL: http://arxiv.org/abs/2512.13747v1
- Date: Mon, 15 Dec 2025 03:09:31 GMT
- Title: Why Text Prevails: Vision May Undermine Multimodal Medical Decision Making
- Authors: Siyuan Dai, Lunxiao Li, Kun Zhao, Eardi Lila, Paul K. Crane, Heng Huang, Dongkuan Xu, Haoteng Tang, Liang Zhan
- Abstract summary: We show that even state-of-the-art multimodal large language models (MLLMs) struggle with basic Medical Decision Making (MDM) tasks. Our empirical study shows that text-only reasoning consistently outperforms vision-only or vision-text settings. These findings point to promising directions for improving multimodal decision making in healthcare.
- Score: 47.976936248969366
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the rapid progress of large language models (LLMs), advanced multimodal large language models (MLLMs) have demonstrated impressive zero-shot capabilities on vision-language tasks. In the biomedical domain, however, even state-of-the-art MLLMs struggle with basic Medical Decision Making (MDM) tasks. We investigate this limitation using two challenging datasets: (1) three-stage Alzheimer's disease (AD) classification (normal, mild cognitive impairment, dementia), where category differences are visually subtle, and (2) MIMIC-CXR chest radiograph classification with 14 non-mutually exclusive conditions. Our empirical study shows that text-only reasoning consistently outperforms vision-only or vision-text settings, with multimodal inputs often performing worse than text alone. To mitigate this, we explore three strategies: (1) in-context learning with reason-annotated exemplars, (2) vision captioning followed by text-only inference, and (3) few-shot fine-tuning of the vision tower with classification supervision. These findings reveal that current MLLMs lack grounded visual understanding and point to promising directions for improving multimodal decision making in healthcare.
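A minimal sketch of the second mitigation strategy described in the abstract (vision captioning followed by text-only inference), assuming an OpenAI-compatible chat API. The model name, prompts, image path, and label set are illustrative placeholders and not the authors' implementation.

```python
# Sketch: caption the image with an MLLM, then do text-only reasoning over the caption.
# Assumes the OpenAI Python client (openai>=1.0) and an API key in OPENAI_API_KEY.
import base64
from openai import OpenAI

client = OpenAI()

# Hypothetical label set for the three-stage AD classification task.
LABELS = ["normal", "mild cognitive impairment", "dementia"]


def encode_image(path: str) -> str:
    """Return a base64 data URL for a local image file."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/png;base64,{b64}"


def caption_image(image_path: str, model: str = "gpt-4o-mini") -> str:
    """Step 1: vision captioning -- describe the scan in clinical terms."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe the salient radiological findings in this brain MRI."},
                {"type": "image_url",
                 "image_url": {"url": encode_image(image_path)}},
            ],
        }],
    )
    return resp.choices[0].message.content


def classify_from_caption(caption: str, model: str = "gpt-4o-mini") -> str:
    """Step 2: text-only inference over the generated caption."""
    prompt = (
        f"Findings: {caption}\n\n"
        f"Based only on these findings, choose one diagnosis from {LABELS}. "
        "Answer with the label only."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()


if __name__ == "__main__":
    caption = caption_image("example_scan.png")  # placeholder path
    print("Caption:", caption)
    print("Prediction:", classify_from_caption(caption))
```

The two-step split mirrors the paper's observation that text-only reasoning tends to outperform direct multimodal input: the image is reduced to text once, and the final decision is made purely in the text domain.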
Related papers
- Enhancing Medical Large Vision-Language Models via Alignment Distillation [30.592211423687246]
We propose MEDALIGN to transfer visual alignment knowledge from a domain-specific Contrastive Language-Image Pre-training model to Med-LVLMs. Experiments on medical report generation and medical visual question answering benchmarks show that MEDALIGN consistently improves both performance and interpretability.
arXiv Detail & Related papers (2025-12-21T00:57:13Z)
- Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation [25.148217482604746]
We propose VALOR: Visual Alignment of Medical Vision-Language Models for Radiology Report Generation. Our method introduces a reinforcement learning-based post-alignment framework utilizing Group-Relative Proximal Optimization (GRPO). Experiments on multiple benchmarks demonstrate that VALOR substantially improves factual accuracy and visual grounding, achieving significant performance gains over state-of-the-art report generation methods.
arXiv Detail & Related papers (2025-12-18T05:48:21Z)
- TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models [54.48710348910535]
Existing medical reasoning benchmarks primarily focus on analyzing a patient's condition based on an image from a single visit. We introduce TemMed-Bench, the first benchmark designed for analyzing changes in patients' conditions between different clinical visits.
arXiv Detail & Related papers (2025-09-29T17:51:26Z)
- VELVET-Med: Vision and Efficient Language Pre-training for Volumetric Imaging Tasks in Medicine [11.993301266706139]
We propose a vision-language pre-training framework, termed VELVET-Med, specifically designed for limited volumetric data such as 3D CT and associated radiology reports. Our approach seeks to uncover rich spatial and semantic relationships embedded in volumetric medical images and corresponding clinical narratives. The resulting encoders exhibit strong transferability, achieving state-of-the-art performance across a wide range of downstream tasks.
arXiv Detail & Related papers (2025-08-16T17:08:43Z)
- EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow [43.82288530883818]
EH-Benchmark is a novel ophthalmology benchmark designed to evaluate hallucinations in Medical Large Language Models. We categorize hallucinations based on specific tasks and error types into two primary classes: Visual Understanding and Logical Composition. Our framework significantly mitigates both types of hallucinations, enhancing accuracy, interpretability, and reliability.
arXiv Detail & Related papers (2025-07-24T12:07:36Z)
- From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation [48.45209969191245]
Vision-language models (VLMs) provide semantic context through textual descriptions but lack the required explanation precision. We propose a teacher-student framework that integrates both gaze and language supervision, leveraging their complementary strengths. Our method achieves Dice scores of 80.78%, 80.53%, and 84.22% on the respective benchmarks, improving 3-5% over gaze baselines without increasing the annotation burden.
arXiv Detail & Related papers (2025-04-15T16:32:15Z)
- LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition? [59.81732629438753]
We propose LLaVA-RadZ, a simple yet effective framework for zero-shot medical disease recognition that utilizes existing MLLM features. Specifically, we design an end-to-end training strategy, termed Decoding-Side Feature Alignment Training (DFAT), to take advantage of the characteristics of the MLLM decoder architecture. We also introduce a Domain Knowledge Anchoring Module (DKAM) to exploit the intrinsic medical knowledge of large models.
arXiv Detail & Related papers (2025-03-10T16:05:40Z)
- Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning [151.4060202671114]
Multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing vision-language tasks. This paper introduces a novel bottom-up reasoning framework to address hallucinations in MLLMs. Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge.
arXiv Detail & Related papers (2024-12-15T09:10:46Z)
- ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
We announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports.
Based on this dataset, we focus on the challenging task of unsupervised pretraining.
We propose ViKL, a framework that synergizes Visual, Knowledge, and Linguistic features.
arXiv Detail & Related papers (2024-09-24T05:01:23Z)
- How Does Diverse Interpretability of Textual Prompts Impact Medical Vision-Language Zero-Shot Tasks? [10.09105558197397]
Recent advancements in medical vision-language pre-training have significantly enhanced zero-shot medical vision tasks.
The performance of these tasks can be heavily influenced by the variability in textual prompts describing the categories.
arXiv Detail & Related papers (2024-08-31T20:43:06Z)
- Hallucination Augmented Contrastive Learning for Multimodal Large Language Model [53.65682783591723]
Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks.
However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information.
In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning.
arXiv Detail & Related papers (2023-12-12T04:05:15Z)