Perceive and Calibrate: Analyzing and Enhancing Robustness of Medical Multi-Modal Large Language Models
- URL: http://arxiv.org/abs/2512.21964v1
- Date: Fri, 26 Dec 2025 10:23:30 GMT
- Title: Perceive and Calibrate: Analyzing and Enhancing Robustness of Medical Multi-Modal Large Language Models
- Authors: Dunyuan XU, Xikai Yang, Yaoqian Li, Juzheng Miao, Jinpeng Li, Pheng-Ann Heng
- Abstract summary: This work systematically analyzes the impact of various perturbations on medical MLLMs. For the visual modality, we propose a Perturbation-aware Denoising Calibration (PDC) module, which leverages MLLMs' own vision encoder to identify noise patterns. For text denoising, we design a Self-instantiated Multi-agent System (SMS) that exploits the MLLMs' self-assessment capabilities to refine noisy text.
- Score: 43.46006663176283
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical Multi-modal Large Language Models (MLLMs) have shown promising clinical performance. However, their sensitivity to real-world input perturbations, such as imaging artifacts and textual errors, critically undermines their clinical applicability. Systematic analysis of such noise impact on medical MLLMs remains largely unexplored. Furthermore, while several works have investigated the MLLMs' robustness in general domains, they primarily focus on text modality and rely on costly fine-tuning. They are inadequate to address the complex noise patterns and fulfill the strict safety standards in medicine. To bridge this gap, this work systematically analyzes the impact of various perturbations on medical MLLMs across both visual and textual modalities. Building on our findings, we introduce a training-free Inherent-enhanced Multi-modal Calibration (IMC) framework that leverages MLLMs' inherent denoising capabilities following the perceive-and-calibrate principle for cross-modal robustness enhancement. For the visual modality, we propose a Perturbation-aware Denoising Calibration (PDC) which leverages MLLMs' own vision encoder to identify noise patterns and perform prototype-guided feature calibration. For text denoising, we design a Self-instantiated Multi-agent System (SMS) that exploits the MLLMs' self-assessment capabilities to refine noisy text through a cooperative hierarchy of agents. We construct a benchmark containing 11 types of noise across both image and text modalities on 2 datasets. Experimental results demonstrate our method achieves the state-of-the-art performance across multiple modalities, showing potential to enhance MLLMs' robustness in real clinical scenarios.
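The perceive-and-calibrate principle behind PDC can be illustrated with a minimal sketch: a noisy visual feature is first matched against noise-pattern prototypes (perceive), then pulled toward a clean prototype (calibrate). All names, the toy prototypes, and the mixing weight `alpha` below are illustrative assumptions, not the paper's actual implementation.

```python
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-8)

def perceive_noise(feature, noise_prototypes):
    # "Perceive": pick the noise pattern whose prototype best matches the feature.
    return max(noise_prototypes, key=lambda k: cosine(feature, noise_prototypes[k]))

def calibrate(feature, clean_prototype, alpha=0.7):
    # "Calibrate": interpolate the noisy feature toward a clean prototype.
    return [(1 - alpha) * f + alpha * c for f, c in zip(feature, clean_prototype)]

# Toy 4-d feature space (values are made up for illustration).
clean = [1.0, 0.0, 0.0, 0.0]
protos = {"gaussian": [0.0, 1.0, 0.0, 0.0], "motion_blur": [0.0, 0.0, 1.0, 0.0]}
noisy = [0.4, 0.8, 0.0, 0.0]  # clean signal contaminated along the "gaussian" axis

kind = perceive_noise(noisy, protos)
calibrated = calibrate(noisy, clean)
```

After calibration, the feature's cosine similarity to the clean prototype increases, which is the effect the prototype-guided calibration aims for.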
Related papers
- A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models [85.30893355216486]
We study how visual token redundancy evolves with different dMLLM architectures and tasks. Our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks. Layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs.
arXiv Detail & Related papers (2025-11-19T04:13:36Z) - Analysis of Image-and-Text Uncertainty Propagation in Multimodal Large Language Models with Cardiac MR-Based Applications [10.096013178241117]
Multimodal large language models (MLLMs) can process and integrate information from multi-modality sources, such as text and images. Uncertainties due to individual uni-modal data and potential clinical applications are not yet fully understood. We propose a multimodal uncertainty propagation model (MUPM) based on uncertainty propagation.
arXiv Detail & Related papers (2025-07-17T09:34:21Z) - Medical large language models are easily distracted [0.8211696054238238]
Large language models (LLMs) have the potential to transform medicine, but real-world clinical scenarios contain extraneous information that can hinder performance. We developed MedDistractQA, a benchmark using USMLE-style questions embedded with simulated real-world distractions.
arXiv Detail & Related papers (2025-04-01T21:34:01Z) - LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition? [59.81732629438753]
We propose LLaVA-RadZ, a simple yet effective framework for zero-shot medical disease recognition that utilizes existing MLLM features. Specifically, we design an end-to-end training strategy, termed Decoding-Side Feature Alignment Training (DFAT), to take advantage of the characteristics of the MLLM decoder architecture. We also introduce a Domain Knowledge Anchoring Module (DKAM) to exploit the intrinsic medical knowledge of large models.
arXiv Detail & Related papers (2025-03-10T16:05:40Z) - Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks. However, LVLMs still suffer from hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. We propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
arXiv Detail & Related papers (2025-01-03T17:56:28Z) - Mitigating Hallucinations of Large Language Models in Medical Information Extraction via Contrastive Decoding [92.32881381717594]
We introduce ALternate Contrastive Decoding (ALCD) to solve hallucination issues in medical information extraction tasks.
ALCD demonstrates significant improvements in resolving hallucination issues compared to conventional decoding methods.
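For context, contrastive decoding in this family of methods generally works by amplifying the logits of a trusted pass and subtracting those of a hallucination-prone pass before picking the next token. The sketch below shows that generic mechanism on toy logits; the function name, the `alpha` weight, and the toy values are assumptions, and ALCD's actual alternation between its sub-models is more involved.

```python
def contrastive_logits(expert_logits, contrast_logits, alpha=0.5):
    # Amplify the trusted (expert) distribution and subtract the
    # hallucination-prone (contrast) distribution, token by token.
    return [(1 + alpha) * e - alpha * c
            for e, c in zip(expert_logits, contrast_logits)]

# Toy logits over a 3-token vocabulary (values are made up).
expert = [1.0, 1.1, 0.2]    # on its own, weakly prefers token 1
contrast = [0.0, 3.0, 0.0]  # hallucination-prone pass strongly prefers token 1

adjusted = contrastive_logits(expert, contrast)
token = max(range(len(adjusted)), key=adjusted.__getitem__)
# Subtracting the contrast scores steers the choice away from token 1,
# suppressing the candidate the hallucination-prone pass over-favours.
```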
arXiv Detail & Related papers (2024-10-21T07:19:19Z) - LLM4Brain: Training a Large Language Model for Brain Video Understanding [9.294352205183726]
We introduce an LLM-based approach for reconstructing visual-semantic information from fMRI signals elicited by video stimuli.
We employ fine-tuning techniques on an fMRI encoder equipped with adaptors to transform brain responses into latent representations aligned with the video stimuli.
In particular, we integrate self-supervised domain adaptation methods to enhance the alignment between visual-semantic information and brain responses.
arXiv Detail & Related papers (2024-09-26T15:57:08Z) - From Training-Free to Adaptive: Empirical Insights into MLLMs' Understanding of Detection Information [32.57246173437492]
Vision detection models excel at recognizing fine-grained image details. One effective strategy is to infuse detection information in text format, which has proven simple and effective. This paper addresses the question: how does training impact MLLMs' understanding of infused textual detection information?
arXiv Detail & Related papers (2024-01-31T16:38:32Z) - Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning [67.0609518552321]
We propose to conduct Machine Vision Therapy which aims to rectify the noisy predictions from vision models.
By fine-tuning with the denoised labels, the learning model performance can be boosted in an unsupervised manner.
arXiv Detail & Related papers (2023-12-05T07:29:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.