Anatomical Region-Guided Contrastive Decoding: A Plug-and-Play Strategy for Mitigating Hallucinations in Medical VLMs
- URL: http://arxiv.org/abs/2512.17189v1
- Date: Fri, 19 Dec 2025 03:11:20 GMT
- Title: Anatomical Region-Guided Contrastive Decoding: A Plug-and-Play Strategy for Mitigating Hallucinations in Medical VLMs
- Authors: Xiao Liang, Chenxi Liu, Zhi Ma, Di Wang, Bin Jing, Quan Wang, Yuanyuan Shi,
- Abstract summary: Anatomical Region-Guided Contrastive Decoding (ARCD) is a plug-and-play strategy that mitigates hallucinations by providing targeted, region-specific guidance. Our method is effective in improving regional understanding, reducing hallucinations, and enhancing overall diagnostic accuracy.
- Score: 20.507007953026346
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medical Vision-Language Models (MedVLMs) show immense promise in clinical applicability. However, their reliability is hindered by hallucinations, where models often fail to derive answers from visual evidence, instead relying on learned textual priors. Existing mitigation strategies for MedVLMs have distinct limitations: training-based methods rely on costly expert annotations, limiting scalability, while training-free interventions like contrastive decoding, though data-efficient, apply a global, untargeted correction whose effects in complex real-world clinical settings can be unreliable. To address these challenges, we introduce Anatomical Region-Guided Contrastive Decoding (ARCD), a plug-and-play strategy that mitigates hallucinations by providing targeted, region-specific guidance. Our module leverages an anatomical mask to direct a three-tiered contrastive decoding process. By dynamically re-weighting at the token, attention, and logits levels, it verifiably steers the model's focus onto specified regions, reinforcing anatomical understanding and suppressing factually incorrect outputs. Extensive experiments across diverse datasets, including chest X-ray, CT, brain MRI, and ocular ultrasound, demonstrate our method's effectiveness in improving regional understanding, reducing hallucinations, and enhancing overall diagnostic accuracy.
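The abstract mentions re-weighting at the logits level but gives no formula, so the following is only a minimal sketch of a generic contrastive decoding step of the kind commonly used in training-free hallucination mitigation, not the ARCD implementation. It assumes two sets of next-token logits are available: one from a pass guided by the anatomical mask and one from an unguided contrast pass. The function names and the parameters `alpha` (contrast strength) and `beta` (plausibility cutoff) are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_logits(region_logits, baseline_logits, alpha=1.0, beta=0.1):
    """Generic contrastive decoding scores (assumed form, not taken from the paper).

    region_logits:   next-token logits from the mask-guided pass (assumption)
    baseline_logits: next-token logits from the unguided contrast pass (assumption)
    alpha:           how strongly the baseline is subtracted (alpha=0 disables it)
    beta:            tokens whose guided probability is below beta * max probability
                     are excluded, so the contrast cannot promote implausible tokens
    """
    if len(region_logits) != len(baseline_logits):
        raise ValueError("both passes must score the same vocabulary")
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must be in (0, 1]")

    lp_region = log_softmax(region_logits)
    lp_base = log_softmax(baseline_logits)
    cutoff = max(lp_region) + math.log(beta)

    scores = []
    for r, b in zip(lp_region, lp_base):
        if r < cutoff:
            scores.append(-math.inf)  # implausible under the guided pass
        else:
            scores.append((1.0 + alpha) * r - alpha * b)
    return scores


def greedy_pick(scores):
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 3-token vocabulary: the unguided pass strongly prefers token 0 (a prior),
# while the guided pass rates tokens 0 and 1 almost equally.
region = [2.0, 1.9, -3.0]
baseline = [2.5, 0.5, -3.0]
plain_choice = greedy_pick(region)
contrast_choice = greedy_pick(contrastive_logits(region, baseline))
```

In this toy case plain greedy decoding on the guided logits picks token 0, while the contrastive scores favour token 1, because token 0 owes most of its score to the unguided prior. The token-level and attention-level re-weighting named in the abstract are not covered by this sketch.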
Related papers
- XAI-CLIP: ROI-Guided Perturbation Framework for Explainable Medical Image Segmentation in Multimodal Vision-Language Models [4.5236257764997205]
XAI-CLIP is an ROI-guided perturbation framework for medical image segmentation. It integrates language-informed region localization with medical image segmentation and applies targeted, region-aware perturbations. XAI-CLIP achieves up to a 60% reduction in runtime, a 44.6% improvement in Dice score, and a 96.7% increase in Intersection-over-Union.
arXiv Detail & Related papers (2026-02-01T00:27:06Z)
- Self-Supervised Anatomical Consistency Learning for Vision-Grounded Medical Report Generation [61.350584471060756]
Vision-grounded medical report generation aims to produce clinically accurate descriptions of medical images. We propose Self-Supervised Anatomical Consistency Learning (SS-ACL) to align generated reports with corresponding anatomical regions. SS-ACL constructs a hierarchical anatomical graph inspired by the invariant top-down inclusion structure of human anatomy.
arXiv Detail & Related papers (2025-09-30T08:59:06Z)
- Self-Supervised Cross-Encoder for Neurodegenerative Disease Diagnosis [6.226851122403944]
We propose a novel self-supervised cross-encoder framework that leverages the temporal continuity in longitudinal MRI scans for supervision. This framework disentangles learned representations into two components: a static representation, constrained by contrastive learning, which captures stable anatomical features; and a dynamic representation, guided by input-gradient regularization, which reflects temporal changes. Experimental results on the Alzheimer's Disease Neuroimaging Initiative dataset demonstrate that our method achieves superior classification accuracy and improved interpretability.
arXiv Detail & Related papers (2025-09-09T11:52:24Z)
- FoundDiff: Foundational Diffusion Model for Generalizable Low-Dose CT Denoising [55.04342933312839]
We propose FoundDiff, a foundational diffusion model for unified and generalizable low-dose computed tomography (CT) denoising. FoundDiff employs a two-stage strategy: (i) dose-anatomy perception and (ii) adaptive denoising. First, we develop a dose- and anatomy-aware contrastive language image pre-training model (DA-CLIP) to achieve robust dose and anatomy perception. Second, we design a dose- and anatomy-aware diffusion model (DA-Diff) to perform adaptive and generalizable denoising.
arXiv Detail & Related papers (2025-08-24T11:03:56Z)
- Towards Accurate and Interpretable Neuroblastoma Diagnosis via Contrastive Multi-scale Pathological Image Analysis [16.268045905735818]
We propose CMSwinKAN, a contrastive-learning-based multi-scale feature fusion model tailored for pathological image classification. By fusing multi-scale features and leveraging contrastive learning strategies, CMSwinKAN mimics clinicians' comprehensive approach. Results demonstrate that CMSwinKAN performs better than existing state-of-the-art pathology-specific models pre-trained on large datasets.
arXiv Detail & Related papers (2025-04-18T15:39:46Z)
- From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation [48.45209969191245]
Vision-language models (VLMs) provide semantic context through textual descriptions but lack the explanation precision required. We propose a teacher-student framework that integrates both gaze and language supervision, leveraging their complementary strengths. Our method achieves Dice scores of 80.78%, 80.53%, and 84.22%, respectively, improving 3-5% over gaze baselines without increasing the annotation burden.
arXiv Detail & Related papers (2025-04-15T16:32:15Z)
- Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks. LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. We propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
arXiv Detail & Related papers (2025-01-03T17:56:28Z)
- FOCUS: Knowledge-enhanced Adaptive Visual Compression for Few-shot Whole Slide Image Classification [4.148491257542209]
Few-shot learning presents a critical solution for cancer diagnosis in computational pathology. A key challenge in this paradigm stems from the inherent disparity between the limited training set of whole slide images (WSIs) and the enormous number of contained patches. We introduce the knowledge-enhanced adaptive visual compression framework, dubbed FOCUS, to enable a focused analysis of diagnostically relevant regions.
arXiv Detail & Related papers (2024-11-22T05:36:38Z)
- Improving Multiple Sclerosis Lesion Segmentation Across Clinical Sites: A Federated Learning Approach with Noise-Resilient Training [75.40980802817349]
Deep learning models have shown promise for automatically segmenting MS lesions, but the scarcity of accurately annotated data hinders progress in this area.
We introduce a Decoupled Hard Label Correction (DHLC) strategy that considers the imbalanced distribution and fuzzy boundaries of MS lesions.
We also introduce a Centrally Enhanced Label Correction (CELC) strategy, which leverages the aggregated central model as a correction teacher for all sites.
arXiv Detail & Related papers (2023-08-31T00:36:10Z)
- Toward Robust Diagnosis: A Contour Attention Preserving Adversarial Defense for COVID-19 Detection [10.953610196636784]
We propose a Contour Attention Preserving (CAP) method based on lung cavity edge extraction.
Experimental results indicate that the proposed method achieves state-of-the-art performance in multiple adversarial defense and generalization tasks.
arXiv Detail & Related papers (2022-11-30T08:01:23Z)
- Few-shot Medical Image Segmentation using a Global Correlation Network with Discriminative Embedding [60.89561661441736]
We propose a novel method for few-shot medical image segmentation.
We construct our few-shot image segmentor using a deep convolutional network trained episodically.
We enhance discriminability of deep embedding to encourage clustering of the feature domains of the same class.
arXiv Detail & Related papers (2020-12-10T04:01:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.