V-ITI: Mitigating Hallucinations in Multimodal Large Language Models via Visual Inference-Time Intervention
- URL: http://arxiv.org/abs/2512.03542v1
- Date: Wed, 03 Dec 2025 08:03:54 GMT
- Title: V-ITI: Mitigating Hallucinations in Multimodal Large Language Models via Visual Inference-Time Intervention
- Authors: Nan Sun, Zhenyu Zhang, Xixun Lin, Kun Wang, Yanmin Shang, Naibin Gu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang, Yanan Cao,
- Abstract summary: Multimodal Large Language Models (MLLMs) excel in numerous vision-language tasks yet suffer from hallucinations.<n>We propose V-ITI, a lightweight visual inference-time intervention framework integrating a Visual Neglect Detector.<n>Experiments show V-ITI consistently mitigates vision-related hallucinations while preserving general task performance.
- Score: 39.81171248046778
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) excel in numerous vision-language tasks yet suffer from hallucinations, producing content inconsistent with input visuals, that undermine reliability in precision-sensitive domains. This issue stems from a fundamental problem of visual neglect, where models fail to adequately prioritize input images. Existing methods typically alleviate hallucinations by intervening in the attention score or output logits, focusing on "how to intervene" but overlooking the prerequisite "when to intervene", which leads to the "over-intervention" problem and subsequently introduces new hallucinations and unnecessary computational overhead. To address this gap, we first investigate the mechanism of visual neglect and reveal it can be accurately detected via head-level activation patterns in MLLMs. We thus propose V-ITI, a lightweight visual inference-time intervention framework integrating a Visual Neglect Detector that identifies visual neglect via head-level discriminative probes and a Visual Recall Intervenor that modulates activations with prestored visual activation information only when the visual neglect is detected. Extensive experiments across eight benchmarks and different MLLM families demonstrate that V-ITI consistently mitigates vision-related hallucinations while preserving general task performance.
Related papers
- CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models [60.0300765815417]
Large Vision-Language Models (LVLMs) frequently produce content that deviates from visual information, leading to object hallucination.<n>We propose Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play hallucination mitigation method.
arXiv Detail & Related papers (2025-06-30T07:52:36Z) - Not All Tokens and Heads Are Equally Important: Dual-Level Attention Intervention for Hallucination Mitigation [46.3194503355054]
Large vision-language models (LVLMs) have demonstrated impressive capabilities across diverse multimodal tasks.<n>They remain highly susceptible to visual hallucinations (VH), often producing confident but inaccurate descriptions.<n>We introduce VisFlow, a framework that alleviates hallucinations by directly modulating attention patterns during inference.
arXiv Detail & Related papers (2025-06-14T19:10:22Z) - Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks.<n>LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content.<n>We propose an Inter-Modality Correlation Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
arXiv Detail & Related papers (2025-01-03T17:56:28Z) - Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence [69.86946427928511]
We investigate the internal mechanisms driving hallucination in large vision-language models (LVLMs)<n>We introduce Vision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of attention head outputs to visual context.<n>We propose Vision-aware Head Reinforcement (VHR), a training-free approach to mitigate hallucination by enhancing the role of vision-aware attention heads.
arXiv Detail & Related papers (2024-12-18T15:29:30Z) - Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning [151.4060202671114]
multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing vision-language tasks.<n>This paper introduces a novel bottom-up reasoning framework to address hallucinations in MLLMs.<n>Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge.
arXiv Detail & Related papers (2024-12-15T09:10:46Z) - CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs [74.36850397755572]
CATCH addresses issues related to visual defects that cause diminished fine-grained feature perception and cumulative hallucinations in open-ended scenarios.
It is applicable to various visual question-answering tasks without requiring any specific data or prior knowledge, and generalizes robustly to new tasks without additional training.
arXiv Detail & Related papers (2024-11-19T18:27:31Z) - Reducing Hallucinations in Vision-Language Models via Latent Space Steering [34.1755878632361]
Hallucination poses a challenge to the deployment of large vision-language models (LVLMs) in applications.
We introduce Visual and Textual Intervention (VTI), a novel technique designed to reduce hallucinations by steering latent space representations during inference to enhance the stability of vision features.
arXiv Detail & Related papers (2024-10-21T08:42:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.