Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation
- URL: http://arxiv.org/abs/2510.22067v2
- Date: Mon, 10 Nov 2025 17:37:52 GMT
- Title: Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation
- Authors: Zheng Qi, Chao Shang, Evangelia Spiliopoulou, Nikolaos Pappas,
- Abstract summary: Vision language models (VLMs) often generate hallucinations, i.e., content that cannot be substantiated by visual inputs. We propose a method called Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT) to mitigate hallucination.
- Score: 8.805397340243557
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision language models (VLMs) often generate hallucinations, i.e., content that cannot be substantiated by either textual or visual inputs. Prior work primarily attributes this to over-reliance on linguistic prior knowledge rather than visual inputs. Some methods attempt to mitigate hallucination by amplifying attention to visual tokens in proportion to their attention scores. However, these methods overlook the visual attention sink problem, where attention is frequently misallocated to task-irrelevant visual regions, and neglect cross-modal fusion balance by enhancing only visual attention without adjusting attention to the user query. This can amplify incorrect regions while failing to properly interpret the user query. To address these challenges, we propose a simple yet effective method called Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT). GIFT pre-computes a holistic visual saliency map by tracking positive changes in visual attention, or "gaze shifts", during user query comprehension, and leverages this map to amplify attention to both salient visual information and the user query at each decoding step. This reduces the impact of the visual attention sink, since irrelevant tokens exhibit minimal shifts, while ensuring balanced cross-modal fusion for a well-integrated representation. Extensive experiments show that GIFT effectively mitigates hallucination in VLMs across both generative and classification tasks, achieving up to a 20.7% improvement over greedy decoding, while maintaining general vision-language performance with low computational overhead.
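To make the mechanism concrete, the following is a minimal Python sketch of the two stages described in the abstract: accumulating positive attention shifts into a saliency map while the model reads the query, then boosting both salient visual tokens and query tokens at a decoding step. The function names, tensor shapes, and scaling scheme (`alpha`, `beta`) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def gaze_shift_saliency(query_attn: np.ndarray) -> np.ndarray:
    """Build a holistic visual saliency map from "gaze shifts".

    query_attn: (num_query_tokens, num_visual_tokens) attention weights
    from each successive user-query token to the visual tokens, collected
    while the model reads the query. Returns a (num_visual_tokens,) map of
    accumulated *positive* attention changes; task-irrelevant sink tokens
    show little shift and so receive low saliency.
    """
    shifts = np.diff(query_attn, axis=0)        # change between query tokens
    positive = np.clip(shifts, 0.0, None)       # keep only increases
    saliency = positive.sum(axis=0)
    total = saliency.sum()
    return saliency / total if total > 0.0 else saliency

def enhance_attention(attn_logits: np.ndarray,
                      visual_idx: np.ndarray,
                      query_idx: np.ndarray,
                      saliency: np.ndarray,
                      alpha: float = 1.0,
                      beta: float = 1.0) -> np.ndarray:
    """One decoding step: boost salient visual tokens *and* the user query,
    keeping cross-modal fusion balanced, then renormalize with a softmax."""
    boosted = attn_logits.astype(float)
    boosted[visual_idx] += alpha * saliency     # amplify salient regions
    boosted[query_idx] += beta                  # keep the query in play
    exp = np.exp(boosted - boosted.max())       # numerically stable softmax
    return exp / exp.sum()

# Toy usage: 4 query tokens attending over 6 visual tokens, then a
# 10-position attention row (6 visual + 4 query positions) to enhance.
rng = np.random.default_rng(0)
sal = gaze_shift_saliency(rng.dirichlet(np.ones(6), size=4))
probs = enhance_attention(rng.normal(size=10),
                          visual_idx=np.arange(6),
                          query_idx=np.arange(6, 10),
                          saliency=sal)
```

Because the saliency map depends only on the query comprehension pass, it can be computed once and reused at every decoding step, which is consistent with the low computational overhead the abstract reports.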
Related papers
- Look Carefully: Adaptive Visual Reinforcements in Multimodal Large Language Models for Hallucination Mitigation [51.743225614196774]
Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language reasoning. They remain vulnerable to hallucination, where generated content deviates from visual evidence. Recent vision enhancement methods attempt to address this issue by reinforcing visual tokens during decoding. We propose Adaptive Visual Reinforcement (AIR), a training-free framework for MLLMs.
arXiv Detail & Related papers (2026-02-27T14:18:51Z)
- Attention to details, logits to truth: visual-aware attention and logits enhancement to mitigate hallucinations in LVLMs [12.578567672069601]
We propose a training-free attentional intervention algorithm to enhance the attention of task-relevant tokens. To increase the contribution of visual tokens, we inject visual attention values into the beam search decoding to identify solutions with higher visual attention.
arXiv Detail & Related papers (2026-02-10T08:26:50Z)
- Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning [79.34909830834464]
Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. We show that visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance. We propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level.
arXiv Detail & Related papers (2025-09-08T09:20:04Z)
- CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models [60.0300765815417]
Large Vision-Language Models (LVLMs) frequently produce content that deviates from visual information, leading to object hallucination. We propose Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play hallucination mitigation method.
arXiv Detail & Related papers (2025-06-30T07:52:36Z)
- Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs [9.406760867809124]
This paper introduces VISER (Visual Input Structure for Enhanced Reasoning), a simple yet effective intervention. We empirically demonstrate substantial performance improvements across core visual reasoning tasks. We find that low-level visual structuring is a powerful and underexplored direction for improving compositional visual reasoning.
arXiv Detail & Related papers (2025-06-27T11:44:40Z)
- Not All Tokens and Heads Are Equally Important: Dual-Level Attention Intervention for Hallucination Mitigation [46.3194503355054]
Large vision-language models (LVLMs) have demonstrated impressive capabilities across diverse multimodal tasks. They remain highly susceptible to visual hallucinations (VH), often producing confident but inaccurate descriptions. We introduce VisFlow, a framework that alleviates hallucinations by directly modulating attention patterns during inference.
arXiv Detail & Related papers (2025-06-14T19:10:22Z)
- Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding [12.82009632507056]
Existing vision-language models (VLMs) often suffer from visual hallucination, where the generated responses contain inaccuracies that are not grounded in the visual input. We propose the Perception Magnifier (PM), a novel visual decoding method that iteratively isolates relevant visual tokens based on attention and magnifies the corresponding regions, spurring the model to concentrate on fine-grained visual details during decoding.
arXiv Detail & Related papers (2025-03-13T09:14:11Z)
- Attention Hijackers: Detect and Disentangle Attention Hijacking in LVLMs for Hallucination Mitigation [123.54980913741828]
Large Vision-Language Models (LVLMs) remain vulnerable to hallucinations. We propose a novel, training-free strategy named Attention HIjackers Detection and Disentanglement (AID). AID identifies Attention Hijackers by calculating instruction-driven visual salience. Next, an Attention Disentanglement mechanism is proposed to mask the visual attention of these identified Hijackers. Finally, Re-Disentanglement recalculates the balance between instruction-driven and image-driven visual salience to avoid over-masking effects.
arXiv Detail & Related papers (2025-03-11T09:35:55Z)
- Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks. LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. We propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
arXiv Detail & Related papers (2025-01-03T17:56:28Z)
- Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence [69.86946427928511]
We investigate the internal mechanisms driving hallucination in large vision-language models (LVLMs). We introduce Vision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of attention head outputs to visual context. We propose Vision-aware Head Reinforcement (VHR), a training-free approach to mitigate hallucination by enhancing the role of vision-aware attention heads (a toy sketch of this idea appears at the end of this page).
arXiv Detail & Related papers (2024-12-18T15:29:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
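As an illustration of the attention-head interventions surveyed above, here is a toy Python sketch of the vision-aware head divergence idea from the last entry: score each head by how much its output changes when the visual context is ablated, then reinforce the most vision-sensitive heads. The norm-based score, the top-k gain scheme, and all names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def vision_aware_head_divergence(out_with_image: np.ndarray,
                                 out_without_image: np.ndarray) -> np.ndarray:
    """Toy VHD-style score: per-head sensitivity to visual context.

    Both inputs: (num_heads, seq_len, head_dim) attention-head outputs,
    computed once with the image present and once with it ablated.
    Returns one divergence score per head.
    """
    return np.linalg.norm(out_with_image - out_without_image, axis=(1, 2))

def reinforce_vision_aware_heads(head_outputs: np.ndarray,
                                 scores: np.ndarray,
                                 top_k: int = 4,
                                 gain: float = 1.5) -> np.ndarray:
    """Toy VHR-style step: amplify the top-k most vision-sensitive heads."""
    reinforced = head_outputs.copy()
    top = np.argsort(scores)[-top_k:]   # indices of the top-k heads
    reinforced[top] *= gain
    return reinforced

# Toy usage with 8 heads, 16 positions, 32-dim head outputs.
rng = np.random.default_rng(1)
with_img, without_img = rng.normal(size=(2, 8, 16, 32))
scores = vision_aware_head_divergence(with_img, without_img)
adjusted = reinforce_vision_aware_heads(with_img, scores)
```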