Attention to details, logits to truth: visual-aware attention and logits enhancement to mitigate hallucinations in LVLMs
- URL: http://arxiv.org/abs/2602.09521v1
- Date: Tue, 10 Feb 2026 08:26:50 GMT
- Title: Attention to details, logits to truth: visual-aware attention and logits enhancement to mitigate hallucinations in LVLMs
- Authors: Jingyi Wang, Fei Li, Rujie Liu
- Abstract summary: We propose a training-free attentional intervention algorithm to enhance the attention of task-relevant tokens. To enhance the contribution of visual tokens, we inject visual attention values into the beam search decoding to identify solutions with higher visual attention.
- Score: 12.578567672069601
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing Large Vision-Language Models (LVLMs) exhibit insufficient visual attention, leading to hallucinations. To alleviate this problem, some previous studies adjust and amplify visual attention. However, these methods share a limitation: boosting attention for all visual tokens inevitably increases attention to task-irrelevant tokens. To tackle this challenge, we propose a training-free attentional intervention algorithm that enhances the attention of task-relevant tokens, based on the observation that task-relevant tokens generally exhibit high visual-textual similarity. Specifically, the vision-text cross-attention submatrices, which represent visual-textual correlations, are extracted to construct reweighting matrices that reallocate attention. In addition, to enhance the contribution of visual tokens, we inject visual attention values into beam search decoding to identify solutions with higher visual attention. Extensive experiments demonstrate that this method significantly reduces hallucinations across mainstream LVLMs while preserving the accuracy and coherence of generated content.
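The abstract describes two interacting mechanisms: reweighting attention toward task-relevant visual tokens using the vision-text cross-attention submatrix, and adding a visual-attention term to beam search scoring. Below is a minimal PyTorch-style sketch of both ideas; the function names, shape conventions, slice arguments, and hyperparameters (`alpha`, `beta`) are illustrative assumptions, not the authors' implementation.

```python
import torch

def reweight_visual_attention(attn, vis_slice, txt_slice, alpha=1.0):
    """Sketch of cross-attention-based attention reweighting.

    attn: (heads, seq, seq) post-softmax attention map of one layer.
    vis_slice / txt_slice: slices selecting visual and text token positions.
    """
    # Vision-text cross-attention submatrix: attention from each text
    # token to each visual token, shape (heads, n_txt, n_vis).
    cross = attn[:, txt_slice, vis_slice]
    # Mean text-to-vision attention as a proxy for the visual-textual
    # similarity of each visual token, shape (heads, n_vis).
    relevance = cross.mean(dim=1)
    # Reweighting factors: emphasize visual tokens with high similarity,
    # rather than boosting all visual tokens uniformly.
    weights = 1.0 + alpha * relevance / (relevance.sum(-1, keepdim=True) + 1e-6)
    out = attn.clone()
    out[:, :, vis_slice] = out[:, :, vis_slice] * weights.unsqueeze(1)
    # Renormalize so each row remains a probability distribution.
    return out / out.sum(dim=-1, keepdim=True)

def beam_score_with_visual_bonus(log_prob, attn, vis_slice, beta=0.5):
    """Sketch of injecting visual attention into beam search scoring:
    a candidate whose latest decoding step attends more to the image
    receives a score bonus."""
    visual_mass = attn[:, -1, vis_slice].sum(dim=-1).mean()  # scalar in (0, 1)
    return log_prob + beta * torch.log(visual_mass + 1e-6)
```

In practice the reweighting would be applied inside selected attention layers (e.g., via forward hooks); which layers are intervened on is not specified by this summary.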
Related papers
- AdaVBoost: Mitigating Hallucinations in LVLMs via Token-Level Adaptive Visual Attention Boosting [58.47431497789902]
Visual attention boosting has emerged as a promising direction for mitigating hallucinations in Large Vision-Language Models (LVLMs). We propose AdaVBoost, a token-level visual attention boosting framework that adaptively determines how much attention to boost at each generation step. We show that AdaVBoost significantly outperforms baseline methods across multiple LVLMs and hallucination benchmarks.
arXiv Detail & Related papers (2026-02-14T04:44:37Z)
- Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation [8.805397340243557]
Vision language models (VLMs) often generate hallucinations, i.e., content that cannot be substantiated by visual inputs. We propose a method called Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT) to mitigate hallucination.
arXiv Detail & Related papers (2025-10-24T23:04:26Z)
- Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning [79.34909830834464]
Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. We show that visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance. We propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level. (A minimal attention-entropy sketch appears after this list.)
arXiv Detail & Related papers (2025-09-08T09:20:04Z)
- CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models [60.0300765815417]
Large Vision-Language Models (LVLMs) frequently produce content that deviates from visual information, leading to object hallucination. We propose Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play hallucination mitigation method.
arXiv Detail & Related papers (2025-06-30T07:52:36Z)
- Not All Tokens and Heads Are Equally Important: Dual-Level Attention Intervention for Hallucination Mitigation [46.3194503355054]
Large vision-language models (LVLMs) have demonstrated impressive capabilities across diverse multimodal tasks. Yet they remain highly susceptible to visual hallucinations (VH), often producing confident but inaccurate descriptions. We introduce VisFlow, a framework that alleviates hallucinations by directly modulating attention patterns during inference.
arXiv Detail & Related papers (2025-06-14T19:10:22Z)
- Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs [51.93737995405164]
Large Vision-Language Models (LVLMs) are susceptible to hallucinations. We introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy. We show that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency. (A PMI-calibration sketch appears after this list.)
arXiv Detail & Related papers (2025-05-26T08:36:10Z)
- Attention Hijackers: Detect and Disentangle Attention Hijacking in LVLMs for Hallucination Mitigation [123.54980913741828]
Large Vision-Language Models (LVLMs) remain vulnerable to hallucinations. We propose a novel, training-free strategy named Attention HIjackers Detection and Disentanglement (AID). AID identifies Attention Hijackers by calculating instruction-driven visual salience. Next, an Attention Disentanglement mechanism is proposed to mask the visual attention of these identified Hijackers. Finally, Re-Disentanglement recalculates the balance between instruction-driven and image-driven visual salience to avoid over-masking effects.
arXiv Detail & Related papers (2025-03-11T09:35:55Z)
- See What You Are Told: Visual Attention Sink in Large Multimodal Models [4.024850952459758]
Large multimodal models (LMMs) "see" images by leveraging the attention mechanism between text and visual tokens in the transformer decoder. Recent findings indicate that LMMs have an extraordinary tendency to consistently allocate high attention weights to specific visual tokens. We introduce Visual Attention Redistribution (VAR), a method that redistributes attention in image-centric heads. (A redistribution sketch appears after this list.)
arXiv Detail & Related papers (2025-03-05T09:55:07Z)
- PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model [0.0]
Hallucinations often arise from the progressive weakening of attention weights to visual tokens. PAINT (Paying Attention to INformed Tokens) is a plug-and-play framework that intervenes in the self-attention mechanism of Large Vision-Language Models.
arXiv Detail & Related papers (2025-01-21T15:22:31Z)
- Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence [69.86946427928511]
We investigate the internal mechanisms driving hallucination in large vision-language models (LVLMs). We introduce Vision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of attention head outputs to visual context. We propose Vision-aware Head Reinforcement (VHR), a training-free approach that mitigates hallucination by enhancing the role of vision-aware attention heads. (A divergence sketch appears after this list.)
arXiv Detail & Related papers (2024-12-18T15:29:30Z)
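The CARVE entry above rests on the observation that visual complexity correlates with attention entropy. The following sketch measures the entropy of each query token's attention distribution restricted to visual tokens; the shape conventions are assumptions, and this is only the diagnostic, not CARVE's pixel-level contrasting itself.

```python
import torch

def visual_attention_entropy(attn, vis_slice):
    """Entropy of each query token's attention over visual tokens.

    attn: (heads, seq, seq), post-softmax. Diffuse attention over a
    complex image yields high entropy, the regime the CARVE abstract
    associates with degraded reasoning.
    """
    p = attn[:, :, vis_slice]
    p = p / (p.sum(dim=-1, keepdim=True) + 1e-9)    # renormalize over visual tokens
    return -(p * torch.log(p + 1e-9)).sum(dim=-1)   # (heads, seq)
```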
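The C-PMI entry calibrates decoding with conditional pointwise mutual information between the generated text and the image. A minimal sketch of PMI-style calibration follows, contrasting next-token distributions with and without the image; the `model(input_ids, image=...)` interface and the weight `lam` are hypothetical, and this is not the paper's exact C-PMI formulation.

```python
import torch

def pmi_calibrated_logits(model, input_ids, image, lam=0.5):
    """PMI-style decoding calibration (hypothetical model interface)."""
    log_p_with = torch.log_softmax(model(input_ids, image=image), dim=-1)     # log p(y | v, x)
    log_p_without = torch.log_softmax(model(input_ids, image=None), dim=-1)   # log p(y | x)
    # Pointwise mutual information: log p(y | v, x) - log p(y | x).
    pmi = log_p_with - log_p_without
    # Favor tokens whose probability is grounded in the visual input.
    return log_p_with + lam * pmi
```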
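The visual-attention-sink entry redistributes attention mass away from sink tokens in image-centric heads. A sketch of one plausible redistribution step appears below; the sink criterion (a high quantile of average received attention) and the proportional reallocation are illustrative choices, not the paper's method.

```python
import torch

def redistribute_sink_attention(attn, vis_slice, sink_quantile=0.99):
    """Move attention mass off 'sink' visual tokens onto the rest.

    attn: (heads, seq, seq) post-softmax attention map of one layer.
    """
    vis = attn[:, :, vis_slice].clone()        # (heads, seq, n_vis)
    received = vis.mean(dim=(0, 1))            # average attention per visual token
    sinks = received > torch.quantile(received, sink_quantile)
    surplus = vis[:, :, sinks].sum(dim=-1, keepdim=True)   # mass to reallocate
    vis[:, :, sinks] = 0.0
    keep = ~sinks
    # Redistribute the surplus proportionally to the remaining visual tokens.
    share = vis[:, :, keep] / (vis[:, :, keep].sum(dim=-1, keepdim=True) + 1e-9)
    vis[:, :, keep] = vis[:, :, keep] + surplus * share
    out = attn.clone()
    out[:, :, vis_slice] = vis
    return out
```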
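The VHD entry quantifies how sensitive each attention head's output is to visual context. One plausible instantiation is sketched below, comparing aligned head outputs from two forward passes; the L2 divergence measure is an assumption, not necessarily the metric the paper defines.

```python
import torch

def vision_aware_head_divergence(head_out_with, head_out_without):
    """VHD-like score under assumed shapes.

    head_out_*: (layers, heads, n_txt, dim) per-head outputs at the same
    text positions, with and without image tokens in context. Heads whose
    outputs shift most under visual context score highest; a VHR-style
    reinforcement would then upweight the top-scoring heads.
    """
    diff = head_out_with - head_out_without
    return diff.norm(dim=-1).mean(dim=-1)   # (layers, heads)
```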