Not All Tokens and Heads Are Equally Important: Dual-Level Attention Intervention for Hallucination Mitigation
- URL: http://arxiv.org/abs/2506.12609v2
- Date: Mon, 18 Aug 2025 08:18:50 GMT
- Title: Not All Tokens and Heads Are Equally Important: Dual-Level Attention Intervention for Hallucination Mitigation
- Authors: Lexiang Tang, Xianwei Zhuang, Bang Yang, Zhiyuan Hu, Hongxiang Li, Lu Ma, Jinghan Ru, Yuexian Zou
- Abstract summary: Large vision-language models (LVLMs) have demonstrated impressive capabilities across diverse multimodal tasks. They remain highly susceptible to visual hallucinations (VH), often producing confident but inaccurate descriptions. We introduce VisFlow, a framework that alleviates hallucinations by directly modulating attention patterns during inference.
- Score: 46.3194503355054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large vision-language models (LVLMs) have demonstrated impressive capabilities across diverse multimodal tasks, yet they remain highly susceptible to visual hallucinations (VH), often producing confident but inaccurate descriptions of visual content. Building on the insight that not all tokens and attention heads contribute equally to VH mitigation, we introduce VisFlow, a lightweight and training-free framework that alleviates hallucinations by directly modulating attention patterns during inference. To address two primary challenges of VH, namely insufficient visual attention and the dominance of language priors, we identify three problematic attention behaviors in LVLMs: (1) disproportionate allocation of attention to uninformative or trailing visual tokens, (2) over-dependence on the previously generated token, and (3) excessive fixation on system prompts that hinders multimodal integration. To overcome these issues, VisFlow introduces a dual-level Attention Intervention, consisting of Token-level Attention Intervention (TAI), which reinforces attention to salient visual regions, and Head-level Attention Intervention (HAI), which suppresses undue focus on system prompts and adjacent text tokens. Together, these interventions strengthen visual alignment while reducing linguistic bias. Extensive experiments across diverse models and benchmarks demonstrate that VisFlow effectively mitigates hallucinations with minimal computational overhead.
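As a rough, unofficial illustration of how such a dual-level intervention could act on a single decoding step's attention weights, the sketch below boosts salient visual tokens (TAI-like) and damps selected heads' attention to the system prompt and the previous token (HAI-like). This is a minimal sketch, not VisFlow's published algorithm: the saliency proxy, the scaling factors alpha and beta, and the hard-coded head list are all assumptions.

```python
# Minimal sketch of a dual-level attention intervention in the spirit of
# VisFlow's TAI/HAI. All constants and heuristics here are illustrative
# assumptions, not the paper's actual procedure or values.
import numpy as np

def intervene(attn, sys_slice, img_slice, heads_to_suppress,
              alpha=1.5, beta=0.5, top_k=8):
    """attn: (num_heads, seq_len) attention weights of the current query
    token over all prior positions; each row sums to 1."""
    attn = attn.copy()
    _, seq_len = attn.shape

    # Token-level intervention (TAI-like): boost the most salient visual
    # tokens. Mean attention across heads stands in for a real saliency score.
    img = np.arange(img_slice.start, img_slice.stop)
    saliency = attn[:, img].mean(axis=0)
    salient = img[np.argsort(saliency)[-top_k:]]
    attn[:, salient] *= alpha

    # Head-level intervention (HAI-like): for heads flagged as over-reliant
    # on language priors, damp attention to the system prompt and to the
    # immediately preceding (just-generated) token.
    for h in heads_to_suppress:
        attn[h, sys_slice] *= beta
        attn[h, seq_len - 1] *= beta

    # Renormalize so each head's weights are again a valid distribution.
    return attn / attn.sum(axis=1, keepdims=True)

# Toy layout: 4 heads over 32 positions = 6 system + 20 image + 6 text tokens.
rng = np.random.default_rng(0)
a = rng.random((4, 32))
a /= a.sum(axis=1, keepdims=True)
out = intervene(a, slice(0, 6), slice(6, 26), heads_to_suppress=[1, 3])
assert np.allclose(out.sum(axis=1), 1.0)
```

In a real LVLM this logic would hook the per-head attention maps of selected layers during generation; everything above is synthetic so the sketch stays self-contained.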
Related papers
- Attention to details, logits to truth: visual-aware attention and logits enhancement to mitigate hallucinations in LVLMs [12.578567672069601]
We propose a training-free attentional intervention algorithm to enhance the attention of task-relevant tokens. To enhance the contribution of visual tokens, we inject visual attention values into beam search decoding to identify beam candidates with higher visual attention.
arXiv Detail & Related papers (2026-02-10T08:26:50Z)
- V-ITI: Mitigating Hallucinations in Multimodal Large Language Models via Visual Inference-Time Intervention [39.81171248046778]
Multimodal Large Language Models (MLLMs) excel in numerous vision-language tasks yet suffer from hallucinations. We propose V-ITI, a lightweight visual inference-time intervention framework integrating a Visual Neglect Detector. Experiments show V-ITI consistently mitigates vision-related hallucinations while preserving general task performance.
arXiv Detail & Related papers (2025-12-03T08:03:54Z)
- Causally-Grounded Dual-Path Attention Intervention for Object Hallucination Mitigation in LVLMs [26.144870818163387]
We propose a framework that models the hallucination process via a structural causal graph. We introduce VTACR, a novel metric that quantifies the modality contribution imbalance during decoding. We design a fine-grained attention intervention mechanism that dynamically adjusts token- and layer-wise attention.
arXiv Detail & Related papers (2025-11-12T06:13:26Z)
- Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation [8.805397340243557]
Vision language models (VLMs) often generate hallucinations, i.e., content that cannot be substantiated by visual inputs. We propose a method called Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT) to mitigate hallucination.
arXiv Detail & Related papers (2025-10-24T23:04:26Z)
- HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models [60.028070589466445]
We propose HERO, a framework that integrates content-adaptive token budget allocation with function-aware token selection. This study provides both empirical insights and practical solutions toward efficient inference in HR-LVLMs.
arXiv Detail & Related papers (2025-09-16T13:22:08Z)
- Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning [79.34909830834464]
Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. We show that visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance. We propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level.
arXiv Detail & Related papers (2025-09-08T09:20:04Z)
- CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models [60.0300765815417]
Large Vision-Language Models (LVLMs) frequently produce content that deviates from visual information, leading to object hallucination. We propose Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play hallucination mitigation method.
arXiv Detail & Related papers (2025-06-30T07:52:36Z)
- Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression [6.838584336878126]
Large vision language models (LVLMs) often suffer from hallucinations, generating texts misaligned with the visual context. Existing methods aimed at reducing hallucinations through inference-time intervention incur a significant increase in latency. We present SPIN, a task-agnostic attention-guided head suppression strategy that can be seamlessly integrated during inference.
arXiv Detail & Related papers (2025-05-22T09:00:57Z)
- Aligning Attention Distribution to Information Flow for Hallucination Mitigation in Large Vision-Language Models [11.385588803559733]
We enhance the model's visual understanding by leveraging the core information embedded in semantic representations. We evaluate our method on three image captioning benchmarks using five different LVLMs, demonstrating its effectiveness in significantly reducing hallucinations.
arXiv Detail & Related papers (2025-05-20T12:10:13Z)
- The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering [42.09744951074433]
We investigate the internal dynamics of hallucination by examining token logit rankings throughout the generation process. We propose VISTA, a training-free inference-time intervention framework that reduces hallucination while promoting genuine information.
arXiv Detail & Related papers (2025-02-05T21:34:02Z)
- PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model [0.0]
Hallucinations often arise from the progressive weakening of attention weight to visual tokens. PAINT (Paying Attention to INformed Tokens) is a plug-and-play framework that intervenes in the self-attention mechanism of Large Vision-Language Models.
arXiv Detail & Related papers (2025-01-21T15:22:31Z)
- Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence [69.86946427928511]
We investigate the internal mechanisms driving hallucination in large vision-language models (LVLMs). We introduce Vision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of attention head outputs to visual context. We propose Vision-aware Head Reinforcement (VHR), a training-free approach to mitigate hallucination by enhancing the role of vision-aware attention heads.
arXiv Detail & Related papers (2024-12-18T15:29:30Z)
- CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs [74.36850397755572]
CATCH addresses issues related to visual defects that cause diminished fine-grained feature perception and cumulative hallucinations in open-ended scenarios.
It is applicable to various visual question-answering tasks without requiring any specific data or prior knowledge, and generalizes robustly to new tasks without additional training.
arXiv Detail & Related papers (2024-11-19T18:27:31Z)
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separate attention spaces for vision and language.
We show that DiMBERT achieves new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
- Visual Perturbation-aware Collaborative Learning for Overcoming the Language Prior Problem [60.0878532426877]
We propose a novel collaborative learning scheme from the viewpoint of visual perturbation calibration.
Specifically, we devise a visual controller to construct two sorts of curated images with different perturbation extents.
The experimental results on two diagnostic VQA-CP benchmark datasets clearly demonstrate its effectiveness.
arXiv Detail & Related papers (2022-07-24T23:50:52Z)
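Several entries above rank attention heads by how much visual information they carry, e.g., Vision-aware Head Divergence (VHD), described as quantifying the sensitivity of attention head outputs to visual context. Below is a minimal, hypothetical sketch of one way such a per-head sensitivity score could be computed; the cosine-distance measure and the synthetic setup are assumptions, not the cited paper's exact recipe.

```python
# Hypothetical per-head visual-sensitivity score: compare each attention
# head's output for the same query token with and without the image.
import numpy as np

def head_visual_sensitivity(out_with_img, out_without_img):
    """Inputs: (num_heads, d) head output vectors. Returns one score per
    head: 1 - cosine similarity; higher = more visually sensitive."""
    a, b = out_with_img, out_without_img
    cos = (a * b).sum(-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8)
    return 1.0 - cos

# Heads whose outputs barely change when the image is removed likely track
# language priors; highly sensitive heads are candidates for reinforcement.
rng = np.random.default_rng(1)
with_img, without_img = rng.normal(size=(2, 8, 64))
print(np.argsort(head_visual_sensitivity(with_img, without_img))[::-1])
```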
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.