Not All Tokens and Heads Are Equally Important: Dual-Level Attention Intervention for Hallucination Mitigation
- URL: http://arxiv.org/abs/2506.12609v1
- Date: Sat, 14 Jun 2025 19:10:22 GMT
- Title: Not All Tokens and Heads Are Equally Important: Dual-Level Attention Intervention for Hallucination Mitigation
- Authors: Lexiang Tang, Xianwei Zhuang, Bang Yang, Zhiyuan Hu, Hongxiang Li, Lu Ma, Jinghan Ru, Yuexian Zou
- Abstract summary: Large vision-language models (LVLMs) have shown remarkable capabilities across a wide range of multimodal tasks. They remain prone to visual hallucination (VH), often producing confident but incorrect descriptions of visual content. We present VisFlow, an efficient and training-free framework designed to mitigate VH by directly manipulating attention patterns during inference.
- Score: 46.3194503355054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large vision-language models (LVLMs) have shown remarkable capabilities across a wide range of multimodal tasks. However, they remain prone to visual hallucination (VH), often producing confident but incorrect descriptions of visual content. We present VisFlow, an efficient and training-free framework designed to mitigate VH by directly manipulating attention patterns during inference. Through systematic analysis, we identify three key pathological attention behaviors in LVLMs: (1) weak visual grounding, where attention to visual tokens is insufficient or misallocated, over-focusing on uninformative regions; (2) language prior dominance, where excessive attention to prior response tokens reinforces autoregressive patterns and impairs multimodal alignment; (3) prompt redundancy, where many attention heads fixate on system prompt tokens, disrupting the integration of image, instruction, and response content. To address these issues, we introduce two inference-time interventions: token-level attention intervention (TAI), which enhances focus on salient visual content, and head-level attention intervention (HAI), which suppresses over-attention to prompt and nearby text tokens. VisFlow operates without additional training or model modifications. Extensive experiments across models and benchmarks show that VisFlow effectively reduces hallucinations and improves visual factuality, with negligible computational cost.
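Neither the abstract nor this listing includes code, but the two interventions are concrete enough to sketch. Below is a minimal, hypothetical PyTorch illustration of what a dual-level attention edit of this kind could look like at one decoding step. The hook point, the constants `alpha` and `beta`, and the assumption that the over-attending heads are identified in advance are all ours for illustration; VisFlow's exact rules may differ.

```python
# Minimal sketch of a dual-level attention intervention (not the authors' code).
# Shapes and scaling constants are assumptions for illustration only.
import torch
import torch.nn.functional as F

def intervene_attention(
    attn_logits: torch.Tensor,     # (batch, heads, q_len, k_len) pre-softmax scores
    visual_mask: torch.Tensor,     # (k_len,) True at image-token positions
    prompt_mask: torch.Tensor,     # (k_len,) True at system-prompt positions
    suppress_heads: list[int],     # heads that over-attend to prompt text (assumed known)
    alpha: float = 1.0,            # TAI-style: additive boost toward visual tokens
    beta: float = 2.0,             # HAI-style: logit penalty on prompt tokens
) -> torch.Tensor:
    """Return post-softmax attention after token- and head-level edits."""
    logits = attn_logits.clone()

    # Token-level intervention: strengthen visual grounding by boosting the
    # scores of visual key positions for every head.
    logits[..., visual_mask] += alpha

    # Head-level intervention: for heads known to fixate on the system prompt,
    # subtract a penalty from prompt key positions.
    for h in suppress_heads:
        logits[:, h][..., prompt_mask] -= beta

    return F.softmax(logits, dim=-1)

# Toy usage: 1 sequence, 4 heads, 1 query step, 10 key tokens.
if __name__ == "__main__":
    torch.manual_seed(0)
    scores = torch.randn(1, 4, 1, 10)
    visual = torch.zeros(10, dtype=torch.bool); visual[2:6] = True   # image tokens
    prompt = torch.zeros(10, dtype=torch.bool); prompt[:2] = True    # system prompt
    attn = intervene_attention(scores, visual, prompt, suppress_heads=[1, 3])
    print(attn.sum(-1))  # rows still sum to 1 after the edits
```

Editing the pre-softmax logits rather than the normalized weights keeps each attention row a valid distribution, which is one plausible reason interventions of this kind can run at inference time with negligible cost and no retraining.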
Related papers
- CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models [60.0300765815417]
Large Vision-Language Models (LVLMs) frequently produce content that deviates from visual information, leading to object hallucination. We propose Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play hallucination mitigation method.
arXiv Detail & Related papers (2025-06-30T07:52:36Z)
- Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression [6.838584336878126]
Large vision-language models (LVLMs) often suffer from hallucinations, generating text misaligned with the visual context. Existing methods that reduce hallucinations through inference-time intervention incur a significant increase in latency. We present SPIN, a task-agnostic attention-guided head suppression strategy that can be seamlessly integrated during inference.
arXiv Detail & Related papers (2025-05-22T09:00:57Z)
- Aligning Attention Distribution to Information Flow for Hallucination Mitigation in Large Vision-Language Models [11.385588803559733]
We enhance the model's visual understanding by leveraging the core information embedded in semantic representations. We evaluate our method on three image captioning benchmarks using five different LVLMs, demonstrating its effectiveness in significantly reducing hallucinations.
arXiv Detail & Related papers (2025-05-20T12:10:13Z)
- The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering [42.09744951074433]
We investigate the internal dynamics of hallucination by examining the rankings of token logits throughout the generation process. We propose VISTA, a training-free inference-time intervention framework that reduces hallucination while promoting genuine information.
arXiv Detail & Related papers (2025-02-05T21:34:02Z)
- PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model [0.0]
Hallucinations often arise from the progressive weakening of attention weights to visual tokens. PAINT (Paying Attention to INformed Tokens) is a plug-and-play framework that intervenes in the self-attention mechanism of Large Vision-Language Models.
arXiv Detail & Related papers (2025-01-21T15:22:31Z)
- Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence [69.86946427928511]
We investigate the internal mechanisms driving hallucination in large vision-language models (LVLMs). We introduce Vision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of attention head outputs to visual context (a toy version of such a head-sensitivity score is sketched after this list). We propose Vision-aware Head Reinforcement (VHR), a training-free approach to mitigate hallucination by enhancing the role of vision-aware attention heads.
arXiv Detail & Related papers (2024-12-18T15:29:30Z)
- CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs [74.36850397755572]
CATCH addresses visual defects that diminish fine-grained feature perception and cause cumulative hallucinations in open-ended scenarios.
It is applicable to various visual question-answering tasks without requiring any specific data or prior knowledge, and generalizes robustly to new tasks without additional training.
arXiv Detail & Related papers (2024-11-19T18:27:31Z)
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
- Visual Perturbation-aware Collaborative Learning for Overcoming the Language Prior Problem [60.0878532426877]
We propose a novel collaborative learning scheme from the viewpoint of visual perturbation calibration.
Specifically, we devise a visual controller to construct two sorts of curated images with different perturbation extents.
The experimental results on two diagnostic VQA-CP benchmark datasets clearly demonstrate its effectiveness.
arXiv Detail & Related papers (2022-07-24T23:50:52Z)
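Several entries above (notably the Vision-aware Head Divergence paper) turn on measuring how much an attention head actually relies on the image. As a toy illustration only, here is one way such a head-sensitivity score could be computed: compare each head's output with and without the visual keys and take the gap. The masking scheme and the L2 distance are assumptions, not the authors' VHD definition.

```python
# Illustrative head-sensitivity score: not the VHD implementation.
import torch
import torch.nn.functional as F

def head_visual_sensitivity(
    q: torch.Tensor,            # (heads, q_len, d) per-head queries
    k: torch.Tensor,            # (heads, k_len, d) per-head keys
    v: torch.Tensor,            # (heads, k_len, d) per-head values
    visual_mask: torch.Tensor,  # (k_len,) True at image-token positions
) -> torch.Tensor:
    """Score each head by how much its output changes when image keys are hidden."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5          # (heads, q_len, k_len)

    out_full = F.softmax(scores, dim=-1) @ v           # with visual context

    # Hide visual keys (assumes at least one non-visual key remains).
    blocked = scores.masked_fill(visual_mask, float("-inf"))
    out_blind = F.softmax(blocked, dim=-1) @ v         # visual keys removed

    # Per-head mean L2 gap: larger => the head relies more on visual tokens.
    return (out_full - out_blind).norm(dim=-1).mean(dim=-1)   # (heads,)

# Toy usage: 8 heads, 1 query step, 10 keys of which positions 3..6 are visual.
if __name__ == "__main__":
    torch.manual_seed(0)
    heads, q_len, k_len, dim = 8, 1, 10, 16
    q = torch.randn(heads, q_len, dim)
    k = torch.randn(heads, k_len, dim)
    v = torch.randn(heads, k_len, dim)
    visual = torch.zeros(k_len, dtype=torch.bool); visual[3:7] = True
    print(head_visual_sensitivity(q, k, v, visual))  # one score per head
```

Heads with a large score are candidates for reinforcement (as in VHR-style methods), while heads that barely change without the image are candidates for suppression (as in SPIN- or HAI-style methods).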