Not All Tokens and Heads Are Equally Important: Dual-Level Attention Intervention for Hallucination Mitigation
- URL: http://arxiv.org/abs/2506.12609v1
- Date: Sat, 14 Jun 2025 19:10:22 GMT
- Title: Not All Tokens and Heads Are Equally Important: Dual-Level Attention Intervention for Hallucination Mitigation
- Authors: Lexiang Tang, Xianwei Zhuang, Bang Yang, Zhiyuan Hu, Hongxiang Li, Lu Ma, Jinghan Ru, Yuexian Zou
- Abstract summary: Large vision-language models (LVLMs) have shown remarkable capabilities across a wide range of multimodal tasks. They remain prone to visual hallucination (VH), often producing confident but incorrect descriptions of visual content. We present VisFlow, an efficient and training-free framework designed to mitigate VH by directly manipulating attention patterns during inference.
- Score: 46.3194503355054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large vision-language models (LVLMs) have shown remarkable capabilities across a wide range of multimodal tasks. However, they remain prone to visual hallucination (VH), often producing confident but incorrect descriptions of visual content. We present VisFlow, an efficient and training-free framework designed to mitigate VH by directly manipulating attention patterns during inference. Through systematic analysis, we identify three key pathological attention behaviors in LVLMs: (1) weak visual grounding, where attention to visual tokens is insufficient or misallocated, over-focusing on uninformative regions; (2) language prior dominance, where excessive attention to prior response tokens reinforces autoregressive patterns and impairs multimodal alignment; (3) prompt redundancy, where many attention heads fixate on system prompt tokens, disrupting the integration of image, instruction, and response content. To address these issues, we introduce two inference-time interventions: token-level attention intervention (TAI), which enhances focus on salient visual content, and head-level attention intervention (HAI), which suppresses over-attention to prompt and nearby text tokens. VisFlow operates without additional training or model modifications. Extensive experiments across models and benchmarks show that VisFlow effectively reduces hallucinations and improves visual factuality, with negligible computational cost.
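As a rough illustration of what an inference-time attention intervention of this kind can look like, the sketch below rescales one row of post-softmax attention weights: it boosts weights on visual tokens, damps weights on system-prompt tokens, and renormalizes. This is not VisFlow's actual procedure, which the abstract does not specify at this level of detail. The function name `reweight_attention`, the `boost` and `damp` factors, and the token positions are all hypothetical choices made for this example.

```python
from typing import List, Sequence


def reweight_attention(
    weights: Sequence[float],
    visual_idx: Sequence[int],
    prompt_idx: Sequence[int],
    boost: float = 1.5,   # hypothetical scaling factor for visual tokens
    damp: float = 0.5,    # hypothetical scaling factor for system-prompt tokens
) -> List[float]:
    """Scale one row of post-softmax attention weights, then renormalize.

    Illustrative assumption only: the paper's TAI and HAI may select tokens
    and heads, and scale them, in a different way.
    """
    visual = set(visual_idx)
    prompt = set(prompt_idx)

    scaled = []
    for i, w in enumerate(weights):
        if i in visual:
            scaled.append(w * boost)   # strengthen attention to visual tokens
        elif i in prompt:
            scaled.append(w * damp)    # weaken attention to system-prompt tokens
        else:
            scaled.append(w)           # leave other tokens unchanged

    total = sum(scaled)
    if total == 0:
        return list(weights)           # nothing to renormalize
    return [w / total for w in scaled]  # weights sum to 1 again


# Toy row: position 0 is a system-prompt token, 1 and 2 are visual tokens,
# 3 is a previously generated response token.
row = [0.4, 0.1, 0.1, 0.4]
new_row = reweight_attention(row, visual_idx=[1, 2], prompt_idx=[0])
print(new_row)
```

In this toy row the visual tokens' share of attention rises and the system-prompt token's share falls, while the row still sums to 1. A head-level variant would apply such a rescaling only to selected attention heads.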
Related papers
- Attention to details, logits to truth: visual-aware attention and logits enhancement to mitigate hallucinations in LVLMs [12.578567672069601]
We propose a training-free attentional intervention algorithm to enhance the attention of task-relevant tokens. To enhance the contribution of visual tokens, we inject visual attention values into the beam search decoding to identify solutions with higher visual attention.
arXiv Detail & Related papers (2026-02-10T08:26:50Z)
- V-ITI: Mitigating Hallucinations in Multimodal Large Language Models via Visual Inference-Time Intervention [39.81171248046778]
Multimodal Large Language Models (MLLMs) excel in numerous vision-language tasks yet suffer from hallucinations. We propose V-ITI, a lightweight visual inference-time intervention framework integrating a Visual Neglect Detector. Experiments show V-ITI consistently mitigates vision-related hallucinations while preserving general task performance.
arXiv Detail & Related papers (2025-12-03T08:03:54Z)
- Causally-Grounded Dual-Path Attention Intervention for Object Hallucination Mitigation in LVLMs [26.144870818163387]
We propose a framework that models the hallucination process via a structural causal graph. We introduce VTACR, a novel metric that quantifies the modality contribution imbalance during decoding. We design a fine-language attention intervention mechanism that dynamically adjusts token- and layer-wise attention.
arXiv Detail & Related papers (2025-11-12T06:13:26Z)
- Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation [8.805397340243557]
Vision language models (VLMs) often generate hallucination, i.e., content that cannot be substantiated by visual inputs. We propose a method called Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT) to mitigate hallucination.
arXiv Detail & Related papers (2025-10-24T23:04:26Z)
- HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models [60.028070589466445]
We propose HERO, a framework that integrates content-adaptive token budget allocation with function-aware token selection. This study provides both empirical insights and practical solutions toward efficient inference in HR-LVLMs.
arXiv Detail & Related papers (2025-09-16T13:22:08Z)
- Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning [79.34909830834464]
Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. We show that visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance. We propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level.
arXiv Detail & Related papers (2025-09-08T09:20:04Z)
- CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models [60.0300765815417]
Large Vision-Language Models (LVLMs) frequently produce content that deviates from visual information, leading to object hallucination. We propose Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play hallucination mitigation method.
arXiv Detail & Related papers (2025-06-30T07:52:36Z)
- Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression [6.838584336878126]
Large vision language models (LVLMs) often suffer from hallucinations, generating texts misaligned with the visual context. Existing methods aimed at reducing hallucinations through inference time intervention incur a significant increase in latency. We present SPIN, a task-agnostic attention-guided head suppression strategy that can be seamlessly integrated during inference.
arXiv Detail & Related papers (2025-05-22T09:00:57Z)
- Aligning Attention Distribution to Information Flow for Hallucination Mitigation in Large Vision-Language Models [11.385588803559733]
We enhance the model's visual understanding by leveraging the core information embedded in semantic representations. We evaluate our method on three image captioning benchmarks using five different LVLMs, demonstrating its effectiveness in significantly reducing hallucinations.
arXiv Detail & Related papers (2025-05-20T12:10:13Z)
- The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering [42.09744951074433]
We investigate the internal dynamics of hallucination by examining the token logit rankings throughout the generation process. We propose VISTA, a training-free inference-time intervention framework that reduces hallucination while promoting genuine information.
arXiv Detail & Related papers (2025-02-05T21:34:02Z)
- PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model [0.0]
Hallucinations often arise from the progressive weakening of attention weight to visual tokens. PAINT (Paying Attention to INformed Tokens) is a plug-and-play framework that intervenes in the self-attention mechanism of the Large Vision Language Models.
arXiv Detail & Related papers (2025-01-21T15:22:31Z)
- Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence [69.86946427928511]
We investigate the internal mechanisms driving hallucination in large vision-language models (LVLMs). We introduce Vision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of attention head outputs to visual context. We propose Vision-aware Head Reinforcement (VHR), a training-free approach to mitigate hallucination by enhancing the role of vision-aware attention heads.
arXiv Detail & Related papers (2024-12-18T15:29:30Z)
- CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs [74.36850397755572]
CATCH addresses issues related to visual defects that cause diminished fine-grained feature perception and cumulative hallucinations in open-ended scenarios.
It is applicable to various visual question-answering tasks without requiring any specific data or prior knowledge, and generalizes robustly to new tasks without additional training.
arXiv Detail & Related papers (2024-11-19T18:27:31Z)
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
- Visual Perturbation-aware Collaborative Learning for Overcoming the Language Prior Problem [60.0878532426877]
We propose a novel collaborative learning scheme from the viewpoint of visual perturbation calibration.
Specifically, we devise a visual controller to construct two sorts of curated images with different perturbation extents.
The experimental results on two diagnostic VQA-CP benchmark datasets evidently demonstrate its effectiveness.
arXiv Detail & Related papers (2022-07-24T23:50:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.