Attention Hijackers: Detect and Disentangle Attention Hijacking in LVLMs for Hallucination Mitigation
- URL: http://arxiv.org/abs/2503.08216v2
- Date: Fri, 14 Mar 2025 02:03:16 GMT
- Title: Attention Hijackers: Detect and Disentangle Attention Hijacking in LVLMs for Hallucination Mitigation
- Authors: Beitao Chen, Xinyu Lyu, Lianli Gao, Jingkuan Song, Heng Tao Shen
- Abstract summary: Large Vision-Language Models (LVLMs) remain vulnerable to hallucinations. We propose a novel, training-free strategy named Attention HIjackers Detection and Disentanglement (AID). AID identifies Attention Hijackers by calculating instruction-driven visual salience. An Attention Disentanglement mechanism then masks the visual attention of these identified Hijackers. Finally, Re-Disentanglement recalculates the balance between instruction-driven and image-driven visual salience to avoid over-masking effects.
- Score: 123.54980913741828
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite their success, Large Vision-Language Models (LVLMs) remain vulnerable to hallucinations. While existing studies attribute the cause of hallucinations to insufficient visual attention to image tokens, our findings indicate that hallucinations also arise from interference from instruction tokens during decoding. Intuitively, certain instruction tokens continuously distort LVLMs' visual perception during decoding, hijacking their visual attention toward less discriminative visual regions. This distortion prevents them from integrating broader contextual information from images, ultimately leading to hallucinations. We term this phenomenon 'Attention Hijacking', where disruptive instruction tokens act as 'Attention Hijackers'. To address this, we propose a novel, training-free strategy named Attention HIjackers Detection and Disentanglement (AID), designed to isolate the influence of Hijackers, enabling LVLMs to rely on their context-aware intrinsic attention map. Specifically, AID consists of three components: First, Attention Hijackers Detection identifies Attention Hijackers by calculating instruction-driven visual salience. Next, an Attention Disentanglement mechanism is proposed to mask the visual attention of these identified Hijackers, thereby mitigating their disruptive influence on subsequent tokens. Finally, Re-Disentanglement recalculates the balance between instruction-driven and image-driven visual salience to avoid over-masking effects. Extensive experiments demonstrate that AID significantly reduces hallucination across various LVLMs on several benchmarks.
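The abstract describes a three-stage, training-free pipeline. Below is a minimal NumPy mock-up of that flow; the z-score detection rule, the blending weight `alpha`, and all tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def aid_step(attn, n_img, z_thresh=2.0, alpha=0.5):
    """One illustrative AID pass.

    attn  : (n_instr, n_tokens) attention of instruction tokens over all
            tokens, rows summing to 1; image tokens occupy the first
            n_img columns (an assumed layout).
    """
    # 1) Detection: instruction-driven visual salience = attention mass an
    #    instruction token places on image tokens; flag outliers as hijackers.
    salience = attn[:, :n_img].sum(axis=1)
    z = (salience - salience.mean()) / (salience.std() + 1e-8)
    hijackers = np.where(z > z_thresh)[0]

    # 2) Disentanglement: mask the visual attention of detected hijackers
    #    so they stop steering subsequent tokens.
    masked = attn.copy()
    masked[hijackers, :n_img] = 0.0
    masked /= masked.sum(axis=1, keepdims=True) + 1e-8

    # 3) Re-Disentanglement: blend masked and original maps to avoid
    #    over-masking (alpha is a hypothetical balance weight).
    out = alpha * masked + (1 - alpha) * attn
    return out / out.sum(axis=1, keepdims=True), hijackers

rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(32), size=8)   # 8 instruction tokens, 32 tokens total
fixed, hijackers = aid_step(attn, n_img=16)
```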
Related papers
- TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection [6.006482486396196]
We propose Temporal Attention Real-time Accumulative Connection (TARAC) to mitigate hallucinations caused by the decay of attention on image tokens.
We validate TARAC across multiple models and datasets, demonstrating that our approach substantially mitigates hallucinations.
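From the summary alone, the core idea is to accumulate past attention on image tokens and re-inject it at each decoding step to counter attention decay. A minimal sketch of that accumulative connection, with hypothetical decay and injection rates `gamma` and `beta`:

```python
import numpy as np

def tarac_update(curr_attn, accum, n_img, gamma=0.9, beta=0.2):
    """One decoding step of an accumulative attention connection.

    curr_attn : (n_tokens,) current token's attention distribution
    accum     : (n_img,) running accumulation over past steps
    gamma/beta are hypothetical decay and injection rates.
    """
    accum = gamma * accum + curr_attn[:n_img]   # accumulate image attention
    boosted = curr_attn.copy()
    boosted[:n_img] += beta * accum             # re-inject to counter decay
    return boosted / boosted.sum(), accum
```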
arXiv Detail & Related papers (2025-04-05T07:57:11Z)
- Attention Reallocation: Towards Zero-cost and Controllable Hallucination Mitigation of MLLMs [62.9348974370985]
We propose attention reallocation (AttnReal) to mitigate hallucinations with nearly zero extra cost.
Our approach is motivated by the key observation that an MLLM's unreasonable attention distribution causes features to be dominated by historical output tokens.
Based on the observations, AttnReal recycles excessive attention from output tokens and reallocates it to visual tokens, which reduces MLLM's reliance on language priors.
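The summary suggests a concrete recipe: cap the attention spent on historical output tokens and hand the surplus to visual tokens. A hedged sketch, where the cap value and the proportional reallocation rule are assumptions:

```python
import numpy as np

def attn_realloc(attn, vis_idx, out_idx, cap=0.05):
    """Cap attention on historical output tokens and recycle the surplus
    to visual tokens, proportionally to their current attention.

    attn : (n_tokens,) one token's attention distribution
    vis_idx, out_idx : integer index arrays for visual / output tokens
    cap  : hypothetical per-token ceiling, not the paper's value
    """
    attn = attn.copy()
    surplus = np.clip(attn[out_idx] - cap, 0.0, None)
    attn[out_idx] -= surplus                    # recycle excessive attention
    weights = attn[vis_idx] / (attn[vis_idx].sum() + 1e-8)
    attn[vis_idx] += surplus.sum() * weights    # reallocate to visual tokens
    return attn
```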
arXiv Detail & Related papers (2025-03-11T11:52:37Z)
- Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence [69.86946427928511]
We investigate the internal mechanisms driving hallucination in large vision-language models (LVLMs).
We introduce Vision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of attention head outputs to visual context.
We propose Vision-aware Head Reinforcement (VHR), a training-free approach to mitigate hallucination by enhancing the role of vision-aware attention heads.
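One plausible reading of VHD is the change in a head's output when visual context is removed, with VHR then amplifying the most vision-aware heads. The sketch below encodes that reading; the norm-based score and the top-k gain rule are assumptions, not the paper's formulas.

```python
import numpy as np

def vhd_score(head_out_with_img, head_out_no_img):
    """Per-head divergence between outputs computed with and without
    visual context; inputs are (n_heads, d) arrays."""
    return np.linalg.norm(head_out_with_img - head_out_no_img, axis=-1)

def vhr_reinforce(head_outputs, scores, top_k=4, gain=1.2):
    """Amplify the top-k most vision-aware heads (hypothetical VHR rule)."""
    out = head_outputs.copy()
    out[np.argsort(scores)[-top_k:]] *= gain
    return out
```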
arXiv Detail & Related papers (2024-12-18T15:29:30Z)
- Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens [7.806633929976787]
Hallucinations in Large Vision-Language Models (LVLMs) significantly undermine their reliability.
This paper addresses how LVLMs process visual information and whether this process causes hallucination.
We propose a simple inference-time method that adjusts visual attention by integrating information across various heads.
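The summary only says that visual attention is adjusted by integrating information across heads; one way to picture this is blending each head's image-token attention toward the cross-head mean in middle layers. This is entirely an illustrative guess, not the paper's method.

```python
import numpy as np

def integrate_heads(attn, n_img, lam=0.5):
    """Blend each head's image-token attention toward the cross-head mean.

    attn : (n_heads, n_tokens) per-head attention of the current token
    lam  : hypothetical blending weight, applied only in middle layers
    """
    mean_vis = attn[:, :n_img].mean(axis=0, keepdims=True)
    out = attn.copy()
    out[:, :n_img] = (1 - lam) * out[:, :n_img] + lam * mean_vis
    return out / out.sum(axis=1, keepdims=True)
```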
arXiv Detail & Related papers (2024-11-23T03:40:05Z)
- CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs [74.36850397755572]
CATCH addresses issues related to visual defects that cause diminished fine-grained feature perception and cumulative hallucinations in open-ended scenarios.
It is applicable to various visual question-answering tasks without requiring any specific data or prior knowledge, and generalizes robustly to new tasks without additional training.
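CATCH belongs to the contrastive-decoding family; the generic token-level step that family shares is shown below. The complementary and adaptive components specific to CATCH are not described in the summary and are omitted.

```python
import numpy as np

def contrastive_logits(logits_full, logits_degraded, alpha=1.0):
    """Generic token-level contrastive decoding: boost tokens supported by
    the intact image and suppress those favored by the degraded pass."""
    return (1 + alpha) * logits_full - alpha * logits_degraded
```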
arXiv Detail & Related papers (2024-11-19T18:27:31Z)
- Mitigating Object Hallucination via Concentric Causal Attention [71.27325347912823]
We show that object hallucination is closely tied with Rotary Position Embedding (RoPE), a widely adopted positional dependency modeling design.
We propose Concentric Causal Attention (CCA), a simple yet effective positional alignment strategy.
arXiv Detail & Related papers (2024-10-21T11:54:53Z)
- Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models [30.26685485474035]
Large Vision-Language Models (LVLMs) have rapidly advanced in recent years.
The prevalent 'hallucination' problem has emerged as a significant bottleneck.
We propose a simple yet effective method named Self-Introspective Decoding (SID).
arXiv Detail & Related papers (2024-08-04T13:50:17Z)
- Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs [52.497823009176074]
Large Vision-Language Models (LVLMs) often produce responses that misalign with factual information, a phenomenon known as hallucinations.
We introduce Visual Description Grounded Decoding (VDGD), a training-free method designed to enhance visual perception and improve reasoning capabilities in LVLMs.
arXiv Detail & Related papers (2024-05-24T16:21:59Z)