Attention Reallocation: Towards Zero-cost and Controllable Hallucination Mitigation of MLLMs
- URL: http://arxiv.org/abs/2503.08342v2
- Date: Wed, 12 Mar 2025 04:18:48 GMT
- Title: Attention Reallocation: Towards Zero-cost and Controllable Hallucination Mitigation of MLLMs
- Authors: Chongjun Tu, Peng Ye, Dongzhan Zhou, Lei Bai, Gang Yu, Tao Chen, Wanli Ouyang
- Abstract summary: We propose attention reallocation (AttnReal) to mitigate hallucinations with nearly zero extra cost. Our approach is motivated by the key observation that an MLLM's unreasonable attention distribution causes features to be dominated by historical output tokens. Based on this observation, AttnReal recycles excessive attention from output tokens and reallocates it to visual tokens, which reduces the MLLM's reliance on language priors.
- Score: 62.9348974370985
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-Modal Large Language Models (MLLMs) stand out in various tasks but still struggle with hallucinations. While recent training-free mitigation methods mostly introduce additional inference overhead via retrospection strategies and contrastive decoding, we propose attention reallocation (AttnReal) to mitigate hallucinations with nearly zero extra cost. Our approach is motivated by the key observation that an MLLM's unreasonable attention distribution causes features to be dominated by historical output tokens, which further contributes to hallucinated responses because of the distribution gap between different token types. Based on this observation, AttnReal recycles excessive attention from output tokens and reallocates it to visual tokens, which reduces the MLLM's reliance on language priors and ensures the decoding process depends more on the visual inputs. More interestingly, we find that by controlling the intensity of AttnReal, we can achieve a wide-range trade-off between response faithfulness and overall performance. Comprehensive results from different benchmarks validate the effectiveness of AttnReal across six open-source MLLMs and three decoding strategies.
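The reallocation idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact rule, so the code below assumes one simple variant in which a fraction `intensity` of the post-softmax attention mass on output tokens is removed and handed to visual tokens in proportion to their existing weights. The function name `reallocate_attention`, the token-type labels, and the `intensity` parameter are all hypothetical names chosen for illustration.

```python
from typing import List, Sequence


def reallocate_attention(
    weights: Sequence[float],
    token_types: Sequence[str],
    intensity: float,
) -> List[float]:
    """Move a fraction of attention mass from output tokens to visual tokens.

    Assumed rule (not taken from the paper): `weights` is one row of
    post-softmax attention, `token_types` labels each position as
    "visual", "output", or anything else (left untouched), and
    `intensity` in [0, 1] is the fraction of output-token mass recycled.
    """
    if len(weights) != len(token_types):
        raise ValueError("weights and token_types must have the same length")
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")

    visual_mass = sum(w for w, t in zip(weights, token_types) if t == "visual")
    output_mass = sum(w for w, t in zip(weights, token_types) if t == "output")

    # Nothing to recycle, or nowhere to put it: return the row unchanged.
    if visual_mass == 0.0 or output_mass == 0.0:
        return list(weights)

    recycled = intensity * output_mass
    # Scaling every visual weight by the same factor preserves their ratios.
    visual_scale = 1.0 + recycled / visual_mass

    result = []
    for w, t in zip(weights, token_types):
        if t == "output":
            result.append(w * (1.0 - intensity))
        elif t == "visual":
            result.append(w * visual_scale)
        else:
            result.append(w)
    return result


row = [0.1, 0.1, 0.2, 0.6]
types = ["visual", "visual", "text", "output"]
new_row = reallocate_attention(row, types, intensity=0.5)
```

Because the recycled mass is moved rather than discarded, the row still sums to the same total, and no extra forward pass is needed. Setting `intensity` to 0 leaves the attention unchanged, while larger values shift more weight toward the visual tokens, which mirrors the controllable trade-off the abstract describes.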
Related papers
- Modality Bias in LVLMs: Analyzing and Mitigating Object Hallucination via Attention Lens [0.0]
Large vision-language models (LVLMs) have demonstrated remarkable multimodal comprehension and reasoning capabilities. LVLMs tend to over-rely on textual prompts and internal knowledge of large language models, generating descriptions inconsistent with visual cues. We propose a training-free method to mitigate object hallucination.
arXiv Detail & Related papers (2025-08-04T13:40:59Z) - Token Activation Map to Visually Explain Multimodal LLMs [23.774995444587667]
We propose an estimated causal inference method to mitigate the interference of context to achieve high-quality MLLM explanation. We term this method Token Activation Map (TAM) to highlight the consideration of interactions between tokens. Our TAM method significantly outperforms existing SoTA methods, showcasing high-quality visualization results.
arXiv Detail & Related papers (2025-06-29T14:50:45Z) - Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding [33.33247964758369]
We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in the decoding strategy, we propose to leverage causal masks to establish information propagation between multimodal tokens. We present FarSight, a versatile plug-and-play decoding strategy to reduce attention interference from outlier tokens.
arXiv Detail & Related papers (2025-05-22T13:19:57Z) - Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks.
LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content.
We propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
arXiv Detail & Related papers (2025-01-03T17:56:28Z) - Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning [151.4060202671114]
Multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing vision-language tasks. This paper introduces a novel bottom-up reasoning framework to address hallucinations in MLLMs. Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge.
arXiv Detail & Related papers (2024-12-15T09:10:46Z) - Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings [66.04061083611863]
Excessive use of visual tokens in existing Multimodal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation. We propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer.
arXiv Detail & Related papers (2024-11-29T11:24:23Z) - Mitigating Modality Prior-Induced Hallucinations in Multimodal Large Language Models via Deciphering Attention Causality [20.41579586967349]
Multimodal Large Language Models (MLLMs) have emerged as a central focus in both industry and academia.
MLLMs often suffer from biases introduced by visual and language priors, which can lead to multimodal hallucination.
We propose a causal inference framework termed CausalMM that applies structural causal modeling to MLLMs.
arXiv Detail & Related papers (2024-10-07T06:45:22Z) - Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the underlying Large Language Models (LLMs) prior rather than the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z) - OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation [124.9008419182485]
We present OPERA, a novel MLLM decoding method grounded in an Over-trust Penalty and a Retrospection-Allocation strategy.
Our approach begins with the interesting observation that most hallucinations are closely tied to the knowledge aggregation patterns in the self-attention matrix.
Based on the observation, OPERA introduces a penalty term on the model logits during the beam-search decoding to mitigate the over-trust issue.
arXiv Detail & Related papers (2023-11-29T18:57:07Z) - Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus [99.33091772494751]
Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields.
LLMs are prone to hallucinate untruthful or nonsensical outputs that fail to meet user expectations.
We propose a novel reference-free, uncertainty-based method for detecting hallucinations in LLMs.
arXiv Detail & Related papers (2023-11-22T08:39:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.