Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models
- URL: http://arxiv.org/abs/2405.17820v2
- Date: Sat, 31 May 2025 02:41:34 GMT
- Title: Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models
- Authors: Sangmin Woo, Donguk Kim, Jaehyuk Jang, Yubin Choi, Changick Kim
- Abstract summary: Large Vision Language Models (LVLMs) demonstrate strong capabilities in visual understanding and description, yet often suffer from hallucinations. We propose Attentional Vision Calibration (AvisC), a test-time approach that recalibrates the influence of blind tokens without modifying the underlying attention mechanism. Experiments on standard benchmarks, including POPE, MME, and AMBER, demonstrate that AvisC effectively reduces hallucinations in LVLMs.
- Score: 16.185253476874006
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Vision Language Models (LVLMs) demonstrate strong capabilities in visual understanding and description, yet often suffer from hallucinations, attributing incorrect or misleading features to images. We observe that LVLMs disproportionately focus on a small subset of image tokens--termed blind tokens--which are typically irrelevant to the query (e.g., background or non-object regions). We hypothesize that such attention misalignment plays a key role in generating hallucinated responses. To mitigate this issue, we propose Attentional Vision Calibration (AvisC), a test-time approach that dynamically recalibrates the influence of blind tokens without modifying the underlying attention mechanism. AvisC first identifies blind tokens by analyzing layer-wise attention distributions over image tokens, then employs a contrastive decoding strategy to balance the influence of original and blind-token-biased logits. Experiments on standard benchmarks, including POPE, MME, and AMBER, demonstrate that AvisC effectively reduces hallucinations in LVLMs.
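The abstract describes a two-step recipe: pick out over-attended "blind" image tokens from layer-wise attention, then contrast the original logits against logits biased toward those tokens. The sketch below is a minimal, hypothetical PyTorch illustration of that general idea; the function names, the top-p selection rule, and the contrastive weighting are assumptions, not the authors' released implementation.

```python
import torch

def select_blind_tokens(attn_over_image, top_p=0.1):
    """Pick the image tokens that absorb a disproportionate share of attention.

    attn_over_image: (num_image_tokens,) attention mass per image token,
    e.g. averaged over layers, heads, and query positions (an assumption here).
    Returns a boolean mask marking the top-p fraction as 'blind' tokens.
    """
    k = max(1, int(top_p * attn_over_image.numel()))
    idx = torch.topk(attn_over_image, k).indices
    mask = torch.zeros_like(attn_over_image, dtype=torch.bool)
    mask[idx] = True
    return mask

def contrastive_logits(logits_orig, logits_blind, alpha=1.0):
    """Contrastive decoding step: steer the prediction away from the
    blind-token-biased distribution while keeping the original one."""
    return (1 + alpha) * logits_orig - alpha * logits_blind

# Toy usage with random stand-ins for real model outputs.
num_image_tokens, vocab_size = 576, 32000
attn = torch.rand(num_image_tokens)
blind_mask = select_blind_tokens(attn, top_p=0.1)

logits_orig = torch.randn(vocab_size)   # logits from the full input
logits_blind = torch.randn(vocab_size)  # logits from a blind-token-biased pass
next_token = contrastive_logits(logits_orig, logits_blind, alpha=1.5).argmax()
print(blind_mask.sum().item(), next_token.item())
```

In practice, the blind-token-biased logits would come from a second forward pass in which attention is skewed toward the selected blind tokens; the exact skewing used by AvisC is not reproduced here.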
Related papers
- CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models [60.0300765815417]
Large Vision-Language Models (LVLMs) frequently produce content that deviates from visual information, leading to object hallucination.
We propose Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play hallucination mitigation method.
arXiv Detail & Related papers (2025-06-30T07:52:36Z) - Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs [42.871396640891334]
Large Vision-Language Models (LVLMs) are susceptible to hallucinations.
We introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy.
We show that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
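As a rough intuition for what a C-PMI-style calibration does, next-token scores can be adjusted by how much the image changes the model's distribution. The snippet below is an illustrative sketch under that assumption, not the paper's exact formulation; `lam` is a hypothetical blending weight.

```python
import torch
import torch.nn.functional as F

def pmi_calibrated_scores(logits_with_image, logits_text_only, lam=0.5):
    """Reweight next-token scores by a pointwise-mutual-information-style term,
    log p(y | image, text) - log p(y | text), blended with the original
    image-conditioned log-probabilities via a hypothetical weight `lam`."""
    logp_img = F.log_softmax(logits_with_image, dim=-1)
    logp_txt = F.log_softmax(logits_text_only, dim=-1)
    pmi = logp_img - logp_txt  # how strongly the image supports each candidate
    return logp_img + lam * pmi

vocab_size = 32000
scores = pmi_calibrated_scores(torch.randn(vocab_size), torch.randn(vocab_size))
print(scores.argmax().item())
```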
arXiv Detail & Related papers (2025-05-26T08:36:10Z) - Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression [6.838584336878126]
Large vision language models (LVLMs) often suffer from hallucinations, generating texts misaligned with the visual context.
Existing methods aimed at reducing hallucinations through inference-time intervention incur a significant increase in latency.
We present SPIN, a task-agnostic attention-guided head suppression strategy that can be seamlessly integrated during inference.
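A head-suppression step of this kind can be pictured as ranking heads by the attention mass they place on image tokens and damping the lowest-ranked ones. The following is a hypothetical sketch of that idea; the keep ratio, the hard zeroing, and the per-step granularity are assumptions rather than SPIN's actual procedure.

```python
import torch

def suppress_low_image_heads(head_outputs, image_attn_per_head, keep_ratio=0.75):
    """Zero out the heads that attend least to image tokens at this step.

    head_outputs: (num_heads, head_dim) per-head outputs before merging.
    image_attn_per_head: (num_heads,) attention mass each head places on
    image tokens (how this is measured is an assumption of the sketch).
    """
    num_heads = head_outputs.size(0)
    k = max(1, int(keep_ratio * num_heads))
    keep = torch.topk(image_attn_per_head, k).indices
    mask = torch.zeros(num_heads, dtype=head_outputs.dtype)
    mask[keep] = 1.0
    return head_outputs * mask.unsqueeze(-1)

outputs = suppress_low_image_heads(torch.randn(32, 128), torch.rand(32))
print(outputs.abs().sum(dim=-1).gt(0).sum().item())  # number of surviving heads
```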
arXiv Detail & Related papers (2025-05-22T09:00:57Z) - Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models [14.739801223002262]
Large Vision-Language Models (LVLMs) still suffer from hallucinations when describing images, generating answers that include non-existent objects.
It is reported that these models tend to over-focus on certain irrelevant image tokens that do not contain critical information for answering the question.
We propose an Instruction-Aligned Visual Attention (IAVA) approach, which identifies irrelevant tokens by comparing changes in attention weights under two different instructions.
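One plausible reading of this comparison is sketched below: image tokens whose attention barely moves when the real instruction is swapped for a neutral one are treated as irrelevant. The threshold, the neutral-instruction choice, and the absolute-difference criterion are assumptions for illustration, not IAVA's actual definition.

```python
import torch

def irrelevant_image_tokens(attn_task, attn_neutral, tau=0.05):
    """Flag image tokens whose attention barely changes between the task
    instruction and a neutral instruction; such tokens are assumed to carry
    no question-specific information. Both inputs: (num_image_tokens,)."""
    return (attn_task - attn_neutral).abs() < tau

attn_task = torch.rand(576)     # attention over image tokens, real instruction
attn_neutral = torch.rand(576)  # attention over image tokens, neutral prompt
print(irrelevant_image_tokens(attn_task, attn_neutral).sum().item())
```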
arXiv Detail & Related papers (2025-03-24T11:09:06Z) - Attention Reallocation: Towards Zero-cost and Controllable Hallucination Mitigation of MLLMs [62.9348974370985]
We propose attention reallocation (AttnReal) to mitigate hallucinations with nearly zero extra cost.
Our approach is motivated by the key observation that the MLLM's unreasonable attention distribution causes features to be dominated by historical output tokens.
Based on this observation, AttnReal recycles excessive attention from output tokens and reallocates it to visual tokens, which reduces the MLLM's reliance on language priors.
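The recycling step can be illustrated on a single attention row: take part of the mass assigned to historical output tokens and hand it to the visual tokens, keeping the row normalized. The sketch below makes that concrete under assumed index layouts and a hypothetical `recycle_ratio`; it is not the paper's reference code.

```python
import torch

def reallocate_attention(attn_row, visual_idx, output_idx, recycle_ratio=0.5):
    """Move a fraction of the attention mass on historical output tokens onto
    visual tokens, proportionally to their current weights, keeping the row
    normalized. attn_row: (seq_len,) one query position's attention."""
    attn = attn_row.clone()
    recycled = attn[output_idx].sum() * recycle_ratio
    attn[output_idx] *= (1.0 - recycle_ratio)
    vis = attn[visual_idx]
    attn[visual_idx] = vis + recycled * vis / vis.sum().clamp_min(1e-8)
    return attn / attn.sum()

row = torch.softmax(torch.randn(64), dim=-1)          # toy 64-token sequence
new_row = reallocate_attention(row,
                               visual_idx=torch.arange(0, 32),
                               output_idx=torch.arange(48, 64))
print(round(float(new_row.sum()), 4))                 # still sums to 1.0
```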
arXiv Detail & Related papers (2025-03-11T11:52:37Z) - Attention Hijackers: Detect and Disentangle Attention Hijacking in LVLMs for Hallucination Mitigation [123.54980913741828]
Large Vision-Language Models (LVLMs) remain vulnerable to hallucinations.
We propose a novel, training-free strategy, Attention HIjackers Detection and Disentanglement (AID).
AID identifies Attention Hijackers by calculating instruction-driven visual salience.
Next, an Attention Disentanglement mechanism is proposed to mask the visual attention of these identified Hijackers.
Re-Disentanglement recalculates the balance between instruction-driven and image-driven visual salience to avoid over-masking effects.
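A very rough picture of the detect-then-mask pipeline is given below: treat instruction tokens with outlier visual salience as hijackers and zero their attention over image tokens. The outlier test, the z-score threshold, and the omission of the Re-Disentanglement rebalancing are simplifications assumed for illustration.

```python
import torch

def detect_hijackers(instr_to_image_attn, z_thresh=2.0):
    """Flag instruction tokens whose total attention over image tokens is an
    outlier -- a crude stand-in for 'instruction-driven visual salience'.
    instr_to_image_attn: (num_instr_tokens, num_image_tokens)."""
    salience = instr_to_image_attn.sum(dim=-1)
    z = (salience - salience.mean()) / salience.std().clamp_min(1e-8)
    return z > z_thresh

def mask_hijacker_attention(instr_to_image_attn, hijackers):
    """Zero the visual attention of detected hijackers; the full method would
    then rebalance instruction- vs. image-driven salience (Re-Disentanglement),
    which this sketch omits."""
    attn = instr_to_image_attn.clone()
    attn[hijackers] = 0.0
    return attn

attn = torch.rand(20, 576)                # 20 instruction tokens, 576 image tokens
hijackers = detect_hijackers(attn)
print(hijackers.sum().item(), mask_hijacker_attention(attn, hijackers).shape)
```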
arXiv Detail & Related papers (2025-03-11T09:35:55Z) - See What You Are Told: Visual Attention Sink in Large Multimodal Models [4.024850952459758]
Large multimodal models (LMMs) "see" images by leveraging the attention mechanism between text and visual tokens in the transformer decoder.
Recent findings indicate that LMMs have an extraordinary tendency to consistently allocate high attention weights to specific visual tokens.
We introduce Visual Attention Redistribution (VAR), a method that redistributes attention in image-centric heads.
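The sketch below illustrates one way such a redistribution could look: select heads that place enough mass on visual tokens, then move the mass sitting on assumed "sink" visual tokens onto the remaining visual tokens. The head threshold and the sink indices are hypothetical; VAR's actual selection criteria are not reproduced here.

```python
import torch

def image_centric_heads(attn, visual_idx, thresh=0.2):
    """attn: (num_heads, seq_len) attention of the current query, per head.
    A head counts as 'image-centric' if at least `thresh` of its mass falls
    on visual tokens; redistribution is applied only inside such heads."""
    return attn[:, visual_idx].sum(dim=-1) > thresh

def redistribute(attn, visual_idx, sink_idx, heads):
    """Within the selected heads, move the mass sitting on 'sink' visual tokens
    onto the remaining visual tokens, proportionally to their weights."""
    attn = attn.clone()
    sink_set = set(sink_idx.tolist())
    keep = torch.tensor([i for i in visual_idx.tolist() if i not in sink_set])
    for h in torch.nonzero(heads).flatten().tolist():
        budget = attn[h, sink_idx].sum()
        attn[h, sink_idx] = 0.0
        vis = attn[h, keep]
        attn[h, keep] = vis + budget * vis / vis.sum().clamp_min(1e-8)
    return attn

attn = torch.softmax(torch.randn(32, 64), dim=-1)     # 32 heads, 64 positions
visual_idx, sink_idx = torch.arange(0, 32), torch.tensor([3, 17])
heads = image_centric_heads(attn, visual_idx)
out = redistribute(attn, visual_idx, sink_idx, heads)
print(heads.sum().item(), round(float(out[0].sum()), 4))  # rows still sum to ~1
```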
arXiv Detail & Related papers (2025-03-05T09:55:07Z) - Introducing Visual Perception Token into Multimodal Large Language Model [53.82301522384719]
Multimodal Large Language Models (MLLMs) rely on the perception process of their vision encoders.
However, MLLMs still lack the autonomous capability to control their own visual perception processes.
We propose the concept of the Visual Perception Token, aiming to empower MLLMs with a mechanism to control their visual perception processes.
arXiv Detail & Related papers (2025-02-24T18:56:12Z) - Visual Attention Never Fades: Selective Progressive Attention ReCalibration for Detailed Image Captioning in Multimodal Large Language Models [35.49886398402627]
We propose a training-free method that enhances the contribution of visual tokens during decoding.
Our experiments, incorporating both automated and human evaluations, demonstrate that existing methods improve the precision of MLLMs at the cost of recall.
arXiv Detail & Related papers (2025-02-03T14:58:11Z) - MINT: Mitigating Hallucinations in Large Vision-Language Models via Token Reduction [6.416957959150438]
Hallucinations hinder the application of Large Vision-Language Models (LVLMs) in domains that require high reliability.
We propose MINT (MItigating hallucinations via tokeN reducTion), a training-free decoding strategy.
Our approach achieves a 4% improvement in mitigating hallucinations caused by distracted perception compared to the original models.
arXiv Detail & Related papers (2025-02-02T08:34:57Z) - PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model [0.0]
Hallucinations often arise from the progressive weakening of attention weights on visual tokens.
PAINT (Paying Attention to INformed Tokens) is a plug-and-play framework that intervenes in the self-attention mechanism of Large Vision Language Models.
arXiv Detail & Related papers (2025-01-21T15:22:31Z) - Towards a Systematic Evaluation of Hallucinations in Large-Vision Language Models [57.58426038241812]
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance in complex multimodal tasks.
These models still suffer from hallucinations when required to implicitly recognize or infer diverse visual entities from images.
We propose a novel visual question answering (VQA) benchmark that employs contextual reasoning prompts as hallucination attacks.
arXiv Detail & Related papers (2024-12-29T23:56:01Z) - Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence [69.86946427928511]
We investigate the internal mechanisms driving hallucination in large vision-language models (LVLMs).
We introduce Vision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of attention head outputs to visual context.
We propose Vision-aware Head Reinforcement (VHR), a training-free approach to mitigate hallucination by enhancing the role of vision-aware attention heads.
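The divergence idea can be approximated by comparing each head's output with and without visual context and boosting the heads that change most. The cosine-based score and the linear reinforcement below are stand-ins chosen for illustration, not the paper's definitions of VHD and VHR.

```python
import torch
import torch.nn.functional as F

def head_divergence(head_out_with_img, head_out_without_img):
    """A rough per-head score for how much the output changes when the visual
    context is removed (1 - cosine similarity). Inputs: (num_heads, head_dim)."""
    return 1.0 - F.cosine_similarity(head_out_with_img, head_out_without_img, dim=-1)

def reinforce_vision_aware_heads(head_outputs, scores, beta=0.5):
    """Scale up the heads with the highest divergence before merging them."""
    weights = 1.0 + beta * scores / scores.max().clamp_min(1e-8)
    return head_outputs * weights.unsqueeze(-1)

with_img, without_img = torch.randn(32, 128), torch.randn(32, 128)
scores = head_divergence(with_img, without_img)
print(reinforce_vision_aware_heads(with_img, scores).shape)
```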
arXiv Detail & Related papers (2024-12-18T15:29:30Z) - [CLS] Token Tells Everything Needed for Training-free Efficient MLLMs [66.5266435598799]
Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance across a wide range of vision tasks.
However, their efficient deployment remains a substantial challenge due to high computational costs and memory requirements.
We introduce a simple yet effective method for training-free visual compression, called VTC-compression.
arXiv Detail & Related papers (2024-12-08T05:29:39Z) - FoPru: Focal Pruning for Efficient Large Vision-Language Models [11.36025001578531]
We propose Focal Pruning (FoPru), a training-free method that prunes visual tokens based on the attention-based token significance derived from the vision encoder.
Our method can prune a large number of redundant tokens while maintaining high accuracy, leading to significant improvements in inference efficiency.
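A focal-pruning step of this flavor reduces to ranking visual tokens by an attention-derived significance score (here, hypothetically, the vision encoder's [CLS] attention) and keeping the top fraction in their original order; the sketch below assumes exactly that and is not FoPru's published procedure.

```python
import torch

def focal_prune(visual_tokens, significance, keep_ratio=0.25):
    """Keep only the visual tokens with the highest significance scores,
    preserving their original order.
    visual_tokens: (num_tokens, dim); significance: (num_tokens,)."""
    k = max(1, int(keep_ratio * visual_tokens.size(0)))
    idx = torch.topk(significance, k).indices.sort().values
    return visual_tokens[idx], idx

tokens = torch.randn(576, 1024)       # e.g. one image's visual tokens
cls_attn = torch.rand(576)            # stand-in for [CLS]-derived attention
kept, idx = focal_prune(tokens, cls_attn, keep_ratio=0.25)
print(kept.shape, idx[:5].tolist())
```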
arXiv Detail & Related papers (2024-11-21T14:22:38Z) - CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs [74.36850397755572]
CATCH addresses issues related to visual defects that cause diminished fine-grained feature perception and cumulative hallucinations in open-ended scenarios.
It is applicable to various visual question-answering tasks without requiring any specific data or prior knowledge, and generalizes robustly to new tasks without additional training.
arXiv Detail & Related papers (2024-11-19T18:27:31Z) - Mitigating Object Hallucination via Concentric Causal Attention [71.27325347912823]
We show that object hallucination is closely tied with Rotary Position Embedding (RoPE), a widely adopted positional dependency modeling design.
We propose Concentric Causal Attention (CCA), a simple yet effective positional alignment strategy.
arXiv Detail & Related papers (2024-10-21T11:54:53Z) - KNN Transformer with Pyramid Prompts for Few-Shot Learning [52.735070934075736]
Few-Shot Learning aims to recognize new classes with limited labeled data.
Recent studies have attempted to address the challenge of rare samples with textual prompts to modulate visual features.
arXiv Detail & Related papers (2024-10-14T07:39:30Z) - VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach to reducing vision compute by letting redundant vision tokens "skip layers" rather than decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
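Mixture-of-depths routing for vision tokens can be sketched as a per-layer top-k decision: only the highest-scoring tokens are processed by the layer, and the rest ride the residual stream. The router, its scores, and the keep ratio below are hypothetical placeholders rather than VideoLLM-MoD's actual design.

```python
import torch

def route_vision_tokens(router_logits, keep_ratio=0.25):
    """Mixture-of-depths-style routing for one layer: only the top-scoring
    fraction of vision tokens is processed by the layer's attention/MLP;
    the rest bypass it through the residual stream unchanged.
    router_logits: (num_vision_tokens,) per-token scores for this layer."""
    n = router_logits.numel()
    k = max(1, int(keep_ratio * n))
    process_idx = torch.topk(router_logits, k).indices
    skip_mask = torch.ones(n, dtype=torch.bool)
    skip_mask[process_idx] = False
    return process_idx, torch.nonzero(skip_mask).flatten()

proc, skip = route_vision_tokens(torch.randn(576), keep_ratio=0.25)
print(proc.numel(), skip.numel())     # 144 processed, 432 skip this layer
```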
arXiv Detail & Related papers (2024-08-29T17:21:58Z) - ToSA: Token Selective Attention for Efficient Vision Transformers [50.13756218204456]
ToSA is a token selective attention approach that can identify the tokens that need to be attended to, as well as those that can skip a transformer layer.
We show that ToSA can significantly reduce computation costs while maintaining accuracy on the ImageNet classification benchmark.
arXiv Detail & Related papers (2024-06-13T05:17:21Z) - LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models [35.88374542519597]
Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model.
Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which further increases the number of visual tokens significantly.
We propose PruMerge, a novel adaptive visual token reduction strategy that significantly reduces the number of visual tokens without compromising the performance of LMMs.
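A prune-and-merge reduction can be sketched as keeping the most important tokens and folding each discarded token into its nearest kept neighbor. The importance scores, the fixed keep ratio, and the simple averaging below are assumptions for illustration; PruMerge's adaptive selection is more involved.

```python
import torch
import torch.nn.functional as F

def prune_and_merge(visual_tokens, importance, keep_ratio=0.25):
    """Keep the most important visual tokens and fold each discarded token
    into its most similar kept token, so no token is dropped outright.
    visual_tokens: (n, d); importance: (n,)."""
    n = visual_tokens.size(0)
    k = max(1, int(keep_ratio * n))
    keep_idx = torch.topk(importance, k).indices
    keep_set = set(keep_idx.tolist())
    drop_idx = torch.tensor([i for i in range(n) if i not in keep_set])

    kept, dropped = visual_tokens[keep_idx].clone(), visual_tokens[drop_idx]
    sims = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).T  # (n-k, k)
    nearest = sims.argmax(dim=-1)
    for j in range(k):  # average each kept token with the tokens merged into it
        group = dropped[nearest == j]
        if group.numel() > 0:
            kept[j] = (kept[j] + group.sum(dim=0)) / (1 + group.size(0))
    return kept

print(prune_and_merge(torch.randn(576, 1024), torch.rand(576)).shape)
```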
arXiv Detail & Related papers (2024-03-22T17:59:52Z) - Visual Concepts Tokenization [65.61987357146997]
We propose an unsupervised transformer-based Visual Concepts Tokenization framework, dubbed VCT, to perceive an image into a set of disentangled visual concept tokens.
To obtain these concept tokens, we only use cross-attention to extract visual information from the image tokens layer by layer without self-attention between concept tokens.
We further propose a Concept Disentangling Loss to encourage different concept tokens to represent independent visual concepts.
arXiv Detail & Related papers (2022-05-20T11:25:31Z) - TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? [89.17394772676819]
We introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens.
Our experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition tasks.
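The core of the idea is a small module that learns a few attention maps over the spatial grid and uses each map to pool the feature map into one token. The module below is a minimal sketch in that spirit; the original TokenLearner uses convolutional attention maps and other details not reproduced here.

```python
import torch
import torch.nn as nn

class TokenLearnerSketch(nn.Module):
    """Learn S attention maps over the spatial grid and use each map to pool
    the feature map into one token, yielding S tokens in total."""

    def __init__(self, dim, num_tokens=8):
        super().__init__()
        self.to_maps = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_tokens))

    def forward(self, x):                              # x: (batch, h*w, dim)
        maps = self.to_maps(x).softmax(dim=1)          # (batch, h*w, S)
        return torch.einsum("bnd,bns->bsd", x, maps)   # (batch, S, dim)

tokens = TokenLearnerSketch(dim=768, num_tokens=8)(torch.randn(2, 196, 768))
print(tokens.shape)  # torch.Size([2, 8, 768])
```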
arXiv Detail & Related papers (2021-06-21T17:55:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.