Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models
- URL: http://arxiv.org/abs/2405.17820v1
- Date: Tue, 28 May 2024 04:40:57 GMT
- Title: Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models
- Authors: Sangmin Woo, Donguk Kim, Jaehyuk Jang, Yubin Choi, Changick Kim
- Abstract summary: Excessive attention on a few image tokens, referred to as blind tokens, leads to hallucinatory responses in tasks requiring fine-grained understanding of visual objects.
We find that tokens receiving lower attention weights often hold essential information for identifying nuanced object details.
We introduce a technique called Attentional Vision Calibration (AVC) to counteract the over-emphasis on blind tokens.
- Score: 16.185253476874006
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study addresses the issue observed in Large Vision Language Models (LVLMs), where excessive attention on a few image tokens, referred to as blind tokens, leads to hallucinatory responses in tasks requiring fine-grained understanding of visual objects. We found that tokens receiving lower attention weights often hold essential information for identifying nuanced object details -- ranging from merely recognizing object existence to identifying their attributes (color, position, etc.) and understanding their relationships. To counteract the over-emphasis on blind tokens and to accurately respond to user queries, we introduce a technique called Attentional Vision Calibration (AVC). During the decoding phase, AVC identifies blind tokens by analyzing the image-related attention distribution. It then dynamically adjusts the logits for the next token prediction by contrasting the logits conditioned on the original visual tokens with those conditioned on the blind tokens. This effectively lowers the dependency on blind tokens and promotes a more balanced consideration of all tokens. We validate AVC on benchmarks such as POPE, MME, and AMBER, where it consistently outperforms existing decoding techniques in mitigating object hallucinations in LVLMs.
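To make the decoding-time procedure concrete, the sketch below illustrates the kind of contrastive logit adjustment the abstract describes: blind tokens are detected from the attention mass they receive, and logits conditioned on the full visual context are contrasted against logits conditioned on the blind tokens alone. This is a minimal illustrative sketch, not the authors' implementation; the model interface (a `visual_embeds` argument), the thresholding rule (mean plus one standard deviation), and the contrast weight `alpha` are assumptions made for the example.

```python
import torch

# Minimal AVC-style sketch. The `visual_embeds` interface, the mean + k*std
# blind-token rule, and the contrast weight `alpha` are illustrative
# assumptions, not details taken from the paper.

def find_blind_tokens(image_attn: torch.Tensor, k: float = 1.0) -> torch.Tensor:
    """image_attn: (num_image_tokens,) attention mass received by each image token.
    Marks tokens whose attention is unusually high as 'blind' tokens."""
    return image_attn > image_attn.mean() + k * image_attn.std()

@torch.no_grad()
def avc_decode_step(model, input_ids, image_embeds, image_attn, alpha: float = 1.0):
    """One next-token prediction step.
    image_embeds: (1, num_image_tokens, dim) visual tokens fed to the LVLM."""
    blind_mask = find_blind_tokens(image_attn)

    # Logits conditioned on the original (full) set of visual tokens.
    logits_full = model(input_ids=input_ids, visual_embeds=image_embeds).logits[:, -1, :]

    # Logits conditioned on the over-attended blind tokens only.
    blind_embeds = image_embeds[:, blind_mask, :]
    logits_blind = model(input_ids=input_ids, visual_embeds=blind_embeds).logits[:, -1, :]

    # Contrast the two views: boost evidence supported by the full visual context
    # and penalize what the blind tokens alone would predict.
    adjusted = (1 + alpha) * logits_full - alpha * logits_blind
    return adjusted.argmax(dim=-1)  # greedy choice of the next token
```

In effect, next-token candidates that the blind subset alone would also predict are down-weighted, which lowers the dependency on blind tokens as the abstract outlines.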
Related papers
- Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models [14.739801223002262]
Large Vision-Language Models (LVLMs) still suffer from hallucinations when describing images, generating answers that include non-existent objects.
It is reported that these models tend to over-focus on certain irrelevant image tokens that do not contain critical information for answering the question.
We propose an Instruction-Aligned Visual Attention (IAVA) approach, which identifies irrelevant tokens by comparing changes in attention weights under two different instructions.
arXiv Detail & Related papers (2025-03-24T11:09:06Z) - Attention Hijackers: Detect and Disentangle Attention Hijacking in LVLMs for Hallucination Mitigation [123.54980913741828]
Large Vision-Language Models (LVLMs) remain vulnerable to hallucinations.
We propose a novel, training-free strategy, Attention HIjackers Detection and Disentanglement (AID).
AID identifies Attention Hijackers by calculating instruction-driven visual salience.
Next, an Attention Disentanglement mechanism is proposed to mask the visual attention of these identified Hijackers.
Re-Disentanglement recalculates the balance between instruction-driven and image-driven visual salience to avoid over-masking effects.
arXiv Detail & Related papers (2025-03-11T09:35:55Z) - See What You Are Told: Visual Attention Sink in Large Multimodal Models [4.024850952459758]
Large multimodal models (LMMs) "see" images by leveraging the attention mechanism between text and visual tokens in the transformer decoder.
Recent findings indicate that LMMs have an extraordinary tendency to consistently allocate high attention weights to specific visual tokens.
We introduce Visual Attention Redistribution (VAR), a method that redistributes attention in image-centric heads.
arXiv Detail & Related papers (2025-03-05T09:55:07Z) - Introducing Visual Perception Token into Multimodal Large Language Model [53.82301522384719]
A Multimodal Large Language Model (MLLM) relies on the perception process of its vision encoder.
However, MLLMs still lack the autonomous capability to control their own visual perception processes.
We propose the concept of Visual Perception Token, aiming to empower MLLM with a mechanism to control its visual perception processes.
arXiv Detail & Related papers (2025-02-24T18:56:12Z) - Visual Attention Never Fades: Selective Progressive Attention ReCalibration for Detailed Image Captioning in Multimodal Large Language Models [35.49886398402627]
We propose a training-free method that enhances the contribution of visual tokens during decoding.
Our experiments, incorporating both automated and human evaluations, demonstrate that existing methods improve the precision of MLLMs at the cost of recall.
arXiv Detail & Related papers (2025-02-03T14:58:11Z) - PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model [0.0]
Hallucinations often arise from the progressive weakening of attention weights assigned to visual tokens.
PAINT (Paying Attention to INformed Tokens) is a plug-and-play framework that intervenes in the self-attention mechanism of Large Vision-Language Models.
arXiv Detail & Related papers (2025-01-21T15:22:31Z) - [CLS] Token Tells Everything Needed for Training-free Efficient MLLMs [66.5266435598799]
Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance across a wide range of vision tasks.
However, their efficient deployment remains a substantial challenge due to high computational costs and memory requirements.
We introduce a simple yet effective method for training-free visual token compression, called VTC-compression.
arXiv Detail & Related papers (2024-12-08T05:29:39Z) - FoPru: Focal Pruning for Efficient Large Vision-Language Models [11.36025001578531]
We propose Focal Pruning (FoPru), a training-free method that prunes visual tokens based on the attention-based token significance derived from the vision encoder.
Our method can prune a large number of redundant tokens while maintaining high accuracy, leading to significant improvements in inference efficiency.
arXiv Detail & Related papers (2024-11-21T14:22:38Z) - Mitigating Object Hallucination via Concentric Causal Attention [71.27325347912823]
We show that object hallucination is closely tied with Rotary Position Embedding (RoPE), a widely adopted positional dependency modeling design.
We propose Concentric Causal Attention (CCA), a simple yet effective positional alignment strategy.
arXiv Detail & Related papers (2024-10-21T11:54:53Z) - KNN Transformer with Pyramid Prompts for Few-Shot Learning [52.735070934075736]
Few-Shot Learning aims to recognize new classes with limited labeled data.
Recent studies have attempted to address the challenge of rare samples with textual prompts to modulate visual features.
arXiv Detail & Related papers (2024-10-14T07:39:30Z) - VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach to reduce vision compute by letting redundant vision tokens "skip layers" rather than decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
arXiv Detail & Related papers (2024-08-29T17:21:58Z) - ToSA: Token Selective Attention for Efficient Vision Transformers [50.13756218204456]
ToSA is a token selective attention approach that can identify tokens that need to be attended to, as well as those that can skip a transformer layer.
We show that ToSA can significantly reduce computation costs while maintaining accuracy on the ImageNet classification benchmark.
arXiv Detail & Related papers (2024-06-13T05:17:21Z) - LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models [35.88374542519597]
Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model.
Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which further increases the number of visual tokens significantly.
We propose PruMerge, a novel adaptive visual token reduction strategy that significantly reduces the number of visual tokens without compromising the performance of LMMs.
arXiv Detail & Related papers (2024-03-22T17:59:52Z) - Visual Concepts Tokenization [65.61987357146997]
We propose an unsupervised transformer-based Visual Concepts Tokenization framework, dubbed VCT, to perceive an image into a set of disentangled visual concept tokens.
To obtain these concept tokens, we only use cross-attention to extract visual information from the image tokens layer by layer without self-attention between concept tokens.
We further propose a Concept Disentangling Loss to facilitate that different concept tokens represent independent visual concepts.
arXiv Detail & Related papers (2022-05-20T11:25:31Z) - TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? [89.17394772676819]
We introduce a novel visual representation learning approach which relies on a handful of adaptively learned tokens.
Our experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition tasks.
arXiv Detail & Related papers (2021-06-21T17:55:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.