Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models
- URL: http://arxiv.org/abs/2503.18556v1
- Date: Mon, 24 Mar 2025 11:09:06 GMT
- Title: Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models
- Authors: Bin Li, Dehong Gao, Yeyuan Wang, Linbo Jin, Shanqing Yu, Xiaoyan Cai, Libin Yang,
- Abstract summary: Large Vision-Language models (LVLMs) still suffer hallucinations when describing images, generating answers that include non-existent objects.<n>It is reported that these models tend to over-focus on certain irrelevant image tokens that do not contain critical information for answering the question.<n>We propose an Instruction-Aligned Visual Attention(IAVA) approach, which identifies irrelevant tokens by comparing changes in attention weights under two different instructions.
- Score: 14.739801223002262
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the significant success of Large Vision-Language models(LVLMs), these models still suffer hallucinations when describing images, generating answers that include non-existent objects. It is reported that these models tend to over-focus on certain irrelevant image tokens that do not contain critical information for answering the question and distort the output. To address this, we propose an Instruction-Aligned Visual Attention(IAVA) approach, which identifies irrelevant tokens by comparing changes in attention weights under two different instructions. By applying contrastive decoding, we dynamically adjust the logits generated from original image tokens and irrelevant image tokens, reducing the model's over-attention to irrelevant information. The experimental results demonstrate that IAVA consistently outperforms existing decoding techniques on benchmarks such as MME, POPE, and TextVQA in mitigating object hallucinations. Our IAVA approach is available online at https://github.com/Lee-lab558/IAVA.
Related papers
- Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images [7.823336661261962]
Large Vision-Language Models (VLMs) tend to neglect image content and over-rely on language-model priors.<n>We propose S-VCO (Symmetrical Visual Contrastive Optimization), a novel finetuning objective that steers the model toward capturing important visual details.
arXiv Detail & Related papers (2025-02-19T18:05:42Z) - Rethinking Homogeneity of Vision and Text Tokens in Large Vision-and-Language Models [29.611769371733672]
We propose De Attention (D-Attn), a novel method that processes visual and textual embeddings differently.
D-Attn diagonalizes visual-to-visual self-attention, reducing computation from $mathcalO(|V|2)$ to $mathcalO(|V|)$ for $|V|$ visual embeddings without compromising performance.
arXiv Detail & Related papers (2025-02-04T00:46:11Z) - PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model [0.0]
hallucinations often arise from the progressive weakening of attention weight to visual tokens.
textbfPAINT (textbfPaying textbfAttention to textbfINformed textbfTokens) is a plug-and-play framework that intervenes in the self-attention mechanism of the Large Vision Language Models.
arXiv Detail & Related papers (2025-01-21T15:22:31Z) - Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks.
LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension.
We propose LACING to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG)
arXiv Detail & Related papers (2024-11-21T16:33:30Z) - Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions [31.637204677787576]
We introduce Knowledge Adapted (KnowAda) fine-tuning, a data-centric approach that automatically adapts training data with the model's existing knowledge and visual understanding.<n>KnowAda minimizes hallucinations while preserving high descriptiveness.<n>Our results show that KnowAda outperforms various baselines in both automatic metrics and human evaluations.
arXiv Detail & Related papers (2024-11-13T20:50:04Z) - KNN Transformer with Pyramid Prompts for Few-Shot Learning [52.735070934075736]
Few-Shot Learning aims to recognize new classes with limited labeled data.
Recent studies have attempted to address the challenge of rare samples with textual prompts to modulate visual features.
arXiv Detail & Related papers (2024-10-14T07:39:30Z) - Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z) - Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models [16.185253476874006]
Excessive attention on a few image tokens, referred to as blind tokens, leads to hallucinatory responses in tasks requiring fine-grained understanding of visual objects.
We find that tokens receiving lower attention weights often hold essential information for identifying nuanced object details.
We introduce a technique called Attentional Vision (AVC) to counteract the over-emphasis on blind tokens.
arXiv Detail & Related papers (2024-05-28T04:40:57Z) - Aligning Modalities in Vision Large Language Models via Preference
Fine-tuning [67.62925151837675]
In this work, we frame the hallucination problem as an alignment issue, tackle it with preference tuning.
Specifically, we propose POVID to generate feedback data with AI models.
We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data.
In experiments across broad benchmarks, we show that we can not only reduce hallucinations, but improve model performance across standard benchmarks, outperforming prior approaches.
arXiv Detail & Related papers (2024-02-18T00:56:16Z) - Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments.
arXiv Detail & Related papers (2022-07-25T17:58:16Z) - Align before Fuse: Vision and Language Representation Learning with
Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to representations BEfore Fusing (ALBEF) through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.