Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration
- URL: http://arxiv.org/abs/2505.21472v1
- Date: Tue, 27 May 2025 17:45:21 GMT
- Title: Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration
- Authors: Mehrdad Fazli, Bowen Wei, Ziwei Zhu
- Abstract summary: Large vision-language models (LVLMs) achieve impressive performance on multimodal tasks but often suffer from hallucination. We introduce the Confidence-Aware Attention Calibration (CAAC) framework to address this challenge.
- Score: 1.7373859011890633
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large vision-language models (LVLMs) achieve impressive performance on multimodal tasks but often suffer from hallucination, confidently describing objects or attributes not present in the image. Current inference-time interventions, while training-free, struggle to maintain accuracy in open-ended and long-form generation scenarios. We introduce the Confidence-Aware Attention Calibration (CAAC) framework to address this challenge by targeting two key biases: spatial perception bias, which distributes attention disproportionately across image tokens, and modality bias, which shifts focus from visual to textual inputs over time. CAAC employs a two-step approach: Visual-Token Calibration (VTC) to balance attention across visual tokens, and Adaptive Attention Re-Scaling (AAR) to reinforce visual grounding based on the model's confidence. This confidence-driven adjustment ensures consistent visual alignment during generation. Experiments on CHAIR, AMBER, and POPE benchmarks demonstrate that CAAC outperforms baselines, particularly in long-form generations, effectively reducing hallucination.
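The CAAC pipeline is easy to picture as two transformations of a single attention distribution. Below is a minimal NumPy sketch under stated assumptions: the function names, the blend weight `alpha`, and the confidence-to-gain rule are illustrative stand-ins, not the authors' implementation.

```python
# Toy sketch of the two CAAC-style steps on one attention row.
import numpy as np

def visual_token_calibration(attn, visual_idx, alpha=0.5):
    """VTC-like step: blend attention over visual tokens toward a uniform
    distribution to counter spatial perception bias."""
    attn = attn.copy()
    v = attn[visual_idx]
    uniform = v.sum() / len(visual_idx)             # equal share of the visual mass
    attn[visual_idx] = (1 - alpha) * v + alpha * uniform
    return attn

def adaptive_rescale(attn, visual_idx, confidence, beta=1.5):
    """AAR-like step: when model confidence is low, scale up the total
    attention mass on visual tokens, then renormalize."""
    attn = attn.copy()
    gain = 1.0 + (beta - 1.0) * (1.0 - confidence)  # low confidence -> larger gain
    attn[visual_idx] *= gain
    return attn / attn.sum()

rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(6))      # 6 tokens; the first 3 are image tokens
visual_idx = np.arange(3)
confidence = 0.3                      # e.g. max next-token softmax probability
attn = visual_token_calibration(attn, visual_idx)
attn = adaptive_rescale(attn, visual_idx, confidence)
print(attn, attn.sum())
```

The shape of the computation is the point: first even out attention across image tokens, then lift the total visual mass whenever the model is unsure, which is what keeps visual grounding consistent over long generations.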
Related papers
- Not All Tokens and Heads Are Equally Important: Dual-Level Attention Intervention for Hallucination Mitigation [46.3194503355054]
Large vision-language models (LVLMs) have shown remarkable capabilities across a wide range of multimodal tasks. However, they remain prone to visual hallucination (VH), often producing confident but incorrect descriptions of visual content. We present VisFlow, an efficient and training-free framework designed to mitigate VH by directly manipulating attention patterns during inference.
arXiv Detail & Related papers (2025-06-14T19:10:22Z)
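The summary does not spell out VisFlow's selection rules, so the torch sketch below only illustrates what a dual-level (head-level plus token-level) attention intervention can look like; the top-k head choice and the gain factor are assumptions.

```python
# Assumed sketch of a dual-level attention intervention.
import torch

def intervene(attn, visual_idx, head_gain=2.0, top_k_heads=4):
    """attn: (num_heads, seq_len) attention of the current query token."""
    # Head level: pick the heads that already attend most to the image.
    visual_mass = attn[:, visual_idx].sum(dim=-1)
    heads = visual_mass.topk(min(top_k_heads, attn.size(0))).indices
    # Token level: amplify their attention to visual tokens, renormalize.
    attn = attn.clone()
    attn[heads[:, None], visual_idx[None, :]] *= head_gain
    return attn / attn.sum(dim=-1, keepdim=True)

attn = torch.softmax(torch.randn(8, 16), dim=-1)    # 8 heads, 16 tokens
out = intervene(attn, visual_idx=torch.arange(6))   # first 6 tokens are visual
print(out.sum(dim=-1))                              # each row still sums to 1
```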
- Gaze-Guided Learning: Avoiding Shortcut Bias in Visual Classification [3.1208151315473622]
We introduce Gaze-CIFAR-10, a human gaze time-series dataset, along with a dual-sequence gaze encoder. In parallel, a Vision Transformer (ViT) is employed to learn the sequential representation of image content. Our framework integrates human gaze priors with machine-derived visual sequences, effectively correcting inaccurate localization in image feature representations.
arXiv Detail & Related papers (2025-04-08T00:40:46Z)
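As a loose illustration of combining gaze priors with ViT features, here is an assumed sketch in which a small GRU encodes the fixation sequence into per-patch weights; the actual Gaze-CIFAR-10 encoder and fusion rule may differ.

```python
# Hypothetical gaze-to-patch-weight fusion.
import torch
import torch.nn as nn

gaze_encoder = nn.GRU(input_size=2, hidden_size=64, batch_first=True)
to_weights = nn.Linear(64, 49)                      # one weight per ViT patch

gaze = torch.rand(8, 20, 2)                         # (batch, fixations, xy)
patches = torch.randn(8, 49, 192)                   # ViT patch embeddings

_, h = gaze_encoder(gaze)                           # h: (1, batch, 64)
w = torch.softmax(to_weights(h.squeeze(0)), dim=-1) # gaze-derived patch weights
pooled = (w.unsqueeze(-1) * patches).sum(dim=1)     # gaze-guided image feature
print(pooled.shape)                                 # (8, 192)
```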
- Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models [85.51753014478315]
We introduce AdaptPrune, a novel plug-and-play training-free pruning method. It builds on conventional attention-based pruning by integrating spatial distance and token similarity with an adaptive NMS approach. Our approach ensures a comprehensive evaluation of token importance and substantially refines the pruning decisions.
arXiv Detail & Related papers (2025-03-11T03:58:17Z)
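A hedged sketch of attention-plus-NMS pruning follows; the grid layout, thresholds, and greedy loop are illustrative guesses rather than AdaptPrune's actual algorithm.

```python
# Greedy NMS-style pruning over attention, similarity, and spatial cues.
import torch

def prune_tokens(feats, attn, grid=16, keep=64, sim_thresh=0.9, dist_thresh=2.0):
    """feats: (N, D) visual token features; attn: (N,) attention received."""
    N = feats.size(0)
    ys = (torch.arange(N) // grid).float()           # token row on the patch grid
    xs = (torch.arange(N) % grid).float()            # token column on the patch grid
    feats_n = torch.nn.functional.normalize(feats, dim=-1)
    kept = []
    for i in attn.argsort(descending=True).tolist(): # most-attended first
        redundant = False
        for j in kept:
            sim = float(feats_n[i] @ feats_n[j])                     # appearance cue
            dist = ((xs[i] - xs[j])**2 + (ys[i] - ys[j])**2).sqrt()  # spatial cue
            if sim > sim_thresh and dist < dist_thresh:
                redundant = True                     # near-duplicate of a kept token
                break
        if not redundant:
            kept.append(i)
        if len(kept) == keep:
            break
    return torch.tensor(kept)

feats, attn = torch.randn(256, 64), torch.rand(256)
print(prune_tokens(feats, attn).numel())             # number of tokens retained
```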
- Mitigating Object Hallucinations in Large Vision-Language Models via Attention Calibration [22.39558434131574]
Large Vision-Language Models (LVLMs) often generate responses that are not factually aligned with the visual content. We introduce a training-free solution, Uniform Attention Calibration (UAC), which estimates the bias from a single meaningless input image. We also introduce a fine-tuning solution, Dynamic Attention Calibration (DAC), which enforces consistent outputs regardless of where the object is located in the image.
arXiv Detail & Related papers (2025-02-04T03:27:38Z)
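The calibrate-from-a-meaningless-input idea can be mocked up in a few lines. In this sketch the model is a stand-in function with a synthetic spatial bias; only the estimate-then-subtract pattern reflects the summary.

```python
# Mock UAC-style calibration with a stand-in attention function.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_logits(image_tokens):   # stand-in for one LVLM attention head
    rng = np.random.default_rng(0)    # fixed seed: deterministic head
    pos_bias = np.linspace(0.0, 1.5, len(image_tokens))  # synthetic spatial bias
    return pos_bias + 0.1 * rng.standard_normal(len(image_tokens)) + image_tokens

blank = np.zeros(16)                  # meaningless input image
real = np.random.default_rng(1).normal(size=16)

bias = attention_logits(blank)        # bias estimated once from the blank input
raw = softmax(attention_logits(real))
calibrated = softmax(attention_logits(real) - bias)
print(raw.round(3), calibrated.round(3))
```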
- PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model [0.0]
Hallucinations often arise from the progressive weakening of attention weight to visual tokens. PAINT (Paying Attention to INformed Tokens) is a plug-and-play framework that intervenes in the self-attention mechanism of Large Vision-Language Models.
arXiv Detail & Related papers (2025-01-21T15:22:31Z)
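PAINT's exact intervention is not given in the summary; the toy below shows one way to counteract decaying visual attention, by enforcing a floor on the total attention mass assigned to image tokens. The floor value and rescaling rule are assumptions.

```python
# Enforce a minimum total attention mass on visual tokens.
import torch

def keep_visual_mass(attn, visual_idx, floor=0.25):
    """attn: (seq_len,) attention of the current query token."""
    v_mass = attn[visual_idx].sum()
    if v_mass >= floor:
        return attn
    attn = attn.clone()
    attn[visual_idx] *= floor / v_mass                    # lift visual mass to the floor
    text_idx = torch.ones_like(attn, dtype=torch.bool)
    text_idx[visual_idx] = False
    attn[text_idx] *= (1 - floor) / attn[text_idx].sum()  # shrink the rest to match
    return attn

attn = torch.softmax(torch.randn(20), dim=-1)
fixed = keep_visual_mass(attn, visual_idx=torch.arange(8))
print(fixed[:8].sum())   # >= 0.25 by construction
```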
- Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks. However, LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. We propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
arXiv Detail & Related papers (2025-01-03T17:56:28Z)
- Adaptive Prompt Tuning: Vision Guided Prompt Tuning with Cross-Attention for Fine-Grained Few-Shot Learning [5.242869847419834]
Few-shot, fine-grained classification in computer vision poses significant challenges due to the need to differentiate subtle class distinctions with limited data. This paper presents a novel method that enhances the Contrastive Language-Image Pre-Training (CLIP) model through adaptive prompt tuning, guided by cross-attention over visual features.
arXiv Detail & Related papers (2024-12-19T08:51:01Z)
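As a schematic of vision-guided prompt tuning, the sketch below lets learnable prompt tokens cross-attend to image patch features; the single-layer design and all dimensions are assumptions rather than the paper's architecture.

```python
# Hypothetical image-conditioned prompt module.
import torch
import torch.nn as nn

class VisionGuidedPrompt(nn.Module):
    def __init__(self, n_prompts=4, dim=512, heads=8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patch_feats):                # (B, num_patches, dim)
        B = patch_feats.size(0)
        p = self.prompts.unsqueeze(0).repeat(B, 1, 1)
        adapted, _ = self.cross_attn(p, patch_feats, patch_feats)
        return p + adapted                          # image-conditioned prompt tokens

prompts = VisionGuidedPrompt()(torch.randn(2, 49, 512))
print(prompts.shape)                                # (2, 4, 512)
```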
- Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence [69.86946427928511]
We investigate the internal mechanisms driving hallucination in large vision-language models (LVLMs). We introduce Vision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of attention head outputs to visual context. We propose Vision-aware Head Reinforcement (VHR), a training-free approach to mitigate hallucination by enhancing the role of vision-aware attention heads.
arXiv Detail & Related papers (2024-12-18T15:29:30Z)
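A VHD-style score is easy to approximate: run one attention head with and without the visual tokens and measure how much its output moves. In the sketch below, the relative L2 shift and the masking scheme are illustrative choices, not the paper's exact metric.

```python
# Per-head sensitivity to visual context.
import torch

def head_output(q, K, V, mask=None):
    scores = q @ K.t() / K.size(-1) ** 0.5
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V

def vhd_score(q, K, V, visual_idx):
    full = head_output(q, K, V)
    mask = torch.ones(K.size(0), dtype=torch.bool)
    mask[visual_idx] = False                     # drop the visual context
    no_vis = head_output(q, K, V, mask)
    return (full - no_vis).norm() / full.norm()  # relative output shift

q = torch.randn(1, 64)
K, V = torch.randn(32, 64), torch.randn(32, 64)
print(vhd_score(q, K, V, visual_idx=torch.arange(12)))
```

A VHR-like step would then rescale the contribution of heads with high scores; that part is omitted here.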
- Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models [16.185253476874006]
Large Vision Language Models (LVLMs) demonstrate strong capabilities in visual understanding and description, yet often suffer from hallucinations. We propose Attentional Vision Calibration (AvisC), a test-time approach that recalibrates the influence of blind tokens without modifying the underlying attention mechanism. Experiments on standard benchmarks, including POPE, MME, and AMBER, demonstrate that AvisC effectively reduces hallucinations in LVLMs.
arXiv Detail & Related papers (2024-05-28T04:40:57Z)
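The summary does not define how blind tokens are detected; the sketch below assumes a simple z-score outlier rule over visual-token attention and a fixed damping factor, purely for illustration.

```python
# Dampen visual tokens that absorb outlier attention, then renormalize.
import torch

def recalibrate(attn, visual_idx, z_thresh=1.5, damp=0.5):
    attn = attn.clone()
    v = attn[visual_idx]
    z = (v - v.mean()) / (v.std() + 1e-8)
    v[z > z_thresh] *= damp           # suspected "blind" tokens
    attn[visual_idx] = v
    return attn / attn.sum()

attn = torch.softmax(torch.randn(24) * 2, dim=-1)
out = recalibrate(attn, visual_idx=torch.arange(10))
print(out.sum())                      # still a valid distribution
```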
- Dual-Image Enhanced CLIP for Zero-Shot Anomaly Detection [58.228940066769596]
We introduce a Dual-Image Enhanced CLIP approach, leveraging a joint vision-language scoring system.
Our methods process pairs of images, utilizing each as a visual reference for the other, thereby enriching the inference process with visual context.
Our approach exploits the potential of joint vision-language anomaly detection and performs comparably to current SOTA methods across various datasets.
arXiv Detail & Related papers (2024-05-08T03:13:20Z)
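A mock of the dual-image scoring pattern, with random vectors standing in for CLIP embeddings: each image is scored against anomaly/normal text prompts and against its partner image as a visual reference. The fusion weight is an assumption.

```python
# Paired-image, vision-language anomaly scoring with stand-in embeddings.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def anomaly_score(img, ref, txt_anom, txt_norm, w=0.5):
    lang = cosine(img, txt_anom) - cosine(img, txt_norm)  # language evidence
    vis = 1.0 - cosine(img, ref)                          # deviation from the reference
    return w * lang + (1 - w) * vis

rng = np.random.default_rng(0)
img_a, img_b = rng.normal(size=512), rng.normal(size=512)
txt_anom, txt_norm = rng.normal(size=512), rng.normal(size=512)
print(anomaly_score(img_a, img_b, txt_anom, txt_norm))
print(anomaly_score(img_b, img_a, txt_anom, txt_norm))    # the pair scores each other
```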
- More Than Just Attention: Learning Cross-Modal Attentions with Contrastive Constraints [63.08768589044052]
We propose Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints to address this limitation.
CCR and CCS constraints supervise the training of attention models in a contrastive learning manner without requiring explicit attention annotations.
Experiments on both the Flickr30k and MS-COCO datasets demonstrate that integrating these attention constraints into two state-of-the-art attention-based models improves model performance.
arXiv Detail & Related papers (2021-05-20T08:48:10Z)
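The CCR/CCS losses are not spelled out in the summary; the sketch below shows only the generic pattern of a margin-based contrastive constraint on cross-modal attention, with an assumed attention score.

```python
# Margin loss: matched captions should attend more sharply than swapped ones.
import torch

def attention_score(img_feats, txt_feats):
    """Mean over text tokens of their max attention to any image region."""
    attn = torch.softmax(txt_feats @ img_feats.t() / img_feats.size(-1) ** 0.5, dim=-1)
    return attn.max(dim=-1).values.mean()

def contrastive_attention_loss(img, txt_pos, txt_neg, margin=0.2):
    pos = attention_score(img, txt_pos)   # matched caption
    neg = attention_score(img, txt_neg)   # swapped / mismatched caption
    return torch.relu(margin + neg - pos)

img = torch.randn(36, 256)                # 6x6 grid of image region features
print(contrastive_attention_loss(img, torch.randn(12, 256), torch.randn(12, 256)))
```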
- Visual Alignment Constraint for Continuous Sign Language Recognition [74.26707067455837]
Vision-based Continuous Sign Language Recognition (CSLR) aims to recognize unsegmented gestures from image sequences.
In this work, we revisit the overfitting problem in recent CTC-based CSLR works and attribute it to the insufficient training of the feature extractor.
We propose a Visual Alignment Constraint (VAC) to enhance the feature extractor with more alignment supervision.
arXiv Detail & Related papers (2021-04-06T07:24:58Z)
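VAC's extra supervision can be sketched as a second CTC loss attached directly to the extractor's features, so alignment gradients do not have to pass through the temporal model; the sizes and the unweighted loss sum here are illustrative.

```python
# Auxiliary CTC supervision on the visual feature extractor.
import torch
import torch.nn as nn

T, B, C = 40, 2, 10                    # frames, batch, classes (incl. blank)
feats = torch.randn(T, B, 128, requires_grad=True)  # visual extractor output
temporal = nn.LSTM(128, 128)
main_head, aux_head = nn.Linear(128, C), nn.Linear(128, C)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
targets = torch.randint(1, C, (B, 6))               # gloss labels
in_lens, tgt_lens = torch.full((B,), T), torch.full((B,), 6)

seq, _ = temporal(feats)
main_loss = ctc(main_head(seq).log_softmax(-1), targets, in_lens, tgt_lens)
aux_loss = ctc(aux_head(feats).log_softmax(-1), targets, in_lens, tgt_lens)
(main_loss + aux_loss).backward()      # extractor receives direct alignment signal
print(main_loss.item(), aux_loss.item())
```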
This list is automatically generated from the titles and abstracts of the papers on this site.