Related papers: Seeing It or Not? Interpretable Vision-aware Latent Steering to Mitigate Object Hallucinations

Seeing It or Not? Interpretable Vision-aware Latent Steering to Mitigate Object Hallucinations

URL: http://arxiv.org/abs/2505.17812v1
Date: Fri, 23 May 2025 12:29:00 GMT
Title: Seeing It or Not? Interpretable Vision-aware Latent Steering to Mitigate Object Hallucinations
Authors: Boxu Chen, Ziwei Zheng, Le Yang, Zeyu Geng, Zhengyu Zhao, Chenhao Lin, Chao Shen,
Abstract summary: Large Vision-Language Models (LVLMs) have achieved remarkable success but continue to struggle with object hallucination (OH)<n>We propose VaLSe, a Vision-aware Latent Steering framework that adopts an interpretation-then-mitigation strategy to address OH in LVLMs.
Score: 11.474045796965056
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable success but continue to struggle with object hallucination (OH), generating outputs inconsistent with visual inputs. While previous work has proposed methods to reduce OH, the visual decision-making mechanisms that lead to hallucinations remain poorly understood. In this paper, we propose VaLSe, a Vision-aware Latent Steering framework that adopts an interpretation-then-mitigation strategy to address OH in LVLMs. By tackling dual challenges of modeling complex vision-language interactions and eliminating spurious activation artifacts, VaLSe can generate visual contribution maps that trace how specific visual inputs influence individual output tokens. These maps reveal the model's vision-aware focus regions, which are then used to perform latent space steering, realigning internal representations toward semantically relevant content and reducing hallucinated outputs. Extensive experiments demonstrate that VaLSe is a powerful interpretability tool and an effective method for enhancing model robustness against OH across multiple benchmarks. Furthermore, our analysis uncovers limitations in existing OH evaluation metrics, underscoring the need for more nuanced, interpretable, and visually grounded OH benchmarks in future work. Code is available at: https://github.com/Ziwei-Zheng/VaLSe.

Related papers

IKOD: Mitigating Visual Attention Degradation in Large Vision-Language Models [20.036659182106806]
We show that Large Vision-Language Models (LVLMs) exhibit a long-term bias where hallucinations increase as the sequence length grows.<n>We propose Image attention-guided Key-value merging cOllaborative Decoding (IKOD), a collaborative decoding strategy generating more image-focused sequences.
arXiv Detail & Related papers (2025-08-05T14:05:15Z)
A Comprehensive Analysis for Visual Object Hallucination in Large Vision-Language Models [30.037505914306504]
Vision-Language Models (LVLMs) demonstrate remarkable capabilities in multimodal tasks.<n>LVLMs generate inaccurate visual object-related information based on the query input, potentially leading to misinformation and concerns about safety and reliability.<n>In this paper, we analyze each component of LLaVA-like LVLMs to identify potential sources of error and their impact.
arXiv Detail & Related papers (2025-05-04T01:47:58Z)
Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models [66.71616369573715]
Large Vision-Language Models (LVLMs) are prone to generating hallucinatory text responses that do not align with the given visual input.<n>We introduce self-correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process.
arXiv Detail & Related papers (2025-02-10T03:43:55Z)
Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks.<n>LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content.<n>We propose an Inter-Modality Correlation Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
arXiv Detail & Related papers (2025-01-03T17:56:28Z)
Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence [69.86946427928511]
We investigate the internal mechanisms driving hallucination in large vision-language models (LVLMs)<n>We introduce Vision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of attention head outputs to visual context.<n>We propose Vision-aware Head Reinforcement (VHR), a training-free approach to mitigate hallucination by enhancing the role of vision-aware attention heads.
arXiv Detail & Related papers (2024-12-18T15:29:30Z)
Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning [151.4060202671114]
multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing vision-language tasks.<n>This paper introduces a novel bottom-up reasoning framework to address hallucinations in MLLMs.<n>Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge.
arXiv Detail & Related papers (2024-12-15T09:10:46Z)
VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding [38.23310445372371]
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal task reasoning.<n>They often generate responses that appear plausible yet do not accurately reflect the visual content, a phenomenon known as hallucination.<n>Recent approaches have introduced training-free methods to mitigate hallucinations by adjusting the decoding strategy during the inference stage.<n>We propose a novel hallucination-mitigation method from the visual encoding perspective: textbfVisutextbfal textbfLayer Fustextbfion Contrastive textbfD
arXiv Detail & Related papers (2024-11-24T13:42:02Z)
VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs. Existing benchmarks are often limited in scope, focusing mainly on object hallucinations. We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding [30.30494071474536]
HALC is a novel decoding algorithm designed to mitigate object hallucinations (OH) in large vision-language models (LVLMs) HALC integrates a robust auto-focal grounding mechanism (locally) to correct hallucinated tokens on the fly, and a specialized beam search algorithm (globally) to significantly reduce OH.
arXiv Detail & Related papers (2024-03-01T10:21:52Z)
Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance [51.30560006045442]
Image-gRounded guIdaNcE (MARINE) is a framework that is both training-free and API-free.<n>MARINE effectively and efficiently reduces object hallucinations during inference by introducing image-grounded guidance to LVLMs.<n>Our framework's flexibility further allows for the integration of multiple vision models, enabling more reliable and robust object-level guidance.
arXiv Detail & Related papers (2024-02-13T18:59:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.