Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs
- URL: http://arxiv.org/abs/2510.17771v1
- Date: Mon, 20 Oct 2025 17:31:09 GMT
- Title: Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs
- Authors: Zhining Liu, Ziyi Chen, Hui Liu, Chen Luo, Xianfeng Tang, Suhang Wang, Joy Zeng, Zhenwei Dai, Zhan Shi, Tianxin Wei, Benoit Dumoulin, Hanghang Tong
- Abstract summary: Vision-Language Models (VLMs) achieve strong results on multimodal tasks such as visual question answering, yet they can still fail even when the correct visual evidence is present. We show that shallow layers focus primarily on text, while deeper layers sparsely but reliably attend to localized evidence regions. We introduce an inference-time intervention that highlights deep-layer evidence regions through selective attention-based masking.
- Score: 72.8370367403852
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) achieve strong results on multimodal tasks such as visual question answering, yet they can still fail even when the correct visual evidence is present. In this work, we systematically investigate whether these failures arise from not perceiving the evidence or from not leveraging it effectively. By examining layer-wise attention dynamics, we find that shallow layers focus primarily on text, while deeper layers sparsely but reliably attend to localized evidence regions. Surprisingly, VLMs often perceive the visual evidence even when outputting incorrect answers, a phenomenon we term "seeing but not believing" that is widespread across major VLM families. Building on this, we introduce an inference-time intervention that highlights deep-layer evidence regions through selective attention-based masking. It requires no training and consistently improves accuracy across multiple families, including LLaVA, Qwen, Gemma, and InternVL. These results show that VLMs encode reliable evidence internally but under-utilize it; making such signals explicit can bridge the gap between perception and reasoning, advancing the diagnostic understanding and reliability of VLMs.
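As a rough illustration of the kind of intervention the abstract describes, the sketch below turns a deep-layer attention map over image patches into a binary evidence mask and applies it to the input before a second inference pass. The patch size, keep ratio, and top-k selection are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch of selective attention-based masking (illustrative, not the
# authors' exact method). Assumes a 224x224 image tokenized into 14x14 patches.
import torch

def evidence_mask(attn: torch.Tensor, keep_ratio: float = 0.2) -> torch.Tensor:
    """attn: (num_patches,) deep-layer attention from the answer token to patches."""
    k = max(1, int(keep_ratio * attn.numel()))
    mask = torch.zeros_like(attn)
    mask[torch.topk(attn, k).indices] = 1.0  # keep only the most-attended patches
    return mask

def apply_patch_mask(image: torch.Tensor, mask: torch.Tensor, patch: int = 14) -> torch.Tensor:
    """image: (3, H, W); mask laid out on an (H//patch, W//patch) grid."""
    _, H, W = image.shape
    grid = mask.view(H // patch, W // patch)
    up = grid.repeat_interleave(patch, dim=0).repeat_interleave(patch, dim=1)
    return image * up  # zero out everything outside the evidence regions

image = torch.rand(3, 224, 224)               # dummy image
attn = torch.rand(16 * 16).softmax(dim=0)     # dummy attention over a 16x16 grid
masked = apply_patch_mask(image, evidence_mask(attn))  # fed back for a 2nd pass
```

In practice `attn` would be extracted from a chosen deep layer of the actual VLM; the dummy tensors here only make the sketch runnable.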
Related papers
- Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs [55.61018839017648]
Chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks. Existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies. We propose SAYO, a visual reasoning model trained with a reinforcement learning framework that introduces a region-level visual attention-based reward.
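One plausible form of such a region-level reward, shown purely as an illustrative stand-in for SAYO's actual objective, is the fraction of attention mass that falls inside an annotated evidence region:

```python
import numpy as np

def attention_region_reward(attn_grid: np.ndarray, region_mask: np.ndarray) -> float:
    """attn_grid: (h, w) attention over patches; region_mask: (h, w) binary evidence box."""
    return float((attn_grid * region_mask).sum() / max(attn_grid.sum(), 1e-8))

attn = np.random.rand(16, 16); attn /= attn.sum()   # dummy normalized attention
mask = np.zeros((16, 16)); mask[4:8, 4:8] = 1.0     # hypothetical annotated region
reward = attention_region_reward(attn, mask)        # used as one RL reward term
```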
arXiv Detail & Related papers (2026-02-09T03:33:23Z)
- Detecting Misbehaviors of Large Vision-Language Models by Evidential Uncertainty Quantification [27.02252748004729]
Large vision-language models (LVLMs) have shown substantial advances in multimodal understanding and generation. Yet they frequently produce unreliable or even harmful content, such as fact hallucinations or dangerous instructions. Evidential Uncertainty Quantification (EUQ) captures both information conflict and ignorance for effective detection of LVLM misbehaviors.
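Methods in this family typically place a Dirichlet distribution over class probabilities and read "ignorance" (vacuity) and "conflict" (dissonance) off the collected evidence. The sketch below uses the standard subjective-logic formulas as an assumption; the paper's exact estimator may differ.

```python
import numpy as np

def vacuity_and_dissonance(evidence: np.ndarray):
    """evidence: (K,) non-negative per-class evidence; alpha = evidence + 1 (Dirichlet)."""
    K = evidence.size
    alpha = evidence + 1.0
    S = alpha.sum()
    belief = evidence / S
    vacuity = K / S   # high when total evidence is low -> ignorance
    dissonance = 0.0  # belief spread over mutually conflicting classes
    for i in range(K):
        others = np.delete(belief, i)
        if others.sum() > 0:
            balance = 1.0 - np.abs(others - belief[i]) / (others + belief[i] + 1e-12)
            dissonance += belief[i] * (others * balance).sum() / others.sum()
    return vacuity, dissonance

print(vacuity_and_dissonance(np.array([0.1, 0.1, 0.1])))    # mostly ignorance
print(vacuity_and_dissonance(np.array([10.0, 10.0, 0.0])))  # strong conflict
```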
arXiv Detail & Related papers (2026-02-05T10:51:39Z)
- [De|Re]constructing VLMs' Reasoning in Counting [2.1856941852799134]
We study the reasoning skills of seven state-of-the-art Vision-Language Models (VLMs) in the counting task under controlled experimental conditions. A layer-wise analysis reveals that errors are due to incorrect mapping of the last-layer representation into the output space. Our targeted training shows that fine-tuning just the output layer improves accuracy by up to 21%.
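The reported fix is conceptually simple: freeze every parameter except the output layer and fine-tune as usual. A minimal sketch with a placeholder model (not the paper's architecture):

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 1000))
for p in model.parameters():
    p.requires_grad = False          # freeze the whole network...
for p in model[-1].parameters():
    p.requires_grad = True           # ...then unfreeze only the output layer

opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)
x, y = torch.randn(8, 512), torch.randint(0, 1000, (8,))  # dummy batch
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()                      # gradients flow only into the output layer
opt.step()
```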
arXiv Detail & Related papers (2025-10-22T13:08:47Z)
- Can VLMs Recall Factual Associations From Visual References? [30.821053378797007]
We identify a systematic deficiency in the multimodal grounding of Vision Language Models (VLMs). Forcing VLMs to rely on image representations of an entity halves their ability to recall factual knowledge. We show that such linking failures are correlated with the expression of distinct patterns in model internal states.
arXiv Detail & Related papers (2025-08-22T16:47:37Z)
- Grounded Chain-of-Thought for Multimodal Large Language Models [66.04061083611863]
We propose a new learning task for multimodal large language models (MLLMs) called Grounded Chain-of-Thought (GCoT). GCoT aims to help MLLMs recognize and ground the relevant visual cues step by step, so that the correct answer is predicted with grounding coordinates as an intuitive basis. To facilitate this task, we also carefully design and construct a dataset called multimodal grounded chain-of-thought (MM-GCoT), consisting of 24,022 GCoT examples for 5,033 images.
arXiv Detail & Related papers (2025-03-17T04:07:47Z)
- Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas [69.56484419619919]
We study the spatial reasoning challenge through the lens of mechanistic interpretability. We observe that successful spatial reasoning correlates strongly with the model's ability to align its attention with actual object locations. Motivated by these findings, we propose ADAPTVIS, which sharpens attention on highly relevant regions when the model is confident.
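The described mechanism can be approximated with confidence-dependent temperature scaling: a low temperature sharpens attention when the model is confident, a high one smooths it otherwise. The threshold and temperatures below are illustrative assumptions, not ADAPTVIS's published settings:

```python
import torch

def adaptive_attention(scores: torch.Tensor, confidence: float,
                       t_sharp: float = 0.5, t_smooth: float = 2.0,
                       threshold: float = 0.7) -> torch.Tensor:
    """scores: pre-softmax attention logits over image tokens."""
    t = t_sharp if confidence >= threshold else t_smooth
    return torch.softmax(scores / t, dim=-1)

scores = torch.tensor([2.0, 1.0, 0.2, 0.1])        # dummy logits
print(adaptive_attention(scores, confidence=0.9))  # sharpened on likely evidence
print(adaptive_attention(scores, confidence=0.3))  # smoothed to widen the focus
```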
arXiv Detail & Related papers (2025-03-03T17:57:03Z)
- Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs [65.93003087656754]
VisFactor is a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. We evaluate 20 frontier Multimodal Large Language Models (MLLMs) from the GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination.
arXiv Detail & Related papers (2025-02-23T04:21:32Z)
- Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning [151.4060202671114]
Multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing vision-language tasks. This paper introduces a novel bottom-up reasoning framework to address hallucinations in MLLMs. Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge.
arXiv Detail & Related papers (2024-12-15T09:10:46Z)
- Have the VLMs Lost Confidence? A Study of Sycophancy in VLMs [44.56018149475948]
Sycophancy is a prevalent hallucination that poses significant challenges to visual language models (VLMs).
We propose a synthetic dataset for training and employ methods based on prompts, supervised fine-tuning, and DPO to mitigate sycophancy.
Our findings indicate that the ability to prevent sycophancy is predominantly observed in higher layers of the model.
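Of the mitigation methods named, DPO is the easiest to make concrete. Below is the standard DPO objective on log-probabilities of chosen vs. rejected responses under the policy and a frozen reference model; the paper's actual training setup is not reproduced here:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """Standard DPO: prefer the non-sycophantic (chosen) response over the sycophantic one."""
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()

# Dummy sequence log-probs: (policy chosen, policy rejected, ref chosen, ref rejected)
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-9.0]),
                torch.tensor([-13.0]), torch.tensor([-8.5]))
```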
arXiv Detail & Related papers (2024-10-15T05:48:14Z)
- MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection [107.15164718585666]
We investigate the root cause of VLMs' biased prediction under the open vocabulary detection context.
Our observations lead to a simple yet effective paradigm, termed MarvelOVD, that generates significantly better training targets.
Our method outperforms other state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2024-07-31T09:23:57Z)