Investigating The Functional Roles of Attention Heads in Vision Language Models: Evidence for Reasoning Modules
- URL: http://arxiv.org/abs/2512.10300v1
- Date: Thu, 11 Dec 2025 05:42:53 GMT
- Title: Investigating The Functional Roles of Attention Heads in Vision Language Models: Evidence for Reasoning Modules
- Authors: Yanbei Jiang, Xueqi Ma, Shu Liu, Sarah Monazam Erfani, Tongliang Liu, James Bailey, Jey Han Lau, Krista A. Ehinger
- Abstract summary: We introduce CogVision, a dataset that decomposes complex multimodal questions into step-by-step subquestions. Using a probing-based methodology, we identify attention heads that specialize in these functions and characterize them as functional heads. Our analysis reveals that these functional heads are universally sparse, vary in number and distribution across functions, and mediate interactions and hierarchical organization.
- Score: 76.21320451720764
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite excelling on multimodal benchmarks, vision-language models (VLMs) largely remain a black box. In this paper, we propose a novel interpretability framework to systematically analyze the internal mechanisms of VLMs, focusing on the functional roles of attention heads in multimodal reasoning. To this end, we introduce CogVision, a dataset that decomposes complex multimodal questions into step-by-step subquestions designed to simulate human reasoning through a chain-of-thought paradigm, with each subquestion associated with specific receptive or cognitive functions such as high-level visual reception and inference. Using a probing-based methodology, we identify attention heads that specialize in these functions and characterize them as functional heads. Our analysis across diverse VLM families reveals that these functional heads are universally sparse, vary in number and distribution across functions, and mediate interactions and hierarchical organization. Furthermore, intervention experiments demonstrate their critical role in multimodal reasoning: removing functional heads leads to performance degradation, while emphasizing them enhances accuracy. These findings provide new insights into the cognitive organization of VLMs and suggest promising directions for designing models with more human-aligned perceptual and reasoning abilities.
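The abstract's recipe lends itself to a short sketch. Below is a minimal, self-contained illustration of the two ingredients it describes: (1) training a per-head linear probe on function-labeled activations to flag candidate functional heads, and (2) a forward hook that ablates (scale 0) or emphasizes (scale > 1) chosen heads. All shapes, labels, thresholds, and module paths are illustrative assumptions, not the authors' released CogVision code.

```python
# Sketch: probing-based functional-head discovery plus a head-scaling
# intervention. Hypothetical shapes and labels; NOT the authors' code.
import torch
import torch.nn as nn

N_HEADS, HEAD_DIM, N_FUNCTIONS, N_EXAMPLES = 16, 64, 4, 512

# Stand-in for per-head activations captured with forward hooks on a real
# VLM: one vector per (example, head), plus a cognitive-function label
# (e.g., 0 = low-level reception, ..., 3 = inference) per example.
acts = torch.randn(N_EXAMPLES, N_HEADS, HEAD_DIM)
labels = torch.randint(0, N_FUNCTIONS, (N_EXAMPLES,))

def probe_accuracy(head_acts, labels, epochs=50):
    """Fit a linear probe on one head's activations; return fit accuracy."""
    probe = nn.Linear(HEAD_DIM, N_FUNCTIONS)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.cross_entropy(probe(head_acts), labels).backward()
        opt.step()
    return (probe(head_acts).argmax(-1) == labels).float().mean().item()

scores = torch.tensor([probe_accuracy(acts[:, h], labels) for h in range(N_HEADS)])
# Heads whose probes are unusually accurate are candidate functional heads.
functional = (scores > scores.mean() + scores.std()).nonzero().flatten()
print("candidate functional heads:", functional.tolist())

def make_head_scaler(head_ids, scale, n_heads=N_HEADS, head_dim=HEAD_DIM):
    """Forward hook that rescales selected heads in a module whose output's
    last dim is n_heads * head_dim (e.g., the concatenated head outputs
    before the output projection). scale=0.0 ablates; scale>1.0 emphasizes."""
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        shaped = out.reshape(*out.shape[:-1], n_heads, head_dim).clone()
        shaped[..., head_ids, :] *= scale
        new = shaped.reshape(out.shape)
        return (new,) + output[1:] if isinstance(output, tuple) else new
    return hook

# Registration point is architecture-dependent (hypothetical module path):
# handle = model.layers[12].self_attn.register_forward_hook(
#     make_head_scaler([3, 7], scale=0.0))
```

In a real experiment the activations would come from hooks on an actual VLM, and the right place to register the scaling hook depends on the architecture; the key requirement is intercepting a tensor in which per-head contributions are still separable.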
Related papers
- VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization [87.26383908243878]
We show that vision encoders within Multimodal Large Language Models exhibit deficiencies in their dense feature representations. We propose VersaViT, a well-rounded vision transformer that instantiates a novel multi-task framework for collaborative post-training.
arXiv Detail & Related papers (2026-02-10T16:08:19Z) - Cognitive Mirrors: Exploring the Diverse Functional Roles of Attention Heads in LLM Reasoning [54.12174882424842]
Large language models (LLMs) have achieved state-of-the-art performance in a variety of tasks, but remain largely opaque in terms of their internal mechanisms. We propose a novel interpretability framework to systematically analyze the roles and behaviors of attention heads.
arXiv Detail & Related papers (2025-12-03T10:24:34Z) - From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models [66.36007274540113]
Multimodal Large Language Models (MLLMs) strive to achieve a profound, human-like understanding of and interaction with the physical world. However, they often exhibit a shallow and incoherent integration between acquiring information (Perception) and conducting reasoning (Cognition). This survey introduces a novel and unified analytical framework: "From Perception to Cognition".
arXiv Detail & Related papers (2025-09-29T18:25:40Z) - From Black Boxes to Transparent Minds: Evaluating and Enhancing the Theory of Mind in Multimodal Large Language Models [17.235722538085263]
This study adopts an approach based on internal mechanisms to provide an interpretability-driven assessment of Theory of Mind (ToM) in multimodal large language models (MLLMs). We first construct a multimodal ToM test dataset, GridToM, which incorporates diverse belief-testing tasks and perceptual information from multiple perspectives. Next, our analysis shows that attention heads in multimodal large models can distinguish cognitive information across perspectives, providing evidence of ToM capabilities.
arXiv Detail & Related papers (2025-06-17T06:27:42Z) - Caption This, Reason That: VLMs Caught in the Middle [3.4820139118440676]
Vision-Language Models (VLMs) have shown remarkable progress in visual understanding in recent years. They still lag behind human capabilities in specific visual tasks such as counting or relational reasoning. We analyze VLM performance along core cognitive axes: Perception, Attention, and Memory.
arXiv Detail & Related papers (2025-05-24T14:25:48Z) - Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs [65.93003087656754]
VisFactor is a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. We evaluate 20 frontier Multimodal Large Language Models (MLLMs) from GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination.
arXiv Detail & Related papers (2025-02-23T04:21:32Z) - A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs [3.2228025627337864]
This paper introduces a structured evaluation framework to dissect the perception-reasoning interface in Vision-Language Models (VLMs). We propose three distinct evaluation paradigms, mirroring human problem-solving strategies. Applying this framework, we demonstrate that CA, leveraging powerful language models for reasoning over rich, independently generated descriptions, achieves new state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2025-01-23T12:42:42Z) - Attention Heads of Large Language Models: A Survey [10.136767972375639]
We aim to demystify the internal reasoning processes of Large Language Models (LLMs) by systematically exploring the roles and mechanisms of attention heads. We first introduce a novel four-stage framework inspired by the human thought process: Knowledge Recalling, In-Context Identification, Latent Reasoning, and Expression Preparation. We analyze the experimental methodologies used to discover these special heads, dividing them into two categories: Modeling-Free and Modeling-Required methods.
arXiv Detail & Related papers (2024-09-05T17:59:12Z) - Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z) - Multilingual Multi-Aspect Explainability Analyses on Machine Reading Comprehension Models [76.48370548802464]
This paper focuses on conducting a series of analytical experiments to examine the relations between the multi-head self-attention and the final MRC system performance.
We discover that passage-to-question and passage-understanding attentions are the most important in the question-answering process (a toy sketch of this kind of attention aggregation appears after this list).
Through comprehensive visualizations and case studies, we also observe several general findings on the attention maps, which can be helpful to understand how these models solve the questions.
arXiv Detail & Related papers (2021-08-26T04:23:57Z)
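Several of the papers above rest on the same primitive: aggregating attention mass between groups of tokens. As a toy illustration of the MRC study's passage-to-question analysis, the snippet below ranks heads by the mean attention that passage tokens (as queries) place on question tokens (as keys). A random tensor stands in for a real model's attention weights, and the token spans are hypothetical.

```python
# Toy passage-to-question attention aggregation. Random tensor in place of
# real attention weights; token layout is an assumption for illustration.
import torch

N_HEADS, SEQ = 12, 48
attn = torch.softmax(torch.randn(N_HEADS, SEQ, SEQ), dim=-1)  # (head, query, key)

# Hypothetical layout: [CLS] question(1..16) [SEP] passage(18..46) [SEP]
q_idx = torch.arange(1, 17)
p_idx = torch.arange(18, 47)

# Per head: mean attention mass flowing from passage tokens to question
# tokens -- the "passage-to-question" pattern the MRC study highlights.
p2q = attn[:, p_idx][:, :, q_idx].mean(dim=(1, 2))
print("heads ranked by passage-to-question attention:",
      p2q.argsort(descending=True).tolist())
```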