Fugu-MT paper translation (abstract): Attention in Space: Functional Roles of VLM Heads for Spatial Reasoning

Paper overview: Attention in Space: Functional Roles of VLM Heads for Spatial Reasoning

arxiv url: http://arxiv.org/abs/2603.20662v1
Date: Sat, 21 Mar 2026 05:36:12 GMT
Status: translation complete
Last updated in system: 2026-03-24 19:11:39.025839
Title: Attention in Space: Functional Roles of VLM Heads for Spatial Reasoning
Title (reference translation): Attention in Space: The Functional Roles of VLM Heads in Spatial Reasoning
Authors: Xueqi Ma, Shuo Yang, Yanbei Jiang, Shu Liu, Zhenzhen Liu, Jiayang Ao, Xingjun Ma, Sarah Monazam Erfani, James Bailey
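Abstract summary: We introduce CogVSR, a dataset that decomposes complex spatial reasoning questions into step-by-step subquestions. We develop a probing framework to identify and characterize attention heads specialized for the cognitive functions these subquestions target. We also propose methods to activate latent spatial heads, improving spatial understanding.
Reference score (independently computed attention measure): 43.03674069044073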
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite remarkable advances in large Vision-Language Models (VLMs), spatial reasoning remains a persistent challenge. In this work, we investigate how attention heads within VLMs contribute to spatial reasoning by analyzing their functional roles through a mechanistic interpretability lens. We introduce CogVSR, a dataset that decomposes complex spatial reasoning questions into step-by-step subquestions designed to simulate human-like reasoning via a chain-of-thought paradigm, with each subquestion linked to specific cognitive functions such as spatial perception or relational reasoning. Building on CogVSR, we develop a probing framework to identify and characterize attention heads specialized for these functions. Our analysis across diverse VLM families reveals that these functional heads are universally sparse and vary in number and distribution across functions. Notably, spatially specialized heads are fewer than those for other cognitive functions, highlighting their scarcity. We propose methods to activate latent spatial heads, improving spatial understanding. Intervention experiments further demonstrate their critical role in spatial reasoning: removing functional heads leads to performance degradation, while emphasizing them enhances accuracy. This study provides new interpretability-driven insights into how VLMs attend to space and paves the way for enhancing complex spatial reasoning in multimodal models.
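Abstract (reference translation): Despite remarkable progress in large vision-language models (VLMs), spatial reasoning remains a persistent challenge. This study examines how attention heads within VLMs contribute to spatial reasoning by analyzing their functional roles through the lens of mechanistic interpretability. We introduce CogVSR, a dataset that breaks complex spatial reasoning questions down into step-by-step subquestions, simulating human-like reasoning through a chain-of-thought paradigm, with each subquestion tied to a specific cognitive function such as spatial perception or relational reasoning. Building on CogVSR, we develop a probing framework to identify and characterize the attention heads specialized for these functions. Analysis across diverse VLM families shows that these functional heads are universally sparse and differ in number and distribution from one function to another. In particular, spatially specialized heads are fewer in number than those for other cognitive functions, underscoring their scarcity. We propose methods to activate latent spatial heads and thereby improve spatial understanding. Intervention experiments further confirm their critical role in spatial reasoning: removing functional heads degrades performance, while emphasizing them improves accuracy. This work offers new interpretability-driven insights into how VLMs attend to space and paves the way toward stronger complex spatial reasoning in multimodal models.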
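
Illustrative sketch (not from the paper): the abstract describes interventions that remove or emphasize specific attention heads. One common way to express such an intervention is to scale each head's output before the output projection, using 0 to ablate a head and a value above 1 to emphasize it. The toy NumPy example below assumes this per-head scaling form; the function name `multi_head_attention`, the `head_scale` parameter, and all shapes and random weights are hypothetical and do not reflect the authors' implementation or any particular VLM.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, head_scale):
    """Toy self-attention with a per-head scaling hook (hypothetical).

    x:            (seq, d_model)
    w_q/w_k/w_v:  (heads, d_model, d_head)
    w_o:          (heads, d_head, d_model)
    head_scale:   (heads,)  0.0 ablates a head, values > 1.0 emphasize it
    """
    q = np.einsum("sd,hde->hse", x, w_q)
    k = np.einsum("sd,hde->hse", x, w_k)
    v = np.einsum("sd,hde->hse", x, w_v)
    scores = np.einsum("hse,hte->hst", q, k) / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    per_head = np.einsum("hst,hte->hse", attn, v)
    # Intervention point: rescale each head's contribution.
    per_head = per_head * head_scale[:, None, None]
    # Sum the projected head outputs into the model dimension.
    return np.einsum("hse,hed->sd", per_head, w_o)

# Random toy inputs; sizes are arbitrary assumptions.
rng = np.random.default_rng(0)
seq, d_model, heads, d_head = 4, 8, 2, 4
x = rng.normal(size=(seq, d_model))
w_q, w_k, w_v = (rng.normal(size=(heads, d_model, d_head)) for _ in range(3))
w_o = rng.normal(size=(heads, d_head, d_model))

baseline = multi_head_attention(x, w_q, w_k, w_v, w_o, np.ones(heads))
ablated = multi_head_attention(x, w_q, w_k, w_v, w_o, np.array([0.0, 1.0]))
emphasized = multi_head_attention(x, w_q, w_k, w_v, w_o, np.array([2.0, 1.0]))
```

Because the layer output is linear in `head_scale`, doubling head 0 adds exactly the contribution that ablating it removes. In a real model, the effect on task accuracy would have to be measured empirically, as the abstract reports for its intervention experiments.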

Related papers