Fugu-MT Paper Translation (Summary): 3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models

Paper summary: 3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models

arxiv url: http://arxiv.org/abs/2603.07751v1
Date: Sun, 08 Mar 2026 17:57:56 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:15.191596
Title: 3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models
Title (reference translation): 3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Viewpoints in Vision-Language Models
Authors: Shaoxiong Zhan, Yanlin Lai, Zheng Liu, Hai Lin, Shen Li, Xiaodong Cai, Zijian Lin, Wen Huang, Hai-Tao Zheng
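Abstract summary: Vision-language models fail to construct coherent 3D mental representations from 2D observations. We introduce 3ViewSense, a framework that grounds spatial reasoning in orthographic views. Experimental results on spatial reasoning benchmarks show that the proposed method significantly outperforms existing baselines.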
Reference score (independently computed attention score): 16.924616915709123
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current Large Language Models have achieved Olympiad-level logic, yet Vision-Language Models paradoxically falter on elementary spatial tasks like block counting. This capability mismatch reveals a critical "spatial intelligence gap," where models fail to construct coherent 3D mental representations from 2D observations. We uncover this gap via diagnostic analyses showing the bottleneck is a missing view-consistent spatial interface rather than insufficient visual features or weak reasoning. To bridge this, we introduce 3ViewSense, a framework that grounds spatial reasoning in Orthographic Views. Drawing on engineering cognition, we propose a "Simulate-and-Reason" mechanism that decomposes complex scenes into canonical orthographic projections to resolve geometric ambiguities. By aligning egocentric perceptions with these allocentric references, our method facilitates explicit mental rotation and reconstruction. Empirical results on spatial reasoning benchmarks demonstrate that our method significantly outperforms existing baselines, with consistent gains on occlusion-heavy counting and view-consistent spatial reasoning. The framework also improves the stability and consistency of spatial descriptions, offering a scalable path toward stronger spatial intelligence in multimodal systems.
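Abstract (reference translation): Current large language models have reached Olympiad-level logic, yet vision-language models paradoxically fail at basic spatial tasks such as block counting. This capability mismatch reveals a critical "spatial intelligence gap," in which models fail to build coherent 3D mental representations from 2D observations. Diagnostic analyses show that the gap stems from a missing view-consistent spatial interface rather than from insufficient visual features or weak reasoning. To bridge it, we introduce 3ViewSense, a framework that grounds spatial reasoning in orthographic views. Drawing on engineering cognition, we propose a "Simulate-and-Reason" mechanism that decomposes complex scenes into canonical orthographic projections to resolve geometric ambiguities. By aligning egocentric perceptions with these allocentric references, the method facilitates explicit mental rotation and reconstruction. Experimental results on spatial reasoning benchmarks show that the proposed method significantly outperforms existing baselines, with consistent gains on occlusion-heavy counting and view-consistent spatial reasoning. The framework also improves the stability and consistency of spatial descriptions, offering a scalable path toward stronger spatial intelligence in multimodal systems.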
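
The abstract's point that orthographic projections can resolve ambiguities in block counting can be illustrated with a toy example. The sketch below is not the 3ViewSense pipeline; the scene encoding (a set of unit cubes at integer coordinates) and the helper name `orthographic_views` are assumptions made purely for illustration. It shows two scenes that look identical from the front but differ in the top and side views, because one block is hidden behind another.

```python
# Illustrative only: a toy block scene, not the method from the paper.
# A scene is a set of occupied unit cubes at integer (x, y, z) positions,
# with y pointing away from the viewer and z pointing up (assumed convention).

def orthographic_views(blocks):
    """Return the front, side, and top silhouettes of a block scene."""
    front = {(x, z) for x, y, z in blocks}  # project along y (depth)
    side = {(y, z) for x, y, z in blocks}   # project along x
    top = {(x, y) for x, y, z in blocks}    # project along z (height)
    return front, side, top

# scene_b adds one block directly behind the block at (0, 0, 0).
scene_a = {(0, 0, 0), (1, 0, 0), (0, 0, 1)}
scene_b = scene_a | {(0, 1, 0)}

front_a, side_a, top_a = orthographic_views(scene_a)
front_b, side_b, top_b = orthographic_views(scene_b)

print("front views match:", front_a == front_b)  # the hidden block is invisible
print("top views match:  ", top_a == top_b)      # the top view exposes it
print("side views match: ", side_a == side_b)    # so does the side view
```

In this toy setting a single frontal view cannot distinguish a three-block scene from a four-block one, while the three views taken together can, which is the kind of occlusion ambiguity the abstract describes.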

Related papers