Fugu-MT Paper Translation (Abstract): GeoSense: Internalizing Geometric Necessity Perception for Multimodal Reasoning

Paper overview: GeoSense: Internalizing Geometric Necessity Perception for Multimodal Reasoning

arxiv url: http://arxiv.org/abs/2603.10370v1
Date: Wed, 11 Mar 2026 03:32:12 GMT
Status: Translation complete
Last updated in system: 2026-03-12 16:22:32.763915
Title: GeoSense: Internalizing Geometric Necessity Perception for Multimodal Reasoning
Title (reference translation): GeoSense: Internalizing Geometric Necessity Perception for Multimodal Reasoning
Authors: Ruiheng Liu, Haihong Hao, Mingfei Han, Xin Gu, Kecheng Zhang, Changlin Li, Xiaojun Chang
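Abstract summary: The authors develop a framework to overcome the limited spatial understanding of multimodal large language models (MLLMs). The framework gives the model an awareness of perceptual insufficiency, so that it autonomously engages geometric features during reasoning when 2D cues are judged insufficient.
Reference score (independently computed attention score): 51.63457948949102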
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Advancing towards artificial superintelligence requires rich and intelligent perceptual capabilities. A critical frontier in this pursuit is overcoming the limited spatial understanding of Multimodal Large Language Models (MLLMs), where geometry information is essential. Existing methods often address this by rigidly injecting geometric signals into every input, while ignoring their necessity and adding computation overhead. Contrary to this paradigm, our framework endows the model with an awareness of perceptual insufficiency, empowering it to autonomously engage geometric features in reasoning when 2D cues are deemed insufficient. To achieve this, we first introduce an independent geometry input channel to the model architecture and conduct alignment training, enabling the effective utilization of geometric features. Subsequently, to endow the model with perceptual awareness, we curate a dedicated spatial-aware supervised fine-tuning dataset. This serves to activate the model's latent internal cues, empowering it to autonomously determine the necessity of geometric information. Experiments across multiple spatial reasoning benchmarks validate this approach, demonstrating significant spatial gains without compromising 2D visual reasoning capabilities, offering a path toward more robust, efficient and self-aware multi-modal intelligence.
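Abstract (reference translation): Advancing toward artificial superintelligence requires rich and intelligent perceptual capabilities. A key frontier in this pursuit is overcoming the limited spatial understanding of multimodal large language models (MLLMs), for which geometric information is essential. Existing methods often address this by rigidly injecting geometric signals into every input, ignoring whether they are needed and adding computational overhead. In contrast to this paradigm, our framework gives the model an awareness of perceptual insufficiency, enabling it to autonomously engage geometric features in reasoning when 2D cues are judged insufficient. To achieve this, we first introduce an independent geometry input channel into the model architecture and perform alignment training so that geometric features can be used effectively. Then, to give the model perceptual awareness, we curate a dedicated spatial-aware supervised fine-tuning dataset. This activates the model's latent internal cues and enables it to autonomously determine whether geometric information is necessary. Experiments on multiple spatial reasoning benchmarks validate this approach, showing significant spatial gains without compromising 2D visual reasoning capabilities and pointing to a path toward more robust, efficient, and self-aware multimodal intelligence.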


Related papers