Fugu-MT Paper Translation (Abstract): Perception-Aware Multimodal Spatial Reasoning from Monocular Images

Paper abstract: Perception-Aware Multimodal Spatial Reasoning from Monocular Images

arxiv url: http://arxiv.org/abs/2603.06985v1
Date: Sat, 07 Mar 2026 02:05:12 GMT
Status: Translation complete
System update date: 2026-03-10 15:13:13.521261
Title: Perception-Aware Multimodal Spatial Reasoning from Monocular Images
Title (reference translation): Perception-Aware Multimodal Spatial Reasoning from Monocular Images
Authors: Yanchun Cheng, Rundong Wang, Xulei Yang, Alok Prakash, Daniela Rus, Marcelo H Ang, ShiJie Li
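Abstract summary: Spatial reasoning from monocular images is essential for autonomous driving, but current Vision-Language Models (VLMs) struggle with fine-grained geometric perception. This paper proposes a perception-aware multimodal reasoning framework that equips VLMs with explicit object-centric grounding ability.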
Reference score (independently computed attention level): 57.42071289037214
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Spatial reasoning from monocular images is essential for autonomous driving, yet current Vision-Language Models (VLMs) still struggle with fine-grained geometric perception, particularly under large scale variation and ambiguous object appearance. We propose a simple yet effective perception-aware multimodal reasoning framework that equips VLMs with explicit object-centric grounding ability. Instead of relying on textual bounding-box outputs, each referred object is represented using all Visual Reference Tokens (VRTs) within its spatial extent, enabling visual evidence and textual reasoning to be processed jointly in a unified token space. To further strengthen cross-modal interaction, we construct a Multimodal Chain-of-Thought (MM-CoT) dataset that injects aligned visual and textual reasoning signals. A deterministic ordering strategy is introduced to make supervision over inherently unordered VRT sets fully compatible with the VLM's autoregressive next-token prediction. With only standard supervised fine-tuning, our method achieves substantial improvements on the SURDS benchmark, outperforming previous approaches - including those using RL-based post-training - by a large margin across both single-object and multi-object tasks. These results demonstrate that accurate perception and multimodal reasoning are mutually reinforcing, and together form the key to robust spatial understanding in challenging monocular driving scenarios.
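Abstract (reference translation): Spatial reasoning from monocular images is essential for autonomous driving, yet current Vision-Language Models (VLMs) struggle with fine-grained geometric perception, particularly under large scale variation and ambiguous object appearance. This paper proposes a simple yet effective perception-aware multimodal reasoning framework that gives VLMs an explicit object-centric grounding ability. Instead of relying on textual bounding-box outputs, each referred object is represented using all Visual Reference Tokens (VRTs) within its spatial extent, so that visual evidence and textual reasoning can be processed jointly in a unified token space. A Multimodal Chain-of-Thought (MM-CoT) dataset is constructed to inject aligned visual and textual reasoning signals. A deterministic ordering strategy is introduced so that supervision over inherently unordered VRT sets is fully compatible with the VLM's autoregressive next-token prediction. With only standard supervised fine-tuning, the method achieves substantial improvements on the SURDS benchmark, outperforming previous approaches, including those using RL-based post-training, by a large margin on both single-object and multi-object tasks. These results show that accurate perception and multimodal reasoning are mutually reinforcing and together form the key to robust spatial understanding in challenging monocular driving scenarios.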
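
The abstract does not say how the VRTs of an object are selected or which deterministic order is used. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes a ViT-style square patch grid, treats every patch overlapped by the object's bounding box as one of its VRTs, and uses row-major (raster-scan) order as the deterministic ordering. The function names, the patch size of 14, and the max-exclusive box convention are all hypothetical.

```python
from math import ceil, floor


def box_to_vrt_indices(box, image_size, patch_size=14):
    """Return the patch-grid indices overlapped by a pixel-space box.

    Assumptions (not from the paper): square patches of `patch_size`
    pixels, box given as (x_min, y_min, x_max, y_max) with exclusive
    maxima, and patch index = row * number_of_columns + column.
    """
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    cols = ceil(width / patch_size)
    rows = ceil(height / patch_size)

    # Convert pixel coordinates to patch coordinates, clamped to the grid.
    c0 = max(0, floor(x_min / patch_size))
    r0 = max(0, floor(y_min / patch_size))
    c1 = min(cols, ceil(x_max / patch_size))
    r1 = min(rows, ceil(y_max / patch_size))

    # Iterating rows first, then columns, already yields raster order.
    return [r * cols + c for r in range(r0, r1) for c in range(c0, c1)]


def order_vrts(indices):
    """Map an unordered set of patch indices to one canonical sequence.

    Sorting ascending is equivalent to raster-scan order under the
    indexing above, so the same set always yields the same target
    sequence for next-token prediction.
    """
    return sorted(set(indices))


if __name__ == "__main__":
    # A 56x56 image with 14-pixel patches forms a 4x4 grid (indices 0..15).
    print(box_to_vrt_indices((10, 10, 30, 30), (56, 56)))
    print(order_vrts([9, 1, 4, 1]))
```

Any fixed total order over patch indices would make the supervision target unique; raster order is chosen here only because it matches how patch tokens are usually flattened.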

Related papers