Fugu-MT Paper Translation (Summary): On the Generalization Capacities of MLLMs for Spatial Intelligence

Paper summary: On the Generalization Capacities of MLLMs for Spatial Intelligence

arxiv url: http://arxiv.org/abs/2603.06704v1
Date: Thu, 05 Mar 2026 14:46:11 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:12.893823
Title: On the Generalization Capacities of MLLMs for Spatial Intelligence
Title (reference translation): On the Generalization Capabilities of MLLMs for Spatial Intelligence
Authors: Gongjie Zhang, Wenhao Li, Quanhao Qian, Jiuniu Wang, Deli Zhao, Shijian Lu, Ran Xu
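Abstract summary: We argue that RGB-only approaches are fundamentally flawed in their ability to generalize across cameras. We show that this leads MLLMs to overfit to the training camera distribution rather than learning true 3D geometric principles. We propose a camera-aware MLLM framework for spatial MLLMs.
Reference score (independently computed attention score): 72.21075026598761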
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Multimodal Large Language Models (MLLMs) that directly process RGB inputs for tasks like 3D localization and navigation have shown remarkable potential. However, we argue that these RGB-only approaches are fundamentally flawed in their ability to generalize across cameras. By ignoring camera parameters, they entangle an object's physical properties with the camera's perspective, creating an irresolvable ambiguity. We show this leads MLLMs to overfit to the training camera distribution, rather than learning true and generalizable 3D geometric principles. To address this, we propose a Camera-Aware MLLM framework for spatial MLLMs. It learns generalizable spatial reasoning by: (i) injecting camera intrinsics via a dense embedding that conditions each visual token; (ii) introducing a camera-aware data augmentation strategy that synthetically varies camera parameters, forcing the model to disentangle camera properties from scene content; and (iii) distilling geometric priors from a 3D vision foundation model. Extensive experiments demonstrate that camera-aware MLLMs substantially outperform their naive counterparts, particularly in cross-camera generalization tests on spatially-grounded tasks, indicating that camera-awareness is not only beneficial but also a prerequisite for robust and generalizable spatial intelligence in MLLMs.
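Abstract (reference translation): Multimodal large language models (MLLMs) that directly process RGB inputs for tasks such as 3D localization and navigation have shown remarkable potential. However, we argue that these RGB-only approaches are fundamentally flawed in their ability to generalize across cameras. By ignoring camera parameters, they entangle an object's physical properties with the camera's perspective, creating an irresolvable ambiguity. We show that this leads MLLMs to overfit to the training camera distribution rather than learn true 3D geometric principles. To address this, we propose a camera-aware MLLM framework for spatial MLLMs. It learns generalizable spatial reasoning by (i) injecting camera intrinsics through a dense embedding that conditions each visual token; (ii) introducing a camera-aware data augmentation strategy that synthetically varies camera parameters, forcing the model to disentangle camera properties from scene content; and (iii) distilling geometric priors from a 3D vision foundation model. Extensive experiments show that camera-aware MLLMs substantially outperform their naive counterparts, particularly in cross-camera generalization tests on spatially grounded tasks, indicating that camera awareness is not only beneficial but also a prerequisite for robust and generalizable spatial intelligence in MLLMs.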
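
The ambiguity the abstract describes can be illustrated with the standard pinhole camera model. The sketch below is not the authors' code: the function names, the numeric values, and the use of per-pixel viewing rays as a "dense" camera encoding are all assumptions chosen for illustration. It shows that two different cameras can render the same object at the same pixel size from different depths, and that resizing an image (one simple way to synthesize a new camera) rescales the intrinsics while leaving each pixel's viewing ray unchanged.

```python
import math


def projected_width_px(object_width_m, depth_m, focal_px):
    """Pinhole model: pixel width of a fronto-parallel object at a given depth."""
    return focal_px * object_width_m / depth_m


def scale_intrinsics(fx, fy, cx, cy, s):
    """Intrinsics after resizing the image by factor s.

    Approximation: ignores half-pixel conventions for the principal point.
    """
    return fx * s, fy * s, cx * s, cy * s


def ray_direction(u, v, fx, fy, cx, cy):
    """Unit viewing ray through pixel (u, v) in the camera frame.

    Hypothetical per-token camera encoding; the paper's embedding may differ.
    """
    x, y = (u - cx) / fx, (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)


# The same 0.5 m object seen by camera A (f = 500 px) at 2 m
# and by camera B (f = 1000 px) at 4 m has identical pixel width,
# so depth cannot be recovered from RGB alone.
w_a = projected_width_px(0.5, 2.0, 500.0)
w_b = projected_width_px(0.5, 4.0, 1000.0)

# Halving the image resolution halves the intrinsics, but the ray through
# the corresponding pixel stays the same.
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0
fx2, fy2, cx2, cy2 = scale_intrinsics(fx, fy, cx, cy, 0.5)
ray_full = ray_direction(100.0, 50.0, fx, fy, cx, cy)
ray_half = ray_direction(50.0, 25.0, fx2, fy2, cx2, cy2)
```

Because the viewing ray depends only on the pixel position and the intrinsics, supplying it alongside each visual token gives a model the information needed to tell the two cameras apart, which is the general idea behind conditioning on camera parameters.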
