Fugu-MT Paper Translation (Abstract): Scaling Spatial Intelligence with Multimodal Foundation Models

Paper summary: Scaling Spatial Intelligence with Multimodal Foundation Models

arxiv url: http://arxiv.org/abs/2511.13719v1
Date: Mon, 17 Nov 2025 18:59:33 GMT
Status: Translation complete
Last updated in system: 2025-11-18 18:52:09.700955
Title: Scaling Spatial Intelligence with Multimodal Foundation Models
Title (reference translation): Scaling Spatial Intelligence with Multimodal Foundation Models
Authors: Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Xiangyu Fan, Hanming Deng, Lewei Lu, Liang Pan, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang
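Abstract summary: Multimodal foundation models exhibit surprising deficiencies in spatial intelligence. We take a principled approach to constructing high-performing and robust spatial intelligence. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks.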
Reference score (independently computed attention score): 90.32537840125009
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.
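Abstract (reference translation): Despite remarkable progress, multimodal foundation models still show surprising deficiencies in spatial intelligence. In this work, we examine scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, which is built upon established multimodal foundations including visual understanding models (Qwen3-VL and InternVL3) and unified understanding and generation models (Bagel). We take a principled approach to building high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI shows unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). In addition, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by training on diverse data, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate potential downstream applications. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.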

List of related papers