Fugu-MT Paper Translation (Abstract): Why Do MLLMs Struggle with Spatial Understanding? A Systematic Analysis from Data to Architecture

Paper overview: Why Do MLLMs Struggle with Spatial Understanding? A Systematic Analysis from Data to Architecture

arxiv url: http://arxiv.org/abs/2509.02359v1
Date: Tue, 02 Sep 2025 14:22:43 GMT
Status: Translation complete
Last updated in system: 2025-09-04 15:17:04.061677
Title: Why Do MLLMs Struggle with Spatial Understanding? A Systematic Analysis from Data to Architecture
Authors: Wanyue Zhang, Yibin Huang, Yangbin Xu, JingJing Huang, Helu Zhi, Shuo Ren, Wang Xu, Jiajun Zhang
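Abstract summary: The paper systematically analyzes spatial understanding from both data and architectural perspectives. From the data perspective, spatial understanding performance converges quickly as the training data increases. From the architectural perspective, spatial understanding relies more heavily on the positional encoding within the visual encoder than on that within the language model.
Reference score (independently computed attention score): 16.15618237704827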
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Spatial understanding is essential for Multimodal Large Language Models (MLLMs) to support perception, reasoning, and planning in embodied environments. Despite recent progress, existing studies reveal that MLLMs still struggle with spatial understanding. However, existing research lacks a comprehensive and systematic evaluation of these limitations, often restricted to isolated scenarios, such as single-view or video. In this work, we present a systematic analysis of spatial understanding from both data and architectural perspectives across three representative scenarios: single-view, multi-view, and video. We propose a benchmark named MulSeT (Multi-view Spatial Understanding Tasks), and design a series of experiments to analyze the spatial reasoning capabilities of MLLMs. From the data perspective, the performance of spatial understanding converges quickly as the training data increases, and the upper bound is relatively low, especially for tasks that require spatial imagination. This indicates that merely expanding training data is insufficient to achieve satisfactory performance. From the architectural perspective, we find that spatial understanding relies more heavily on the positional encoding within the visual encoder than within the language model, in both cascaded and native MLLMs. Moreover, we explore reasoning injection and envision future improvements through architectural design to optimize spatial understanding. These insights shed light on the limitations of current MLLMs and suggest new directions for improving spatial reasoning capabilities through data scaling and architectural tuning.

Related papers