Fugu-MT paper translation (abstract): SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

Paper overview: SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

arxiv url: http://arxiv.org/abs/2604.22409v1
Date: Fri, 24 Apr 2026 10:06:41 GMT
Status: Translation complete
Last updated in system: 2026-04-27 15:36:26.419551
Title: SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments
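Title (reference translation): SpaMEM: Benchmarking Dynamic Spatial Reasoning through the Integration of Perception and Memory in Embodied Environments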
Authors: Chih-Ting Liao, Xi Xiao, Chunlei Meng, Zhangquan Chen, Yitong Qiao, Weilin Zhou, Tianyang Wang, Xu Zheng, Xin Cao
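Abstract summary: This paper introduces SpaMEM, a large-scale diagnostic benchmark that isolates the mechanics of spatial belief evolution. SpaMEM is built on a physically grounded dataset with 10,601,392 high-fidelity images across four modalities. Embodied spatial reasoning is formalized as a three-level hierarchy with 15 diagnostic tasks.
Reference score (independently computed attention measure): 19.997461654311994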
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal large language models (MLLMs) have advanced static visual-spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric observations under environmental change. We introduce SpaMEM (Spatial Memory from Action Sequences), a large-scale diagnostic benchmark that isolates the mechanics of spatial belief evolution via action-conditioned scene transformations (spawn, place, remove) over long interaction horizons. SpaMEM is built on a physically grounded dataset with 10,601,392 high-fidelity images across four modalities (RGB, depth, instance, semantic segmentation), collected from 25,000+ interaction sequences in 1,000 procedurally generated houses. We formalize embodied spatial reasoning as a three-level hierarchy with 15 diagnostic tasks: Level 1 measures atomic spatial perception from single observations; Level 2 probes temporal reasoning with oracle textual state histories to factor out perceptual noise; and Level 3 requires end-to-end belief maintenance from raw visual streams under the same task dimensions. We further evaluate both short-term (step-wise) updates and long-term (episodic) reconstruction. Benchmarking representative open-source VLM families reveals a consistent stacked bottleneck: coordinate-consistent grounding remains a hard ceiling, and the sharp collapse from Level 2 to Level 3 exposes a pronounced symbolic scaffolding dependency, where models succeed with text-based bookkeeping but struggle to sustain robust visual memory. SpaMEM provides a granular diagnostic standard and motivates explicit mechanisms for state representation, belief revision, and long-horizon episodic integration.
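Abstract (reference translation): Multimodal large language models (MLLMs) have advanced static visual-spatial reasoning, but they often fail to maintain long-horizon spatial coherence in embodied settings, where beliefs must be continuously revised from egocentric observations under environmental change. SpaMEM (Spatial Memory from Action Sequences) is a large-scale diagnostic benchmark that isolates the mechanics of spatial belief evolution through action-conditioned scene transformations (spawn, place, remove) over long interaction horizons. SpaMEM is built on a physically grounded dataset of 10,601,392 high-fidelity images across four modalities (RGB, depth, instance, semantic segmentation), collected from more than 25,000 interaction sequences in 1,000 procedurally generated houses. Level 1 measures atomic spatial perception from single observations; Level 2 probes temporal reasoning with oracle textual state histories to factor out perceptual noise; and Level 3 requires end-to-end belief maintenance from raw visual streams under the same task dimensions. The evaluation further covers both short-term (step-wise) updates and long-term (episodic) reconstruction. Coordinate-consistent grounding remains a hard ceiling, and the sharp collapse from Level 2 to Level 3 reveals a pronounced symbolic scaffolding dependency, in which models succeed with text-based bookkeeping but struggle to sustain robust visual memory. SpaMEM provides a granular diagnostic standard and motivates explicit mechanisms for state representation, belief revision, and long-horizon episodic integration.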

Related papers