Fugu-MT paper translation (abstract): ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

Paper overview: ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

arxiv url: http://arxiv.org/abs/2605.18746v1
Date: Mon, 18 May 2026 17:59:02 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:50.228883
Title: ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop
Title (reference translation): ESI-Bench: Toward embodied spatial intelligence that closes the perception-action loop
Authors: Yining Hong, Jiageng Liu, Han Yin, Manling Li, Leonidas Guibas, Li Fei-Fei, Jiajun Wu, Yejin Choi
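Abstract summary: We develop a benchmark for embodied spatial intelligence, built on OmniGibson, that spans 10 task categories and 29 subcategories. Experiments with state-of-the-art MLLMs show that active exploration outperforms passive counterparts. Unlike humans, who seek falsifying viewpoints and revise beliefs under contradiction, models commit prematurely with high confidence regardless of evidence quality.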
Reference score (independently computed attention score): 55.468404995694975
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations, and reason about how observations vary as a function of action. Rather than passively processing what is seen, they actively uncover what is unseen - occluded structure, dynamics, containment, and functionality that cannot be resolved from passive sensing alone. We move beyond prior formulations of spatial intelligence that assume oracle observations by recasting the observer as an actor. We introduce ESI-BENCH, a comprehensive benchmark for embodied spatial intelligence spanning 10 task categories and 29 subcategories built on OmniGibson, grounded in Spelke's core knowledge systems. Agents must decide what abilities to deploy - perception, locomotion, and manipulation - and how to sequence them to actively accumulate task-relevant evidence. We conduct extensive experiments on state-of-the-art MLLMs and find that active exploration substantially outperforms passive counterparts, with agents spontaneously discovering emergent spatial strategies without explicit instructions, while random multi-view often adds noise rather than signal despite consuming far more images. Most failures stem not from weak perception but from action blindness: poor action choices lead to poor observations, which in turn drive cascading errors. While explicit 3D grounding stabilizes reasoning on depth-sensitive tasks, imperfect 3D representation proves more harmful than 2D baselines by distorting spatial relations. Human studies further reveal that unlike humans who seek falsifying viewpoints and revise beliefs under contradiction, models commit prematurely with high confidence regardless of evidence quality, exposing a metacognitive gap that neither better perception nor more embodied interaction alone can close.
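Abstract (reference translation): Agents act to acquire observations and reason about how observations vary as a function of action. Rather than passively processing what is seen, they actively uncover occluded structure, dynamics, containment, and functionality that cannot be resolved from passive sensing alone. We move beyond prior formulations of spatial intelligence that assume oracle observations by recasting the observer as an actor. We introduce ESI-BENCH, a comprehensive benchmark for spatial intelligence spanning 10 task categories and 29 subcategories. Agents must decide which abilities to deploy - perception, locomotion, and manipulation - and how to sequence them to actively accumulate task-relevant evidence. We conduct extensive experiments on state-of-the-art MLLMs and find that active exploration outperforms passive counterparts, with agents spontaneously discovering emergent spatial strategies without explicit instructions, whereas random multi-view often adds noise rather than signal despite consuming far more images. Most failures stem not from weak perception but from action blindness: poor action choices lead to poor observations, which in turn cause cascading errors. Imperfect 3D representations prove more harmful than 2D baselines because they distort spatial relations. Human studies show that, unlike humans who seek falsifying viewpoints and revise beliefs under contradiction, models commit prematurely with high confidence regardless of evidence quality, exposing a metacognitive gap that neither better perception nor more embodied interaction can close.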

Related papers