Fugu-MT Paper Translation (Summary): EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning

Paper summary: EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning

arxiv url: http://arxiv.org/abs/2603.09731v2
Date: Thu, 12 Mar 2026 12:40:05 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:25.459246
Title: EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning
Title (reference translation): EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning
Authors: Chengjun Yu, Xuhan Zhu, Chaoqun Du, Pengfei Yu, Wei Zhai, Yang Cao, Zheng-Jun Zha
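Abstract summary: This work examines whether multimodal large language models can reliably reason about the long-term physical consequences of actions from an egocentric viewpoint. EXPLORE-Bench is a benchmark curated from real first-person videos spanning diverse scenarios. Experiments on proprietary and open-source MLLMs reveal a significant performance gap relative to humans.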
Reference score (independently computed attention score): 63.010793398283134
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal large language models (MLLMs) are increasingly considered as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from an egocentric viewpoint. We study this gap through a new task, Egocentric Scene Prediction with LOng-horizon REasoning: given an initial-scene image and a sequence of atomic action descriptions, a model is asked to predict the final scene after all actions are executed. To enable systematic evaluation, we introduce EXPLORE-Bench, a benchmark curated from real first-person videos spanning diverse scenarios. Each instance pairs long action sequences with structured final-scene annotations, including object categories, visual attributes, and inter-object relations, which supports fine-grained, quantitative assessment. Experiments on a range of proprietary and open-source MLLMs reveal a significant performance gap to humans, indicating that long-horizon egocentric reasoning remains a major challenge. We further analyze test-time scaling via stepwise reasoning and show that decomposing long action sequences can improve performance to some extent, while incurring non-trivial computational overhead. Overall, EXPLORE-Bench provides a principled testbed for measuring and advancing long-horizon reasoning for egocentric embodied perception.
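Abstract (reference translation): Multimodal large language models (MLLMs) are increasingly regarded as a foundation for embodied agents, but it is unclear whether they can reliably reason about the long-term physical consequences of actions from an egocentric viewpoint. We study this gap through a new task, Egocentric Scene Prediction with LOng-horizon REasoning: given an initial-scene image and a sequence of atomic action descriptions, the model is asked to predict the final scene after all actions have been executed. To enable systematic evaluation, we introduce EXPLORE-Bench, a benchmark curated from real first-person videos spanning diverse scenarios. Each instance pairs a long action sequence with structured final-scene annotations covering object categories, visual attributes, and inter-object relations, which supports fine-grained, quantitative assessment. Experiments on proprietary and open-source MLLMs reveal a significant performance gap relative to humans, indicating that long-horizon egocentric reasoning remains a major challenge. We further analyze test-time scaling via stepwise reasoning and show that decomposing long action sequences can improve performance to some extent, at the cost of non-trivial computational overhead. Overall, EXPLORE-Bench provides a principled testbed for measuring and advancing long-horizon reasoning for egocentric embodied perception.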


Related papers