Fugu-MT Paper Translation (Abstract): How Good are Foundation Models in Step-by-Step Embodied Reasoning?

Paper overview: How Good are Foundation Models in Step-by-Step Embodied Reasoning?

arxiv url: http://arxiv.org/abs/2509.15293v1
Date: Thu, 18 Sep 2025 17:56:30 GMT
Status: Translation complete
Last updated in system: 2025-09-22 18:18:10.852928
Title: How Good are Foundation Models in Step-by-Step Embodied Reasoning?
Title (reference translation): How good are foundation models at step-by-step embodied reasoning?
Authors: Dinura Dissanayake, Ahmed Heakl, Omkar Thawakar, Noor Ahsan, Ritesh Thawkar, Ketan More, Jean Lahoud, Rao Anwer, Hisham Cholakkal, Ivan Laptev, Fahad Shahbaz Khan, Salman Khan
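Abstract summary: Embodied agents must make decisions that are safe, spatially coherent, and grounded in context. Recent advances in large multimodal models have shown promising capabilities in visual understanding and language generation. Our benchmark includes over 1.1k samples with detailed step-by-step reasoning across 10 tasks and 8 embodiments.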
Reference score (independently computed attention score): 79.15268080287505
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Embodied agents operating in the physical world must make decisions that are not only effective but also safe, spatially coherent, and grounded in context. While recent advances in large multimodal models (LMMs) have shown promising capabilities in visual understanding and language generation, their ability to perform structured reasoning for real-world embodied tasks remains underexplored. In this work, we aim to understand how well foundation models can perform step-by-step reasoning in embodied environments. To this end, we propose the Foundation Model Embodied Reasoning (FoMER) benchmark, designed to evaluate the reasoning capabilities of LMMs in complex embodied decision-making scenarios. Our benchmark spans a diverse set of tasks that require agents to interpret multimodal observations, reason about physical constraints and safety, and generate valid next actions in natural language. We present (i) a large-scale, curated suite of embodied reasoning tasks, (ii) a novel evaluation framework that disentangles perceptual grounding from action reasoning, and (iii) empirical analysis of several leading LMMs under this setting. Our benchmark includes over 1.1k samples with detailed step-by-step reasoning across 10 tasks and 8 embodiments, covering three different robot types. Our results highlight both the potential and current limitations of LMMs in embodied reasoning, pointing towards key challenges and opportunities for future research in robot intelligence. Our data and code will be made publicly available.
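Abstract (reference translation): Embodied agents acting in the physical world must make decisions that are not only effective but also safe, spatially coherent, and grounded in context. Recent advances in large multimodal models (LMMs) have shown promising capabilities in visual understanding and language generation, but their ability to carry out structured reasoning on real-world embodied tasks remains underexplored. This work aims to understand how well foundation models can perform step-by-step reasoning in embodied environments. To that end, we propose the Foundation Model Embodied Reasoning (FoMER) benchmark for evaluating the reasoning capabilities of LMMs in complex embodied decision-making scenarios. The benchmark spans a diverse set of tasks that require agents to interpret multimodal observations, reason about physical constraints and safety, and generate valid next actions in natural language. We present (i) a large-scale, curated suite of embodied reasoning tasks, (ii) a novel evaluation framework that disentangles perceptual grounding from action reasoning, and (iii) an empirical analysis of several leading LMMs under this setting. The benchmark includes over 1.1k samples with detailed step-by-step reasoning across 10 tasks and 8 embodiments, covering three different robot types. The results highlight both the potential and the current limitations of LMMs in embodied reasoning, and point to key challenges and opportunities for future research in robot intelligence. Our data and code will be made publicly available.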

Related papers