Fugu-MT Paper Translation (Abstract): PRISM: Planning and Reasoning with Intent in Simulated Embodied Environments

Paper overview: PRISM: Planning and Reasoning with Intent in Simulated Embodied Environments

arxiv url: http://arxiv.org/abs/2605.11534v1
Date: Tue, 12 May 2026 04:59:47 GMT
Status: translation complete
Last updated in system: 2026-05-13 21:48:56.593601
Title: PRISM: Planning and Reasoning with Intent in Simulated Embodied Environments
Title (reference translation): PRISM: Planning and Reasoning with Intent in Simulated Embodied Environments
Authors: Yunn Kang Lim, Pengzhan Sun, Ziyi Bai, Xun Xu, Angela Yao, Xulei Yang, Shijie Li
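Abstract summary: Built on five multi-room apartments, PRISM organizes 300 human-verified tasks into three capability tiers. PRISM exposes an agent-agnostic executable action API so that arbitrary agents can be evaluated end-to-end.
Reference score (independently computed attention score): 59.07829883257003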
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: When an LLM-based embodied agent fails at a household task, the culprit could be misidentified objects, forgotten sub-goals, or poor action sequencing -- yet existing benchmarks report only a single success rate, making it impossible to tell which cognitive module is responsible. We present PRISM, a diagnostic benchmark that reframes this problem: rather than asking only \textit{did the agent succeed?}, PRISM asks \textit{which capability is most likely responsible for failure?} Built on five photorealistic multi-room apartments (4--8 rooms each), PRISM structures 300 human-verified tasks into three capability tiers -- \textit{Basic Ability}, \textit{Reasoning Ability}, and \textit{Long-horizon Ability} -- that isolate perception-to-action grounding, implicit intent resolution, and sustained multi-step coordination respectively. PRISM exposes an agent-agnostic executable action API that allows arbitrary agents: LLM agents, VLM agents, symbolic planners, RL policies, and hybrid systems, to be evaluated end-to-end under the same benchmark protocol. To support deeper diagnosis, optional probes for perception, memory, and planning can be adopted, replaced, or bypassed entirely, enabling controlled component-level analysis when desired. Experiments on seven contemporary LLMs establish a clear hierarchy: explicit spatial grounding is not the dominant failure source under oracle perception, implicit intent resolution is a significant bottleneck for all model families, and long-horizon coordination exposes a stark capability cliff -- lightweight models collapse to as low as 20.0\% success while simultaneously consuming more tokens than their frontier counterparts, a signature of compensatory over-reasoning rather than genuine planning capability. Project page: \href{https://sj-li.com/PROJ/PRISM}{link}.
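Abstract (reference translation): When an LLM-based embodied agent fails at a household task, the cause may be misidentified objects, forgotten sub-goals, or poor action sequencing, yet existing benchmarks report only a single success rate, so the responsible cognitive module cannot be identified. PRISM is a diagnostic benchmark that reframes this problem: instead of asking only \textit{did the agent succeed?}, it asks \textit{which capability is most likely responsible for failure?} Built on five photorealistic multi-room apartments (4 to 8 rooms each), PRISM organizes 300 human-verified tasks into three capability tiers, \textit{Basic Ability}, \textit{Reasoning Ability}, and \textit{Long-horizon Ability}, which isolate perception-to-action grounding, implicit intent resolution, and sustained multi-step coordination, respectively. PRISM exposes an agent-agnostic executable action API, so arbitrary agents (LLM agents, VLM agents, symbolic planners, RL policies, and hybrid systems) can be evaluated end-to-end under the same benchmark protocol. To support deeper diagnosis, optional probes for perception, memory, and planning can be adopted, replaced, or bypassed entirely, enabling controlled component-level analysis when desired. Experiments on seven contemporary LLMs establish a clear hierarchy: explicit spatial grounding is not the dominant failure source under oracle perception; implicit intent resolution is a significant bottleneck for all model families; and long-horizon coordination exposes a stark capability cliff, where lightweight models collapse to as low as 20.0\% success while consuming more tokens than their frontier counterparts, a signature of compensatory over-reasoning rather than genuine planning capability. Project page: \href{https://sj-li.com/PROJ/PRISM}{link}.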
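
The abstract's diagnostic idea is to report success per capability tier instead of one aggregate rate. The sketch below illustrates that idea only and is not PRISM's actual API or evaluation code: the `EpisodeResult` record and the `success_by_tier` helper are hypothetical names. Only the three tier names are taken from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable

# Tier names come from the abstract; everything else here is hypothetical.
TIERS = ("Basic Ability", "Reasoning Ability", "Long-horizon Ability")


@dataclass(frozen=True)
class EpisodeResult:
    """Outcome of one task episode (illustrative record, not PRISM's schema)."""
    tier: str
    success: bool


def success_by_tier(results: Iterable[EpisodeResult]) -> dict[str, float]:
    """Return the success rate for each tier that has at least one episode."""
    counts = defaultdict(lambda: [0, 0])  # tier -> [successes, total episodes]
    for r in results:
        if r.tier not in TIERS:
            raise ValueError(f"unknown tier: {r.tier!r}")
        counts[r.tier][0] += int(r.success)
        counts[r.tier][1] += 1
    return {tier: s / n for tier, (s, n) in counts.items()}
```

Calling `success_by_tier` on a list of episode records returns one rate per tier, so a drop confined to one tier (for example, long-horizon tasks) stays visible instead of being averaged away. Any records passed in for experimentation are placeholders, not results from the paper.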
