Fugu-MT Paper Translation (Abstract): World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

Paper overview: World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

arxiv url: http://arxiv.org/abs/2605.19957v1
Date: Tue, 19 May 2026 15:10:27 GMT
Status: Translation complete
System update date: 2026-05-20 15:03:09.465229
Title: World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks
Authors: Zuyao Lin, Jianhui Zhang, Peidong Jia, Xiaoguang Zhao, Shanghang Zhang, Xingyu Chen
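Abstract summary: World-Ego Modeling is a new conceptual paradigm that decomposes future evolution into world and ego components. We instantiate this paradigm as the World-Ego Model (WEM), a unified embodied world model that couples an implicit separate world-ego planner with a cascade-parallel mixture-of-experts (CP-MoE) diffusion generator.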
Reference score (independently computed attention score): 62.389116510844445
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: World models are widely explored in embodied intelligence, yet they typically predict distinct evolutions of the world and the ego within a single stream, where the world captures persistent instruction-agnostic scene regularities and the ego captures robot-centric instruction-conditioned dynamics. This world-ego entanglement leads to a degradation in long-horizon embodied scenarios, particularly in hybrid tasks with interleaved navigation and manipulation behaviors. In this paper, we introduce World-Ego Modeling, a new conceptual paradigm that decomposes future evolution into world and ego components. We define the world-ego boundary from three perspectives, i.e., motion-, semantic-, and intention-based views, and analyze three disentanglement strategies with post-, pre-, and full disentanglement. Further, we instantiate this paradigm as the World-Ego Model (WEM), a unified embodied world model that couples an implicit separate world-ego planner with a cascade-parallel mixture-of-experts (CP-MoE) diffusion generator. To enable rigorous evaluation, we further construct HTEWorld, the first benchmark for long-horizon world modeling with hybrid navigation-manipulation tasks, providing 125K video clips (over 4.5M frames) with fine-grained action annotations and 300 multi-turn evaluation trajectories (over 2K instructions). Extensive experiments show that WEM achieves state-of-the-art performance on HTEWorld while remaining competitive on existing manipulation-only benchmarks.

Related papers