Fugu-MT Paper Translation (Abstract): World Action Models: The Next Frontier in Embodied AI

Paper overview: World Action Models: The Next Frontier in Embodied AI

arxiv url: http://arxiv.org/abs/2605.12090v1
Date: Tue, 12 May 2026 13:10:52 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:56.876471
Title: World Action Models: The Next Frontier in Embodied AI
Authors: Siyin Wang, Junhao Shi, Zhaoyang Fu, Xinzhe He, Feihong Liu, Chenchen Yang, Yikang Zhou, Zhaoye Fei, Jingjing Gong, Jinlan Fu, Mike Zheng Shou, Xuanjing Huang, Xipeng Qiu, Yu-Gang Jiang
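Abstract summary: Vision-Language-Action (VLA) models achieve strong semantic generalization for embodied policy learning. However, they learn reactive observation-to-action mappings without explicitly modeling how the physical world evolves under intervention. A growing body of work addresses this limitation by integrating world models, that is, predictive models of environment dynamics, into the action generation pipeline.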
Reference score (independently computed attention metric): 123.5787299299832
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Vision-Language-Action (VLA) models have achieved strong semantic generalization for embodied policy learning, yet they learn reactive observation-to-action mappings without explicitly modeling how the physical world evolves under intervention. A growing body of work addresses this limitation by integrating world models, predictive models of environment dynamics, into the action generation pipeline. We term this emerging paradigm World Action Models (WAMs): embodied foundation models that unify predictive state modeling with action generation, targeting a joint distribution over future states and actions rather than actions alone. However, the literature remains fragmented across architectures, learning objectives, and application scenarios, lacking a unified conceptual framework. We formally define WAMs and disambiguate them from related concepts, and trace the foundations and early integration of VLA and world model research that gave rise to this paradigm. We organize existing methods into a structured taxonomy of Cascaded and Joint WAMs, with further subdivision by generation modality, conditioning mechanism, and action decoding strategy. We systematically analyze the data ecosystem fueling WAMs development, spanning robot teleoperation, portable human demonstrations, simulation, and internet-scale egocentric video, and synthesize emerging evaluation protocols organized around visual fidelity, physical commonsense, and action plausibility. Overall, this survey provides the first systematic account of the WAMs landscape, clarifies key architectural paradigms and their trade-offs, and identifies open challenges and future opportunities for this rapidly evolving field.

Related papers