Fugu-MT Paper Translation (Summary): WAM-RL: World-Action Model Reinforcement Learning with Reconstruction Rewards and Online Video SFT

Paper summary: WAM-RL: World-Action Model Reinforcement Learning with Reconstruction Rewards and Online Video SFT

arxiv url: http://arxiv.org/abs/2606.17906v1
Date: Tue, 16 Jun 2026 13:29:12 GMT
Status: Translation complete
Last updated in system: 2026-06-17 17:15:32.450098
Title: WAM-RL: World-Action Model Reinforcement Learning with Reconstruction Rewards and Online Video SFT
Authors: Zezhong Qian, Xiaowei Chi, Yu Qi, Haozhan Li, Zhi Yang Chen, Shanghang Zhang
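Abstract summary: World-Action (WA) models show strong generalization ability and data efficiency. WAM-RL is a reinforcement learning framework that enables joint optimization of the world model and the action model. This work is the first to introduce reinforcement learning into the World-Action paradigm.
Reference score (independently computed attention score): 42.80852706784868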
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent World-Action (WA) models demonstrate strong generalization ability and data efficiency, but they typically rely on expert trajectories for training. This reliance limits their ability to acquire fine-grained manipulation skills beyond the demonstration distribution and prevents them from continuously improving through real-world interaction. To address these limitations, we propose WAM-RL, a reinforcement learning framework that enables joint optimization of the world model and the action model through online interaction with the environment. By allowing the two components to co-evolve, our approach enhances fine-grained control and adaptability. Specifically, a WA model consists of a world model and an actor. We design a tailored reinforcement learning method with hierarchical optimization to coordinate their improvement. On the methodological side, we systematically investigate the effects of applying reinforcement learning to the action model, as well as online training of the world model within an RL setting. Our experiments reveal a key insight: optimizing only the actor yields improvements on short-horizon tasks, but fails to provide significant gains on long-horizon tasks. In contrast, jointly optimizing both the world model and the actor is critical for achieving strong performance in long-horizon settings. Our work is the first to introduce reinforcement learning into the World-Action paradigm, and provides insights into how online optimization of both the action head and the world model impacts overall performance.


Related papers