Fugu-MT Paper Translation (Abstract): Reward as An Agent for Embodied World Models

Paper overview: Reward as An Agent for Embodied World Models

arxiv url: http://arxiv.org/abs/2606.19990v1
Date: Thu, 18 Jun 2026 09:29:30 GMT
Status: Translation complete
Last updated in system: 2026-06-19 18:23:39.766144
Title: Reward as An Agent for Embodied World Models
Title (reference translation): Reward as an Agent for Embodied World Models
Authors: Pu Li, Zhigang Lin, Qiang Wu, Yongxuan Lv, Fei Wang, Shan You,
Abstract summary: We argue that the core limitation is not exploration itself, but the lack of reliable verification strategies to support broader exploration. We introduce Reward as an Agent, an agentic reward framework that actively evaluates generated behaviors to provide robust reward signals. We also introduce Dynamic-Aware Rollout Diversification through DynDiff-GRPO.
Reference score (independently computed attention level): 26.825141454200686
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While RL has become a promising tool for refining world models, existing methods largely rely on conservative rollouts near the training distribution, limiting exploration, behavioral diversity, and richer dynamic discovery. In this work, we challenge this conservative paradigm. We argue that the core limitation is not exploration itself, but the lack of reliable verification strategies to support broader exploration. Without reliable verification, expanded exploration becomes highly susceptible to reward hacking, where policies exploit imperfect rewards without achieving genuine improvement. To evaluate this motivation, we instantiate our method in embodied world models, where physical plausibility and task completion provide a rigorous testbed for scalable RL under complex dynamics. On the verification side, we introduce Reward as an Agent, an agentic reward framework that actively evaluates generated behaviors to provide robust reward signals and mitigate reward hacking under distribution shifts. On the exploration side, we introduce Dynamic-Aware Rollout Diversification through DynDiff-GRPO, which explicitly expands action-space exploration to diversify trajectories, broaden state-action coverage, and encourage richer embodied behaviors beyond conservative rollout regimes. By unifying Reward as an Agent with DynDiff-GRPO, we enable RL on a more reliable reward foundation with substantially diversified sampling, effectively mitigating reward hacking while yielding significant accuracy gains across multiple open-source world models, thereby demonstrating that broader exploration can scale successfully when grounded in robust verification.
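Abstract (reference translation): RL has become a promising tool for refining world models, but existing methods rely largely on conservative rollouts near the training distribution, which limits exploration, behavioral diversity, and the discovery of richer dynamics. In this work, we challenge this conservative paradigm. We argue that the core limitation is not exploration itself, but the lack of reliable verification strategies to support broader exploration. Without reliable verification, expanded exploration becomes highly susceptible to reward hacking, in which policies exploit imperfect rewards without achieving genuine improvement. To evaluate this motivation, we instantiate our method in embodied world models, where physical plausibility and task completion provide a rigorous testbed for scalable RL under complex dynamics. On the verification side, we introduce Reward as an Agent, an agentic reward framework that actively evaluates generated behaviors to provide robust reward signals. On the exploration side, we introduce Dynamic-Aware Rollout Diversification through DynDiff-GRPO, which explicitly expands action-space exploration to diversify trajectories, broaden state-action coverage, and encourage richer embodied behaviors beyond conservative rollout regimes. By unifying Reward as an Agent with DynDiff-GRPO, RL runs on a more reliable reward foundation with substantially diversified sampling, effectively mitigating reward hacking while achieving significant accuracy gains across multiple open-source world models, demonstrating that broader exploration can succeed when grounded in robust verification.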

Related papers