Fugu-MT Paper Translation (Abstract): Causal Reward World Models: Zero-shot Reward Design for Automated Skill Generation

Paper abstract: Causal Reward World Models: Zero-shot Reward Design for Automated Skill Generation

arxiv url: http://arxiv.org/abs/2606.23280v1
Date: Mon, 22 Jun 2026 12:57:17 GMT
Status: Translation complete
System update date: 2026-06-24 22:54:05.457574
Title: Causal Reward World Models: Zero-shot Reward Design for Automated Skill Generation
Authors: Yang Yang, Yuchuang Tong, Zhengtao Zhang, Xu Ding, Ning Yang, Yifan Zhang, Haipeng Li, Kehu Yang, Miao Xin
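Abstract summary: Automated Reward Design (ARD) aims to replace manual reward engineering in reinforcement learning with language-driven reward function synthesis. Existing approaches based on large language models (LLMs) are inherently correlation-driven, relying on iterative environmental feedback to refine reward hypotheses for each specific task. The authors propose the Causal Reward World Model (CRWM), which explicitly models the causal relationships between candidate reward components and task-targeted physical variables through offline pre-training on multi-task interaction data.
Reference score (attention score computed independently by the site): 23.371552518874807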
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automated Reward Design (ARD) aims to replace manual reward engineering in reinforcement learning with language-driven reward function synthesis. However, existing approaches based on large language models (LLMs) remain inherently correlation-driven, relying on iterative environmental feedback to refine reward hypotheses for each specific task. This paradigm not only results in inefficient reasoning but also makes LLMs susceptible to semantically plausible yet causally spurious reward components, leading to ineffective optimization. To address these limitations, we propose the Causal Reward World Model (CRWM), which explicitly models the causal topological relationships between candidate reward components and task-targeted physical variables through offline pre-training on multi-task interaction data. Based on a coarse-to-fine pre-training strategy, we introduce a joint optimization module that integrates Explicit Mechanism Decoupling with Confidence-Aware Soft Fusion to refine coarse structural priors using micro-level trajectories, thereby constructing a robust and interpretable causal skeleton. During inference, LLMs leverage CRWM as a task-irrelevant causal prior to constrain the reward generation, enabling zero-shot reward function design. Our work opens up a new white-box paradigm for the ARD problem. Extensive experiments on complex continuous control benchmarks demonstrate that CRWM generates executable reward functions without feedback-driven reward refinement, significantly reducing the design latency for acquiring new robotic skills while matching or surpassing state-of-the-art performance, and further exhibits strong generalization capabilities across unseen tasks and diverse robotic embodiments.
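The inference step described in the abstract, where a pre-trained causal prior constrains which reward components an LLM may use, can be illustrated with a toy sketch. This is not the authors' implementation: the `CausalEdge` structure, the component and variable names, the confidence values, and the 0.5 threshold are all hypothetical, chosen only to show the idea of filtering candidate reward components by their learned causal link to a task-targeted variable before any prompt is built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CausalEdge:
    """One edge of a (hypothetical) causal skeleton."""
    component: str     # candidate reward component (illustrative name)
    target: str        # task-targeted physical variable (illustrative name)
    confidence: float  # assumed edge confidence in [0, 1]


def select_components(edges, target, threshold=0.5):
    """Keep components whose edge into `target` meets the threshold.

    Results are ordered from most to least confident.
    """
    kept = [e for e in edges if e.target == target and e.confidence >= threshold]
    kept.sort(key=lambda e: e.confidence, reverse=True)
    return [e.component for e in kept]


def build_prompt_constraint(components):
    """Turn the surviving components into a constraint line for an LLM prompt."""
    return "Use only these reward components: " + ", ".join(components)


# Toy skeleton with made-up values, standing in for an offline pre-trained prior.
EDGES = [
    CausalEdge("distance_to_handle", "door_angle", 0.9),
    CausalEdge("gripper_height", "door_angle", 0.2),
    CausalEdge("joint_velocity_penalty", "door_angle", 0.6),
    CausalEdge("distance_to_handle", "cube_height", 0.1),
]

if __name__ == "__main__":
    chosen = select_components(EDGES, "door_angle")
    print(build_prompt_constraint(chosen))
```

In this sketch a semantically plausible but weakly supported component (`gripper_height`) is dropped before reward generation, which is the role the abstract assigns to the causal prior; how CRWM actually represents and queries its causal skeleton is not specified in the abstract.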
