Fugu-MT Paper Translation (Summary): Advantage-Guided Diffusion for Model-Based Reinforcement Learning

Paper summary: Advantage-Guided Diffusion for Model-Based Reinforcement Learning

arxiv url: http://arxiv.org/abs/2604.09035v1
Date: Fri, 10 Apr 2026 06:53:25 GMT
Status: Translation complete
System update date: 2026-04-13 17:57:53.734047
Title: Advantage-Guided Diffusion for Model-Based Reinforcement Learning
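Title (reference translation): Advantage-Guided Diffusion for Model-Based Reinforcement Learning
Authors: Daniele Foffano, Arvid Eriksson, David Broman, Karl H. Johansson, Alexandre Proutiere
Abstract summary: Advantage-Guided Diffusion for MBRL steers the reverse diffusion process using the agent's advantage estimates. The paper shows that trajectories generated by AGD-MBRL follow an improved policy compared with an unguided diffusion model.
Reference score (independently computed attention score): 38.18017161791996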
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Model-based reinforcement learning (MBRL) with autoregressive world models suffers from compounding errors, whereas diffusion world models mitigate this by generating trajectory segments jointly. However, existing diffusion guides are either policy-only, discarding value information, or reward-based, which becomes myopic when the diffusion horizon is short. We introduce Advantage-Guided Diffusion for MBRL (AGD-MBRL), which steers the reverse diffusion process using the agent's advantage estimates so that sampling concentrates on trajectories expected to yield higher long-term return beyond the generated window. We develop two guides: (i) Sigmoid Advantage Guidance (SAG) and (ii) Exponential Advantage Guidance (EAG). We prove that a diffusion model guided through SAG or EAG allows us to perform reweighted sampling of trajectories with weights increasing in state-action advantage, implying policy improvement under standard assumptions. Additionally, we show that the trajectories generated from AGD-MBRL follow an improved policy (that is, with higher value) compared to an unguided diffusion model. AGD integrates seamlessly with PolyGRAD-style architectures by guiding the state components while leaving action generation policy-conditioned, and requires no change to the diffusion training objective. On MuJoCo control tasks (HalfCheetah, Hopper, Walker2D and Reacher), AGD-MBRL improves sample efficiency and final return over PolyGRAD, an online Diffuser-style reward guide, and model-free baselines (PPO/TRPO), in some cases by a margin of 2x. These results show that advantage-aware guidance is a simple, effective remedy for short-horizon myopia in diffusion-model MBRL.
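Abstract (reference translation): Model-based reinforcement learning (MBRL) with autoregressive world models suffers from compounding errors, whereas diffusion world models mitigate this by generating trajectory segments jointly. However, existing diffusion guides are either policy-only, which discards value information, or reward-based, which becomes myopic when the diffusion horizon is short. This paper introduces Advantage-Guided Diffusion for MBRL (AGD-MBRL), which steers the reverse diffusion process using the agent's advantage estimates so that sampling concentrates on trajectories expected to yield higher long-term return beyond the generated window. The authors develop two guides: (i) Sigmoid Advantage Guidance (SAG) and (ii) Exponential Advantage Guidance (EAG). They prove that a diffusion model guided through SAG or EAG performs reweighted sampling of trajectories with weights that increase in the state-action advantage, which implies policy improvement under standard assumptions. They further show that trajectories generated by AGD-MBRL follow an improved policy (that is, one with higher value) compared with an unguided diffusion model. AGD integrates seamlessly with PolyGRAD-style architectures by guiding the state components while leaving action generation policy-conditioned, and it requires no change to the diffusion training objective. On MuJoCo control tasks (HalfCheetah, Hopper, Walker2D, and Reacher), AGD-MBRL improves sample efficiency and final return over PolyGRAD, an online Diffuser-style reward guide, and model-free baselines (PPO/TRPO), in some cases by a margin of 2x. These results indicate that advantage-aware guidance is a simple and effective remedy for short-horizon myopia in diffusion-model MBRL.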
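
The reweighting idea in the abstract can be illustrated with a small sketch. This is not the paper's method: the abstract does not give the exact form of SAG or EAG, and AGD-MBRL applies guidance inside the reverse diffusion process rather than reweighting a finite set of finished samples. The sketch below assumes sigmoid and exponential weight functions of the advantage, a hypothetical temperature `beta`, and made-up advantage values, and only shows that weights increasing in the advantage shift sampling toward higher-advantage trajectories.

```python
import math

def sigmoid_weight(advantage, beta=1.0):
    """Assumed SAG-style weight: sigmoid(beta * A), bounded in (0, 1)."""
    x = beta * advantage
    # Numerically stable sigmoid for both signs of x.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def exponential_weight(advantage, beta=1.0):
    """Assumed EAG-style weight: exp(beta * A), unbounded above."""
    return math.exp(beta * advantage)

def reweight(advantages, weight_fn, beta=1.0):
    """Turn per-trajectory weights into a normalized sampling distribution."""
    weights = [weight_fn(a, beta) for a in advantages]
    total = sum(weights)
    return [w / total for w in weights]

def expected_advantage(advantages, probs):
    """Mean advantage of a trajectory drawn according to probs."""
    return sum(a * p for a, p in zip(advantages, probs))

# Toy advantage estimates for four candidate trajectories (made-up numbers).
advantages = [-1.0, 0.0, 0.5, 2.0]
uniform = [1.0 / len(advantages)] * len(advantages)  # stands in for unguided sampling
sag_probs = reweight(advantages, sigmoid_weight)
eag_probs = reweight(advantages, exponential_weight)

print("uniform:", expected_advantage(advantages, uniform))
print("sigmoid:", expected_advantage(advantages, sag_probs))
print("exponential:", expected_advantage(advantages, eag_probs))
```

Because both weight functions increase monotonically in the advantage, the reweighted distributions have a higher expected advantage than uniform sampling. Under these assumed forms, the bounded sigmoid weights shift probability mass more conservatively than the exponential weights do.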

Related papers