Fugu-MT Paper Translation (Abstract): Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance

Paper abstract: Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance

arxiv url: http://arxiv.org/abs/2605.30056v1
Date: Thu, 28 May 2026 15:07:50 GMT
Status: translation complete
Last updated in system: 2026-05-30 00:00:30.958741
Title: Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance
Authors: Shutong Ding, Zejia Zhong, Zhongyi Wang, Ke Hu, Bikang Pan, Jingya Wang, Ye Shi
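Abstract summary: The authors propose CGPO (Critic-Guided diffusion Policy Optimization). CGPO steers action generation toward high-value regions defined by the critic network and uses the guided actions as regression objectives. The authors validate CGPO on 5 MuJoCo locomotion tasks and report state-of-the-art performance compared with existing diffusion-based RL methods.
Reference score (attention score computed independently by this site): 38.06932977050757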
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in reinforcement learning (RL) have achieved great success by leveraging the multimodality and exploration capability of diffusion policies. Among these approaches, one representative branch focuses on sampling-based policy optimization. This design gives the diffusion model better exploration capability, particularly at the beginning of training, but suffers from low exploitation of Q-value information, resulting in slow policy convergence. Another branch focuses on gradient-based policy optimization, which fully exploits the gradient of the Q function yet tends to collapse into a unimodal policy with low diversity. To address these issues, we propose CGPO, Critic-Guided diffusion Policy Optimization, which balances exploration and exploitation by integrating a training-free guidance technique into the denoising process of the diffusion policy. Concretely, CGPO steers action generation toward high-value regions defined by the critic network and uses the guided actions as regression objectives. In this manner, CGPO reduces the time required to obtain high-quality actions and improves final performance through a better exploration-exploitation balance. We validate the effectiveness of CGPO on 5 MuJoCo locomotion tasks, where CGPO achieves state-of-the-art performance compared with existing diffusion-based RL methods. Notably, CGPO is the first method to successfully incorporate a diffusion policy into real-world RL, achieving superior performance on Franka robot arm grasping tasks. The official project page is available at https://dingsht.tech/cgpo-webpage.
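The abstract describes two steps: guiding the denoising process with the critic, then regressing the policy onto the guided actions. The following is a minimal toy sketch of that idea, not the authors' implementation. Everything in it is an assumption made for illustration: the quadratic critic `q_value`, its analytic gradient `q_grad`, the shrink-toward-zero update standing in for a learned denoiser, and all parameter values (`guidance_scale`, `shrink`, `num_steps`).

```python
import numpy as np

def q_value(action, target):
    """Hypothetical toy critic: Q(a) = -||a - target||^2 (higher is better)."""
    return -np.sum((action - target) ** 2, axis=-1)

def q_grad(action, target):
    """Analytic gradient of the toy critic with respect to the action."""
    return -2.0 * (action - target)

def guided_denoise(noise, target, num_steps=20, guidance_scale=0.1, shrink=0.9):
    """Toy reverse process with critic guidance.

    Each step applies a placeholder denoising update (shrinking the sample,
    standing in for a learned diffusion policy) and then nudges the sample
    along the critic gradient. No extra network is trained for the guidance.
    """
    action = noise.copy()
    for _ in range(num_steps):
        action = shrink * action                                   # placeholder denoiser
        action = action + guidance_scale * q_grad(action, target)  # critic guidance
    return np.clip(action, -1.0, 1.0)                              # keep actions bounded

def regression_loss(policy_actions, guided_actions):
    """Mean squared error that pulls policy outputs toward the guided actions."""
    return float(np.mean((policy_actions - guided_actions) ** 2))

rng = np.random.default_rng(0)
target = np.array([0.5, -0.3])             # assumed location of the high-value region
noise = rng.standard_normal((256, 2))      # initial noise for 256 sampled actions

unguided = guided_denoise(noise, target, guidance_scale=0.0)
guided = guided_denoise(noise, target, guidance_scale=0.1)

print("mean Q without guidance:", q_value(unguided, target).mean())
print("mean Q with guidance:   ", q_value(guided, target).mean())
print("regression loss:        ", regression_loss(unguided, guided))
```

In this toy setting the guided samples score higher under the critic than the unguided ones, and the regression loss measures how far the unguided policy outputs are from the guided targets. In an actual method the denoiser and critic would be learned networks, and the critic gradient would come from automatic differentiation.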

Related papers