Policy Representation via Diffusion Probability Model for Reinforcement
Learning
- URL: http://arxiv.org/abs/2305.13122v1
- Date: Mon, 22 May 2023 15:23:41 GMT
- Title: Policy Representation via Diffusion Probability Model for Reinforcement
Learning
- Authors: Long Yang, Zhixiong Huang, Fenghao Lei, Yucun Zhong, Yiming Yang, Cong
Fang, Shiting Wen, Binbin Zhou, Zhouchen Lin
- Abstract summary: We build a theoretical foundation of policy representation via the diffusion probability model.
We present a convergence guarantee for diffusion policy, which provides a theory for understanding its multimodality.
We propose DIPO, an implementation of model-free online RL with DIffusion POlicy.
- Score: 67.56363353547775
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Popular reinforcement learning (RL) algorithms tend to produce a unimodal
policy distribution, which weakens the expressiveness of complicated policies and
degrades the ability to explore. The diffusion probability model is powerful at
learning complicated multimodal distributions and has shown promising potential for
applications to RL. In this paper, we formally build a theoretical foundation of
policy representation via the diffusion probability model and provide practical
implementations of diffusion policy for online model-free RL. Concretely, we
characterize the diffusion policy as a stochastic process, which is a new approach
to representing a policy. We then present a convergence guarantee for the diffusion
policy, which provides a theory for understanding its multimodality. Furthermore,
we propose DIPO, an implementation of model-free online RL with a DIffusion POlicy.
To the best of our knowledge, DIPO is the first algorithm to solve model-free online
RL problems with a diffusion model. Finally, extensive empirical results show the
effectiveness and superiority of DIPO on the standard continuous-control MuJoCo
benchmark.
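As a rough illustration of what "policy representation via a diffusion model" means in practice, here is a minimal PyTorch sketch (not the authors' DIPO code): a noise-prediction network conditioned on the state, from which actions are sampled by iteratively denoising Gaussian noise with a DDPM-style reverse process. The network architecture, number of diffusion steps, and noise schedule are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Predicts the noise added to an action, conditioned on state and diffusion step."""
    def __init__(self, state_dim, action_dim, hidden=256, n_steps=20):
        super().__init__()
        self.step_emb = nn.Embedding(n_steps, hidden)
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + hidden, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, noisy_action, t):
        # t: LongTensor of diffusion-step indices, shape (batch,)
        return self.net(torch.cat([state, noisy_action, self.step_emb(t)], dim=-1))


@torch.no_grad()
def sample_action(model, state, action_dim, n_steps=20, beta_min=1e-4, beta_max=2e-2):
    """DDPM-style reverse process: start from Gaussian noise and denoise step by step."""
    betas = torch.linspace(beta_min, beta_max, n_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    a = torch.randn(state.shape[0], action_dim)  # a_T ~ N(0, I)
    for t in reversed(range(n_steps)):
        t_batch = torch.full((state.shape[0],), t, dtype=torch.long)
        eps = model(state, a, t_batch)
        # mean of the reverse transition p(a_{t-1} | a_t, s)
        a = (a - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            a = a + torch.sqrt(betas[t]) * torch.randn_like(a)
    return a.clamp(-1.0, 1.0)


# Usage (hypothetical dimensions): sample a batch of actions for 4 states
policy = NoisePredictor(state_dim=17, action_dim=6)
actions = sample_action(policy, torch.randn(4, 17), action_dim=6)
```

Because the sampled action is produced by a stochastic multi-step process rather than a single Gaussian head, the resulting action distribution can be multimodal, which is the property the paper's convergence analysis addresses.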
Related papers
- Understanding Reinforcement Learning-Based Fine-Tuning of Diffusion Models: A Tutorial and Review [63.31328039424469]
This tutorial provides a comprehensive survey of methods for fine-tuning diffusion models to optimize downstream reward functions.
We explain the application of various RL algorithms, including PPO, differentiable optimization, reward-weighted MLE, value-weighted sampling, and path consistency learning.
arXiv Detail & Related papers (2024-07-18T17:35:32Z) - Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient [26.675822002049372]
Deep Diffusion Policy Gradient (DDiffPG) is a novel actor-critic algorithm that learns multimodal policies from scratch.
DDiffPG forms a multimodal training batch and utilizes mode-specific Q-learning to mitigate the inherent greediness of the RL objective.
Our approach further allows the policy to be conditioned on mode-specific embeddings to explicitly control the learned modes.
arXiv Detail & Related papers (2024-06-02T09:32:28Z) - Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality.
We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO).
Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions.
We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
arXiv Detail & Related papers (2024-05-25T10:45:46Z) - Diffusion Actor-Critic with Entropy Regulator [32.79341490514616]
We propose an online RL algorithm termed diffusion actor-critic with entropy regulator (DACER).
This algorithm conceptualizes the reverse process of the diffusion model as a novel policy function.
Experiments on MuJoCo benchmarks and a multimodal task demonstrate that the DACER algorithm achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-05-24T03:23:27Z) - Consistency Models as a Rich and Efficient Policy Class for Reinforcement Learning [25.81859481634996]
Score-based generative models such as the diffusion model have been shown to be effective in modeling multi-modal data, from image generation to reinforcement learning (RL).
We propose to apply the consistency model as an efficient yet expressive policy representation, namely consistency policy, with an actor-critic style algorithm for three typical RL settings.
arXiv Detail & Related papers (2023-09-29T05:05:54Z) - Diffusion Policies as an Expressive Policy Class for Offline
Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset.
We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy.
We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
arXiv Detail & Related papers (2022-08-12T09:54:11Z) - Regularizing a Model-based Policy Stationary Distribution to Stabilize
Offline Reinforcement Learning [62.19209005400561]
Offline reinforcement learning (RL) extends the paradigm of classical RL algorithms to purely learning from static datasets.
A key challenge of offline RL is the instability of policy training, caused by the mismatch between the distribution of the offline data and the undiscounted stationary state-action distribution of the learned policy.
We regularize the undiscounted stationary distribution of the current policy towards the offline data during the policy optimization process.
arXiv Detail & Related papers (2022-06-14T20:56:16Z) - MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods by training them on rewards artificially penalized by the uncertainty of the learned dynamics (see the sketch after this list).
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
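The MOPO entry above hinges on a single mechanism: penalize the model-predicted reward by an uncertainty estimate of the learned dynamics before running standard model-based RL on the penalized rewards. Below is a minimal, hedged sketch of that penalty in PyTorch; the choice of ensemble disagreement (standard deviation across ensemble members) as the uncertainty quantifier, the coefficient lam, and the name penalized_reward are illustrative assumptions, not the paper's exact estimator.

```python
import torch

def penalized_reward(ensemble_reward_preds, lam=1.0):
    """
    MOPO-style reward penalty (sketch): given reward predictions from an ensemble
    of learned dynamics models, shape (n_models, batch), penalize the mean
    prediction by a disagreement-based uncertainty estimate u(s, a):

        r_tilde(s, a) = r_hat(s, a) - lam * u(s, a)

    Here u(s, a) is taken to be the ensemble standard deviation, a common
    stand-in for the uncertainty quantifier described in the paper.
    """
    r_hat = ensemble_reward_preds.mean(dim=0)  # mean prediction across the ensemble
    u = ensemble_reward_preds.std(dim=0)       # disagreement as an uncertainty proxy
    return r_hat - lam * u


# Usage: rewards predicted by, say, 7 ensemble members for a batch of 256 transitions
preds = torch.randn(7, 256)
r_tilde = penalized_reward(preds, lam=1.0)
```

The penalized reward makes the policy conservative in regions where the learned model is unreliable, which is what stabilizes offline policy optimization in this line of work.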