Diffusion Policies as an Expressive Policy Class for Offline
Reinforcement Learning
- URL: http://arxiv.org/abs/2208.06193v3
- Date: Fri, 25 Aug 2023 19:39:32 GMT
- Title: Diffusion Policies as an Expressive Policy Class for Offline
Reinforcement Learning
- Authors: Zhendong Wang, Jonathan J Hunt, Mingyuan Zhou
- Abstract summary: Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset.
We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy.
We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
- Score: 70.20191211010847
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline reinforcement learning (RL), which aims to learn an optimal policy
using a previously collected static dataset, is an important paradigm of RL.
Standard RL methods often perform poorly in this regime due to the function
approximation errors on out-of-distribution actions. While a variety of
regularization methods have been proposed to mitigate this issue, they are
often constrained by policy classes with limited expressiveness that can lead
to highly suboptimal solutions. In this paper, we propose representing the
policy as a diffusion model, a recent class of highly-expressive deep
generative models. We introduce Diffusion Q-learning (Diffusion-QL) that
utilizes a conditional diffusion model to represent the policy. In our
approach, we learn an action-value function and add a term that maximizes
action-values to the training loss of the conditional diffusion model, yielding
a loss that seeks optimal actions close to the behavior policy. We show that
both the expressiveness of the diffusion model-based policy and the coupling of
behavior cloning and policy improvement under the diffusion model contribute to
the outstanding performance of Diffusion-QL. We
illustrate the superiority of our method compared to prior works in a simple 2D
bandit example with a multimodal behavior policy. We then show that our method
can achieve state-of-the-art performance on the majority of the D4RL benchmark
tasks.
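For concreteness, the objective described above (a conditional-diffusion behavior-cloning loss plus a term that pushes sampled actions toward high action-values) can be sketched roughly as below. This is a minimal PyTorch-style sketch under stated assumptions, not the authors' implementation: the network signatures `eps_model(noisy_action, t, state)` and `q_net(state, action)`, the noise schedule `betas`, the Q-normalization, and the trade-off weight `eta` are illustrative choices.
```python
# Minimal sketch of a Diffusion-QL-style policy loss (not the authors' code).
# Assumptions: eps_model(noisy_action, t, state) predicts the injected noise,
# q_net(state, action) is a learned critic, betas is a 1-D DDPM noise schedule.
import torch


def sample_actions(eps_model, states, betas, act_dim):
    """Reverse diffusion: draw actions from the conditional diffusion policy.

    Gradients flow through the chain so that the Q-maximization term below
    can update the noise-prediction network.
    """
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    a = torch.randn(states.shape[0], act_dim, device=states.device)
    for t in reversed(range(len(betas))):
        t_vec = torch.full((states.shape[0],), t, device=states.device, dtype=torch.long)
        eps = eps_model(a, t_vec, states)
        a = (a - betas[t] / (1.0 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:  # add noise at every step except the last
            a = a + betas[t].sqrt() * torch.randn_like(a)
    return a.clamp(-1.0, 1.0)


def diffusion_ql_policy_loss(eps_model, q_net, states, actions, betas, eta):
    """Behavior-cloning denoising loss plus a Q-maximization term."""
    B = states.shape[0]
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)

    # Behavior cloning: standard denoising loss on dataset actions,
    # conditioned on the state.
    t = torch.randint(0, len(betas), (B,), device=states.device)
    noise = torch.randn_like(actions)
    ab_t = alpha_bar[t].unsqueeze(-1)
    noisy_actions = ab_t.sqrt() * actions + (1.0 - ab_t).sqrt() * noise
    bc_loss = ((eps_model(noisy_actions, t, states) - noise) ** 2).mean()

    # Policy improvement: push actions sampled from the diffusion policy
    # toward high Q-values (normalizing by the batch mean |Q| is one common choice).
    new_actions = sample_actions(eps_model, states, betas, actions.shape[-1])
    q = q_net(states, new_actions)
    q_loss = -q.mean() / q.abs().mean().detach()

    return bc_loss + eta * q_loss
```
In the full method the critic is trained alongside the policy with standard Q-learning targets, and the weight `eta` controls the balance between staying close to the behavior data and improving action-values.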
Related papers
- Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient [26.675822002049372]
Deep Diffusion Policy Gradient (DDiffPG) is a novel actor-critic algorithm that learns multimodal policies from scratch.
DDiffPG forms a multimodal training batch and utilizes mode-specific Q-learning to mitigate the inherent greediness of the RL objective.
Our approach further allows the policy to be conditioned on mode-specific embeddings to explicitly control the learned modes.
arXiv Detail & Related papers (2024-06-02T09:32:28Z)
- Diffusion Policies creating a Trust Region for Offline Reinforcement Learning [66.17291150498276]
We introduce a dual policy approach, Diffusion Trusted Q-Learning (DTQL), which comprises a diffusion policy for pure behavior cloning and a practical one-step policy.
DTQL eliminates the need for iterative denoising sampling during both training and inference, making it remarkably computationally efficient.
We show that DTQL not only outperforms other methods on the majority of the D4RL benchmark tasks but is also efficient in both training and inference speed.
arXiv Detail & Related papers (2024-05-30T05:04:33Z)
- Preferred-Action-Optimized Diffusion Policies for Offline Reinforcement Learning [19.533619091287676]
We propose a novel preferred-action-optimized diffusion policy for offline reinforcement learning.
In particular, an expressive conditional diffusion model is utilized to represent the diverse distribution of a behavior policy.
Experiments demonstrate that the proposed method provides competitive or superior performance compared to previous state-of-the-art offline RL methods.
arXiv Detail & Related papers (2024-05-29T03:19:59Z)
- Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality.
We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO).
Specifically, we introduce the Q-weighted variational loss, which is provably a tight lower bound of the policy objective in online RL under certain conditions.
We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
arXiv Detail & Related papers (2024-05-25T10:45:46Z)
- Score Regularized Policy Optimization through Diffusion Behavior [25.926641622408752]
Recent developments in offline reinforcement learning have uncovered the immense potential of diffusion modeling.
We propose to extract an efficient deterministic inference policy from critic models and pretrained diffusion behavior models.
Our method boosts action sampling speed by more than 25 times compared with various leading diffusion-based methods in locomotion tasks.
arXiv Detail & Related papers (2023-10-11T08:31:26Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks from the D4RL benchmark.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Policy Representation via Diffusion Probability Model for Reinforcement Learning [67.56363353547775]
We build a theoretical foundation of policy representation via the diffusion probability model.
We present a convergence guarantee for diffusion policy, which provides a theory to understand the multimodality of diffusion policy.
We propose DIPO, an implementation of model-free online RL with a DIffusion POlicy.
arXiv Detail & Related papers (2023-05-22T15:23:41Z)
- Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior-constrained policy optimization has been demonstrated to be a successful paradigm for tackling offline reinforcement learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
- Offline Reinforcement Learning via High-Fidelity Generative Behavior Modeling [34.88897402357158]
We show that due to the limited distributional expressivity of policy models, previous methods might still select unseen actions during training.
We adopt a generative approach by decoupling the learned policy into two parts: an expressive generative behavior model and an action evaluation model (a generic sketch of this decoupled pattern appears after this list).
Our proposed method achieves competitive or superior performance compared with state-of-the-art offline RL methods.
arXiv Detail & Related papers (2022-09-29T04:36:23Z)
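As a brief illustration of the decoupled behavior-model-plus-critic idea referenced in the last entry above: one generic way to act with such a pair is to sample several candidate actions from the behavior model and return the one the evaluation model scores highest. This is a hypothetical sketch of that general pattern, not the cited paper's exact procedure; `behavior_model.sample` and `critic` are assumed interfaces.
```python
# Generic sketch: act by re-ranking behavior-model samples with a critic.
# Illustrates the decoupled "generative behavior model + action evaluation
# model" idea; the interfaces below are assumptions, not a specific paper's API.
import torch


def select_action(behavior_model, critic, state, num_candidates=32):
    """Sample candidate actions from the behavior model, score them with the
    critic, and return the highest-scoring one."""
    states = state.unsqueeze(0).repeat(num_candidates, 1)   # (N, state_dim)
    candidates = behavior_model.sample(states)               # (N, act_dim), assumed API
    scores = critic(states, candidates).squeeze(-1)          # (N,)
    return candidates[scores.argmax()]
```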