DiffCPS: Diffusion Model based Constrained Policy Search for Offline
Reinforcement Learning
- URL: http://arxiv.org/abs/2310.05333v2
- Date: Wed, 28 Feb 2024 13:48:09 GMT
- Title: DiffCPS: Diffusion Model based Constrained Policy Search for Offline
Reinforcement Learning
- Authors: Longxiang He, Li Shen, Linrui Zhang, Junbo Tan, Xueqian Wang
- Abstract summary: Constrained policy search is a fundamental problem in offline reinforcement learning.
We propose a novel approach, $\textbf{Diffusion-based Constrained Policy Search}$ (dubbed DiffCPS).
- Score: 11.678012836760967
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Constrained policy search (CPS) is a fundamental problem in offline
reinforcement learning, which is generally solved by advantage weighted
regression (AWR). However, previous methods may still encounter
out-of-distribution actions due to the limited expressivity of Gaussian-based
policies. On the other hand, directly applying state-of-the-art models with
distribution expression capabilities (i.e., diffusion models) in the AWR
framework is infeasible, since AWR requires exact policy probability densities,
which are intractable for diffusion models. In this paper, we propose a novel
approach, $\textbf{Diffusion-based Constrained Policy Search}$ (dubbed
DiffCPS), which tackles the diffusion-based constrained policy search with the
primal-dual method. The theoretical analysis reveals that strong duality holds
for diffusion-based CPS problems, and upon introducing parameter approximation,
an approximate solution can be obtained after $\mathcal{O}(1/\epsilon)$ dual
iterations, where $\epsilon$ denotes the representation ability of the
parametrized policy. Extensive experimental results based on the D4RL benchmark
demonstrate the efficacy of our approach. We empirically show that DiffCPS
achieves better or at least competitive performance compared to traditional
AWR-based baselines as well as recent diffusion-based offline RL methods. The
code is now available at https://github.com/felix-thu/DiffCPS.
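For intuition only, here is a minimal sketch of the primal-dual scheme the abstract describes: a primal gradient step on a Lagrangian of the constrained objective, followed by projected dual ascent on the multiplier. This is not the authors' implementation (see the repository above for that); `policy.sample`, `policy.denoising_loss`, and `q_net` are assumed interfaces, with the denoising loss standing in as a surrogate for the behavior-policy constraint.

```python
import torch

def diffcps_style_step(policy, q_net, states, actions,
                       lam, eps, policy_opt, lam_lr=1e-3):
    """One hypothetical primal-dual iteration; lam is a nonnegative scalar tensor."""
    # Constraint surrogate: the diffusion (denoising) loss on dataset
    # actions, standing in for divergence from the behavior policy.
    constraint = policy.denoising_loss(states, actions)
    # Primal step: ascend L(pi, lam) = E[Q(s, a)] - lam * (constraint - eps)
    # in the policy parameters; sampling must stay differentiable.
    new_actions = policy.sample(states)
    lagrangian = q_net(states, new_actions).mean() - lam * (constraint - eps)
    policy_opt.zero_grad()
    (-lagrangian).backward()
    policy_opt.step()
    # Dual step: projected gradient ascent on the multiplier (lam >= 0).
    with torch.no_grad():
        lam = torch.clamp(lam + lam_lr * (constraint.detach() - eps), min=0.0)
    return lam
```

Under strong duality, alternating these two steps drives the multiplier toward the optimal dual variable, which is what the $\mathcal{O}(1/\epsilon)$ iteration bound in the abstract quantifies.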
Related papers
- Diffusion Policies creating a Trust Region for Offline Reinforcement Learning [66.17291150498276]
We introduce a dual policy approach, Diffusion Trusted Q-Learning (DTQL), which comprises a diffusion policy for pure behavior cloning and a practical one-step policy.
DTQL eliminates the need for iterative denoising sampling during both training and inference, making it remarkably computationally efficient.
We show that DTQL not only outperforms other methods on the majority of D4RL benchmark tasks but also demonstrates efficiency in training and inference speed.
arXiv Detail & Related papers (2024-05-30T05:04:33Z)
- Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality.
We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO).
Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions.
We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
arXiv Detail & Related papers (2024-05-25T10:45:46Z)
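As a rough sketch of what a Q-weighted variational loss can look like (the exact weighting, and the conditions under which the bound on the policy objective is tight, are specified in the QVPO paper, not here), one can scale the per-sample denoising loss of the diffusion policy by a nonnegative transform of the Q-value; `eps_model`, `q_net`, and the ReLU weight below are illustrative assumptions.

```python
import torch

def q_weighted_denoising_loss(eps_model, q_net, states, actions, alpha_bars):
    """Illustrative Q-weighted denoising loss; not QVPO's exact form."""
    B = states.shape[0]
    t = torch.randint(0, len(alpha_bars), (B,))
    abar = alpha_bars[t].unsqueeze(-1)                 # (B, 1)
    noise = torch.randn_like(actions)
    # Forward corruption of dataset actions to a random noise level t.
    x_t = abar.sqrt() * actions + (1 - abar).sqrt() * noise
    per_sample = ((noise - eps_model(x_t, states, t)) ** 2).mean(dim=-1)
    with torch.no_grad():
        weights = torch.relu(q_net(states, actions))   # assumed weight transform
    return (weights * per_sample).mean()
```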
- Diffusion Actor-Critic with Entropy Regulator [32.79341490514616]
We propose an online RL algorithm termed diffusion actor-critic with entropy regulator (DACER).
This algorithm conceptualizes the reverse process of the diffusion model as a novel policy function.
Experiments on MuJoCo benchmarks and a multimodal task demonstrate that the DACER algorithm achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-05-24T03:23:27Z)
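The "reverse process as policy" idea is generic enough to sketch: an action is produced by ancestral DDPM sampling conditioned on the state, so the policy's stochasticity comes from the injected noise. The sketch below shows plain reverse-time sampling only; DACER's entropy regulator is not reproduced here, and `eps_model` is an assumed state-conditioned noise predictor.

```python
import torch

def diffusion_policy_act(eps_model, state, action_dim, T=20):
    # Plain DDPM ancestral sampling, conditioned on the state: start from
    # Gaussian noise and denoise step by step; x at t = 0 is the action.
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(state.shape[0], action_dim)
    for t in reversed(range(T)):
        t_batch = torch.full((state.shape[0],), t)
        eps_hat = eps_model(x, state, t_batch)
        # Posterior mean of the reverse step (sigma_t^2 = beta_t variant).
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
    return x
```

Because the resulting action distribution can be multimodal, such a policy can represent behaviors that a single Gaussian head cannot.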
- Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint [56.74058752955209]
This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF).
We first identify the primary challenge of existing popular methods like offline PPO and offline DPO as a lack of strategic exploration of the environment.
We propose efficient algorithms with finite-sample theoretical guarantees.
arXiv Detail & Related papers (2023-12-18T18:58:42Z)
- Efficient Diffusion Policies for Offline Reinforcement Learning [85.73757789282212]
Diffusion-QL significantly boosts the performance of offline RL by representing a policy with a diffusion model.
We propose efficient diffusion policy (EDP) to overcome the computational challenges this introduces.
EDP constructs actions from corrupted ones at training time, avoiding the need to run the full sampling chain.
arXiv Detail & Related papers (2023-05-31T17:55:21Z)
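"Constructing actions from corrupted ones" admits a compact sketch via the standard DDPM identity: noise a dataset action to a random level, then invert the forward process in a single step using the predicted noise, so no multi-step sampling chain is needed during training. Function and argument names below are assumptions, not EDP's actual API.

```python
import torch

def one_step_action_approximation(eps_model, states, actions, alpha_bars):
    """Reconstruct actions in one step from noised dataset actions."""
    B = actions.shape[0]
    t = torch.randint(0, len(alpha_bars), (B,))
    abar = alpha_bars[t].unsqueeze(-1)                 # (B, 1)
    noise = torch.randn_like(actions)
    # Forward corruption: x_t = sqrt(abar_t) * a + sqrt(1 - abar_t) * eps.
    x_t = abar.sqrt() * actions + (1 - abar).sqrt() * noise
    # One-step inversion of the forward process with the predicted noise.
    eps_hat = eps_model(x_t, states, t)
    return (x_t - (1 - abar).sqrt() * eps_hat) / abar.sqrt()
```

The reconstructed actions can then be scored by a critic for policy improvement, at a fraction of the cost of full reverse-time sampling.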
- Policy Representation via Diffusion Probability Model for Reinforcement Learning [67.56363353547775]
We build a theoretical foundation of policy representation via the diffusion probability model.
We present a convergence guarantee for diffusion policy, which provides a theory to understand the multimodality of diffusion policy.
We propose DIPO, an implementation of model-free online RL with DIffusion POlicy.
arXiv Detail & Related papers (2023-05-22T15:23:41Z)
- Offline Primal-Dual Reinforcement Learning for Linear MDPs [16.782625445546273]
Offline Reinforcement Learning (RL) aims to learn a near-optimal policy from a fixed dataset of transitions collected by another policy.
This paper proposes a primal-dual optimization method based on the linear programming formulation of RL.
arXiv Detail & Related papers (2023-05-22T11:45:23Z)
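For reference, the linear programming formulation that such primal-dual methods start from is the classical LP over discounted state-action occupancy measures, written here for the tabular case (the linear-MDP version replaces these constraints with their feature-space analogues):

```latex
\begin{align*}
\max_{d \ge 0} \quad & \sum_{s,a} d(s,a)\, r(s,a) \\
\text{s.t.} \quad & \sum_{a} d(s,a) \;=\; (1-\gamma)\,\rho_0(s)
  \;+\; \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a') \qquad \forall s,
\end{align*}
```

whose dual variables are state values; optimizing the primal occupancy measure and the dual value iterates jointly is what makes the method "primal-dual".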
- Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset.
We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy.
We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
arXiv Detail & Related papers (2022-08-12T09:54:11Z)
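The Diffusion-QL recipe is compact enough to sketch: the training loss couples the diffusion model's behavior-cloning (denoising) loss with a Q-maximization term evaluated on actions sampled differentiably from the policy. Method names below (`denoising_loss`, `sample`) are assumed interfaces, and the weighting of the Q term is simplified relative to the paper.

```python
import torch

def diffusion_ql_loss(policy, q_net, states, actions, alpha=1.0):
    # Behavior-cloning term: the diffusion model's denoising loss on
    # dataset actions keeps the policy close to the data distribution.
    bc_loss = policy.denoising_loss(states, actions)
    # Policy-improvement term: maximize Q on actions sampled from the
    # (differentiable) reverse chain of the diffusion policy.
    new_actions = policy.sample(states)
    q_term = q_net(states, new_actions).mean()
    return bc_loss - alpha * q_term
```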