DiffCPS: Diffusion Model based Constrained Policy Search for Offline
Reinforcement Learning
- URL: http://arxiv.org/abs/2310.05333v2
- Date: Wed, 28 Feb 2024 13:48:09 GMT
- Title: DiffCPS: Diffusion Model based Constrained Policy Search for Offline
Reinforcement Learning
- Authors: Longxiang He, Li Shen, Linrui Zhang, Junbo Tan, Xueqian Wang
- Abstract summary: Constrained policy search is a fundamental problem in offline reinforcement learning.
We propose a novel approach, $\textbf{Diffusion-based Constrained Policy Search}$ (dubbed DiffCPS).
- Score: 11.678012836760967
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Constrained policy search (CPS) is a fundamental problem in offline
reinforcement learning, which is generally solved by advantage weighted
regression (AWR). However, previous methods may still encounter
out-of-distribution actions due to the limited expressivity of Gaussian-based
policies. On the other hand, directly applying state-of-the-art models with
distribution expression capabilities (i.e., diffusion models) in the AWR
framework is infeasible, since AWR requires exact policy probability densities,
which are intractable for diffusion models. In this paper, we propose a novel
approach, $\textbf{Diffusion-based Constrained Policy Search}$ (dubbed
DiffCPS), which tackles the diffusion-based constrained policy search with the
primal-dual method. The theoretical analysis reveals that strong duality holds
for diffusion-based CPS problems, and upon introducing parameter approximation,
an approximate solution can be obtained after $\mathcal{O}(1/\epsilon)$ dual
iterations, where $\epsilon$ denotes the representation ability of the
parametrized policy. Extensive experimental results based on the D4RL benchmark
demonstrate the efficacy of our approach. We empirically show that DiffCPS
achieves better or at least competitive performance compared to traditional
AWR-based baselines as well as recent diffusion-based offline RL methods. The
code is now available at https://github.com/felix-thu/DiffCPS.
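For intuition only, here is a minimal sketch of the primal-dual scheme the abstract describes: a primal gradient step on a Lagrangian of the constrained objective, followed by projected dual ascent on the multiplier. This is not the authors' implementation (see the repository above for that); `policy.sample`, `policy.denoising_loss`, and `q_net` are assumed interfaces, with the denoising loss standing in as a surrogate for the behavior-policy constraint.

```python
import torch

def diffcps_style_step(policy, q_net, states, actions,
                       lam, eps, policy_opt, lam_lr=1e-3):
    """One hypothetical primal-dual iteration; lam is a nonnegative scalar tensor."""
    # Constraint surrogate: the diffusion (denoising) loss on dataset
    # actions, standing in for divergence from the behavior policy.
    constraint = policy.denoising_loss(states, actions)
    # Primal step: ascend L(pi, lam) = E[Q(s, a)] - lam * (constraint - eps)
    # in the policy parameters; sampling must stay differentiable.
    new_actions = policy.sample(states)
    lagrangian = q_net(states, new_actions).mean() - lam * (constraint - eps)
    policy_opt.zero_grad()
    (-lagrangian).backward()
    policy_opt.step()
    # Dual step: projected gradient ascent on the multiplier (lam >= 0).
    with torch.no_grad():
        lam = torch.clamp(lam + lam_lr * (constraint.detach() - eps), min=0.0)
    return lam
```

Under strong duality, alternating these two steps drives the multiplier toward the optimal dual variable, which is what the $\mathcal{O}(1/\epsilon)$ iteration bound in the abstract quantifies.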
Related papers
- Diffusion Policies creating a Trust Region for Offline Reinforcement Learning [66.17291150498276]
We introduce a dual policy approach, Diffusion Trusted Q-Learning (DTQL), which comprises a diffusion policy for pure behavior cloning and a practical one-step policy.
DTQL eliminates the need for iterative denoising sampling during both training and inference, making it remarkably computationally efficient.
We show that DTQL not only outperforms other methods on the majority of D4RL benchmark tasks but also demonstrates efficiency in training and inference speed.
arXiv Detail & Related papers (2024-05-30T05:04:33Z)
- Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality.
We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO).
Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions.
We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
arXiv Detail & Related papers (2024-05-25T10:45:46Z)
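As a rough sketch of what a Q-weighted variational loss can look like (the exact weighting, and the conditions under which the bound on the policy objective is tight, are specified in the QVPO paper, not here), one can scale the per-sample denoising loss of the diffusion policy by a nonnegative transform of the Q-value; `eps_model`, `q_net`, and the ReLU weight below are illustrative assumptions.

```python
import torch

def q_weighted_denoising_loss(eps_model, q_net, states, actions, alpha_bars):
    """Illustrative Q-weighted denoising loss; not QVPO's exact form."""
    B = states.shape[0]
    t = torch.randint(0, len(alpha_bars), (B,))
    abar = alpha_bars[t].unsqueeze(-1)                 # (B, 1)
    noise = torch.randn_like(actions)
    # Forward corruption of dataset actions to a random noise level t.
    x_t = abar.sqrt() * actions + (1 - abar).sqrt() * noise
    per_sample = ((noise - eps_model(x_t, states, t)) ** 2).mean(dim=-1)
    with torch.no_grad():
        weights = torch.relu(q_net(states, actions))   # assumed weight transform
    return (weights * per_sample).mean()
```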
- Diffusion Actor-Critic with Entropy Regulator [32.79341490514616]
We propose an online RL algorithm termed diffusion actor-critic with entropy regulator (DACER).
This algorithm conceptualizes the reverse process of the diffusion model as a novel policy function.
Experiments on MuJoCo benchmarks and a multimodal task demonstrate that the DACER algorithm achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-05-24T03:23:27Z)
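The "reverse process as policy" idea is generic enough to sketch: an action is produced by ancestral DDPM sampling conditioned on the state, so the policy's stochasticity comes from the injected noise. The sketch below shows plain reverse-time sampling only; DACER's entropy regulator is not reproduced here, and `eps_model` is an assumed state-conditioned noise predictor.

```python
import torch

def diffusion_policy_act(eps_model, state, action_dim, T=20):
    # Plain DDPM ancestral sampling, conditioned on the state: start from
    # Gaussian noise and denoise step by step; x at t = 0 is the action.
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(state.shape[0], action_dim)
    for t in reversed(range(T)):
        t_batch = torch.full((state.shape[0],), t)
        eps_hat = eps_model(x, state, t_batch)
        # Posterior mean of the reverse step (sigma_t^2 = beta_t variant).
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
    return x
```

Because the resulting action distribution can be multimodal, such a policy can represent behaviors that a single Gaussian head cannot.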
- Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint [56.74058752955209]
This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF).
We first identify the primary challenge of existing popular methods like offline PPO and offline DPO as a lack of strategic exploration of the environment.
We propose efficient algorithms with finite-sample theoretical guarantees.
arXiv Detail & Related papers (2023-12-18T18:58:42Z)
- Efficient Diffusion Policies for Offline Reinforcement Learning [85.73757789282212]
Diffusion-QL significantly boosts the performance of offline RL by representing a policy with a diffusion model.
We propose efficient diffusion policy (EDP) to overcome the computational challenges this introduces.
EDP constructs actions from corrupted ones at training time, avoiding the need to run the full sampling chain.
arXiv Detail & Related papers (2023-05-31T17:55:21Z)
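"Constructing actions from corrupted ones" admits a compact sketch via the standard DDPM identity: noise a dataset action to a random level, then invert the forward process in a single step using the predicted noise, so no multi-step sampling chain is needed during training. Function and argument names below are assumptions, not EDP's actual API.

```python
import torch

def one_step_action_approximation(eps_model, states, actions, alpha_bars):
    """Reconstruct actions in one step from noised dataset actions."""
    B = actions.shape[0]
    t = torch.randint(0, len(alpha_bars), (B,))
    abar = alpha_bars[t].unsqueeze(-1)                 # (B, 1)
    noise = torch.randn_like(actions)
    # Forward corruption: x_t = sqrt(abar_t) * a + sqrt(1 - abar_t) * eps.
    x_t = abar.sqrt() * actions + (1 - abar).sqrt() * noise
    # One-step inversion of the forward process with the predicted noise.
    eps_hat = eps_model(x_t, states, t)
    return (x_t - (1 - abar).sqrt() * eps_hat) / abar.sqrt()
```

The reconstructed actions can then be scored by a critic for policy improvement, at a fraction of the cost of full reverse-time sampling.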
- Policy Representation via Diffusion Probability Model for Reinforcement Learning [67.56363353547775]
We build a theoretical foundation of policy representation via the diffusion probability model.
We present a convergence guarantee for diffusion policy, which provides a theory to understand the multimodality of diffusion policy.
We propose DIPO, an implementation of model-free online RL with DIffusion POlicy.
arXiv Detail & Related papers (2023-05-22T15:23:41Z)
- Offline Primal-Dual Reinforcement Learning for Linear MDPs [16.782625445546273]
Offline Reinforcement Learning (RL) aims to learn a near-optimal policy from a fixed dataset of transitions collected by another policy.
This paper proposes a primal-dual optimization method based on the linear programming formulation of RL.
arXiv Detail & Related papers (2023-05-22T11:45:23Z)
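For reference, the linear programming formulation that such primal-dual methods start from is the classical LP over discounted state-action occupancy measures, written here for the tabular case (the linear-MDP version replaces these constraints with their feature-space analogues):

```latex
\begin{align*}
\max_{d \ge 0} \quad & \sum_{s,a} d(s,a)\, r(s,a) \\
\text{s.t.} \quad & \sum_{a} d(s,a) \;=\; (1-\gamma)\,\rho_0(s)
  \;+\; \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a') \qquad \forall s,
\end{align*}
```

whose dual variables are state values; optimizing the primal occupancy measure and the dual value iterates jointly is what makes the method "primal-dual".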
- Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset.
We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy.
We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
arXiv Detail & Related papers (2022-08-12T09:54:11Z)
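The Diffusion-QL recipe is compact enough to sketch: the training loss couples the diffusion model's behavior-cloning (denoising) loss with a Q-maximization term evaluated on actions sampled differentiably from the policy. Method names below (`denoising_loss`, `sample`) are assumed interfaces, and the weighting of the Q term is simplified relative to the paper.

```python
import torch

def diffusion_ql_loss(policy, q_net, states, actions, alpha=1.0):
    # Behavior-cloning term: the diffusion model's denoising loss on
    # dataset actions keeps the policy close to the data distribution.
    bc_loss = policy.denoising_loss(states, actions)
    # Policy-improvement term: maximize Q on actions sampled from the
    # (differentiable) reverse chain of the diffusion policy.
    new_actions = policy.sample(states)
    q_term = q_net(states, new_actions).mean()
    return bc_loss - alpha * q_term
```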