Entropy-regularized Diffusion Policy with Q-Ensembles for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2402.04080v2
- Date: Thu, 07 Nov 2024 19:23:16 GMT
- Title: Entropy-regularized Diffusion Policy with Q-Ensembles for Offline Reinforcement Learning
- Authors: Ruoqi Zhang, Ziwei Luo, Jens Sjölund, Thomas B. Schön, Per Mattsson
- Abstract summary: This paper presents advanced techniques for training diffusion policies for offline reinforcement learning (RL).
We show that the underlying mean-reverting SDE has a solution that we can use to calculate the log probability of the policy, yielding an entropy regularizer that improves the exploration of offline datasets.
By combining the entropy-regularized diffusion policy with Q-ensembles in offline RL, our method achieves state-of-the-art performance on most tasks in D4RL benchmarks.
- Score: 11.0460569590737
- Abstract: This paper presents advanced techniques for training diffusion policies for offline reinforcement learning (RL). At the core is a mean-reverting stochastic differential equation (SDE) that transforms a complex action distribution into a standard Gaussian and then samples actions conditioned on the environment state with a corresponding reverse-time SDE, like a typical diffusion policy. We show that such an SDE has a solution that we can use to calculate the log probability of the policy, yielding an entropy regularizer that improves the exploration of offline datasets. To mitigate the impact of inaccurate value functions arising from out-of-distribution data points, we further propose to learn the lower confidence bound of Q-ensembles for more robust policy improvement. By combining the entropy-regularized diffusion policy with Q-ensembles in offline RL, our method achieves state-of-the-art performance on most tasks in D4RL benchmarks. Code is available at https://github.com/ruoqizzz/Entropy-Regularized-Diffusion-Policy-with-QEnsemble.
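The abstract describes two components that can be made concrete: a pessimistic value from the lower confidence bound (LCB) of a Q-ensemble, and an entropy bonus computed from the approximate log probability of actions drawn by the reverse-time SDE sampler. Below is a minimal PyTorch-style sketch of how these two pieces might combine in a policy-improvement loss. It is not the authors' implementation (the linked repository contains that); the names `sample_with_logprob`, `lcb_coef`, and `entropy_coef`, and all network sizes, are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QEnsemble(nn.Module):
    """An ensemble of independent Q-networks; sizes are illustrative."""

    def __init__(self, state_dim, action_dim, n_critics=10, hidden=256):
        super().__init__()
        self.critics = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for _ in range(n_critics)
        ])

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        # Shape: (n_critics, batch, 1)
        return torch.stack([critic(x) for critic in self.critics], dim=0)

def q_lcb(q_values, lcb_coef=2.0):
    # Lower confidence bound over the ensemble: mean minus a multiple of the
    # standard deviation, penalizing actions the critics disagree on.
    return q_values.mean(dim=0) - lcb_coef * q_values.std(dim=0)

def policy_improvement_loss(diffusion_policy, q_ensemble, state, entropy_coef=0.05):
    # `sample_with_logprob` is assumed to run the reverse-time SDE conditioned
    # on the state and return an action plus its approximate log probability.
    action, log_prob = diffusion_policy.sample_with_logprob(state)  # (batch, act_dim), (batch,)
    pessimistic_q = q_lcb(q_ensemble(state, action)).squeeze(-1)    # (batch,)
    # Maximize pessimistic Q plus entropy (-log_prob): minimize the negation.
    return (entropy_coef * log_prob - pessimistic_q).mean()
```

The LCB is what makes the ensemble useful for offline RL: actions far from the dataset tend to produce disagreeing Q-estimates, so the mean-minus-std bound down-weights them during policy improvement.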
Related papers
- Sampling from Energy-based Policies using Diffusion [14.542411354617983]
We introduce a diffusion-based approach for sampling from energy-based policies, where the negative Q-function defines the energy function.
We show that our approach enhances exploration and captures multimodal behavior in continuous control tasks, addressing key limitations of existing methods.
arXiv Detail & Related papers (2024-10-02T08:09:33Z) - Diffusion Actor-Critic: Formulating Constrained Policy Iteration as Diffusion Noise Regression for Offline Reinforcement Learning [13.163511229897667]
- Diffusion Actor-Critic: Formulating Constrained Policy Iteration as Diffusion Noise Regression for Offline Reinforcement Learning [13.163511229897667]
In offline reinforcement learning (RL), it is necessary to manage out-of-distribution actions to prevent overestimation of value functions.
We propose Diffusion Actor-Critic (DAC) that formulates the Kullback-Leibler (KL) constraint policy iteration as a diffusion noise regression problem.
Our approach is evaluated on the D4RL benchmarks and outperforms the state-of-the-art in almost all environments.
arXiv Detail & Related papers (2024-05-31T00:41:04Z) - Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality.
We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO).
Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions.
We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
arXiv Detail & Related papers (2024-05-25T10:45:46Z) - Diffusion Actor-Critic with Entropy Regulator [32.79341490514616]
We propose an online RL algorithm termed diffusion actor-critic with entropy regulator (DACER).
This algorithm conceptualizes the reverse process of the diffusion model as a novel policy function.
Experiments on MuJoCo benchmarks and a multimodal task demonstrate that the DACER algorithm achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-05-24T03:23:27Z) - Stabilizing Policy Gradients for Stochastic Differential Equations via Consistency with Perturbation Process [11.01014302314467]
We focus on optimizing stochastic differential equations (SDEs) parameterized by deep neural networks.
We propose constraining the SDE to be consistent with its associated perturbation process.
Our framework offers a versatile selection of policy gradient methods to effectively and efficiently train SDEs.
arXiv Detail & Related papers (2024-03-07T02:24:45Z) - Learning from Sparse Offline Datasets via Conservative Density
Estimation [27.93418377019955]
We propose a novel training algorithm called Conservative Density Estimation (CDE).
CDE addresses the challenge by explicitly imposing constraints on the state-action occupancy stationary distribution.
Our method achieves state-of-the-art performance on the D4RL benchmark.
arXiv Detail & Related papers (2024-01-16T20:42:15Z) - Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline
Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z) - Offline Imitation Learning with Suboptimal Demonstrations via Relaxed
Distribution Matching [109.5084863685397]
Offline imitation learning (IL) promises the ability to learn performant policies from pre-collected demonstrations without interactions with the environment.
We present RelaxDICE, which employs an asymmetrically-relaxed f-divergence for explicit support regularization.
Our method significantly outperforms the best prior offline method in six standard continuous control environments.
arXiv Detail & Related papers (2023-03-05T03:35:11Z) - Offline RL With Realistic Datasets: Heteroskedasticity and Support
Constraints [82.43359506154117]
We show that typical offline reinforcement learning methods fail to learn from data with non-uniform variability.
Our method is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation.
arXiv Detail & Related papers (2022-11-02T11:36:06Z) - Diffusion Policies as an Expressive Policy Class for Offline
Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset.
We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy.
We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
arXiv Detail & Related papers (2022-08-12T09:54:11Z) - Implicit Distributional Reinforcement Learning [61.166030238490634]
- Implicit Distributional Reinforcement Learning [61.166030238490634]
We propose the implicit distributional actor-critic (IDAC), built on two deep generator networks (DGNs) and a semi-implicit actor (SIA) powered by a flexible policy distribution.
We observe IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
arXiv Detail & Related papers (2020-07-13T02:52:18Z)