Diffusion Actor-Critic: Formulating Constrained Policy Iteration as Diffusion Noise Regression for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2405.20555v2
- Date: Wed, 19 Feb 2025 04:35:05 GMT
- Title: Diffusion Actor-Critic: Formulating Constrained Policy Iteration as Diffusion Noise Regression for Offline Reinforcement Learning
- Authors: Linjiajie Fang, Ruoxue Liu, Jing Zhang, Wenjia Wang, Bing-Yi Jing
- Abstract summary: In offline reinforcement learning, it is necessary to manage out-of-distribution actions to prevent overestimation of value functions.
We propose Diffusion Actor-Critic (DAC), which formulates Kullback-Leibler (KL)-constrained policy iteration as a diffusion noise regression problem.
Our approach is evaluated on D4RL benchmarks and outperforms the state-of-the-art in nearly all environments.
- Score: 13.163511229897667
- Abstract: In offline reinforcement learning, it is necessary to manage out-of-distribution actions to prevent overestimation of value functions. One class of methods, policy regularization, addresses this problem by constraining the target policy to stay close to the behavior policy. Although several approaches suggest representing the behavior policy as an expressive diffusion model to boost performance, it remains unclear how to regularize the target policy given a diffusion-modeled behavior sampler. In this paper, we propose Diffusion Actor-Critic (DAC), which formulates KL-constrained policy iteration as a diffusion noise regression problem, enabling direct representation of target policies as diffusion models. Our approach follows the actor-critic learning paradigm, in which we alternately train a diffusion-modeled target policy and a critic network. The actor training loss includes a soft Q-guidance term derived from the Q-gradient. The soft Q-guidance is based on the theoretical solution of KL-constrained policy iteration, which prevents the learned policy from taking out-of-distribution actions. We demonstrate that this diffusion-based policy constraint, together with the use of the lower confidence bound of the Q-ensemble as the value target, not only preserves the multi-modality of target policies but also contributes to stable convergence and strong performance in DAC. Our approach is evaluated on D4RL benchmarks and outperforms the state of the art in nearly all environments. Code is available at https://github.com/Fang-Lin93/DAC.
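To make the construction in the abstract concrete, here is a minimal sketch of the standard KL-regularized policy-improvement result it builds on and of how a Q-gradient can shift a diffusion model's noise-regression target; the temperature \beta, the noise scale \sigma_t, and the exact form of DAC's soft Q-guidance are assumptions for illustration, not the paper's precise formulation.

```latex
% Standard KL-regularized policy improvement (behavior policy \mu, temperature \beta assumed):
%   \max_\pi \; \mathbb{E}_{a \sim \pi(\cdot\mid s)}[Q(s,a)] - \tfrac{1}{\beta}\,\mathrm{KL}\big(\pi(\cdot\mid s)\,\|\,\mu(\cdot\mid s)\big)
% has the closed-form solution
\[
\pi^{*}(a \mid s) \;\propto\; \mu(a \mid s)\,\exp\!\big(\beta\, Q(s,a)\big),
\]
% whose action score is the behavior score shifted by the Q-gradient:
\[
\nabla_{a} \log \pi^{*}(a \mid s) \;=\; \nabla_{a} \log \mu(a \mid s) \;+\; \beta\, \nabla_{a} Q(s,a).
\]
% Because a diffusion model's noise prediction approximates a negatively scaled score,
% the target policy can be trained by noise regression toward a shifted target
% (assumed form; \sigma_t is the noise scale at diffusion step t):
\[
\epsilon_{\text{target}} \;=\; \epsilon \;-\; \sigma_{t}\,\beta\,\nabla_{a} Q(s, a_{t}).
\]
```

The abstract also couples this constraint with the lower confidence bound of a Q-ensemble as the value target; a common LCB form (assumed here, including the coefficient rho) is the ensemble mean minus a multiple of its standard deviation:

```python
import torch

def lcb_value_target(q_values: torch.Tensor, rho: float = 1.0) -> torch.Tensor:
    """Lower-confidence-bound target from a Q-ensemble.

    q_values has shape (num_critics, batch_size): each critic's estimate for the
    same (state, action) batch. The mean-minus-std form and the coefficient rho
    are assumptions for illustration, not DAC's exact definition.
    """
    return q_values.mean(dim=0) - rho * q_values.std(dim=0)
```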
Related papers
- Behavior-Regularized Diffusion Policy Optimization for Offline Reinforcement Learning [22.333460316347264]
We introduce BDPO, a principled behavior-regularized RL framework tailored for diffusion-based policies.
We develop an efficient two-time-scale actor-critic RL algorithm that produces the optimal policy while respecting the behavior constraint.
arXiv Detail & Related papers (2025-02-07T09:30:35Z)
- Constrained Latent Action Policies for Model-Based Offline Reinforcement Learning [5.012314384895537]
In offline reinforcement learning, a policy is learned using a static dataset in the absence of costly feedback from the environment.
We propose Constrained Latent Action Policies (C-LAP) which learns a generative model of the joint distribution of observations and actions.
arXiv Detail & Related papers (2024-11-07T09:35:22Z)
- Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality.
We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO)
Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions.
We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
arXiv Detail & Related papers (2024-05-25T10:45:46Z)
- Learning a Diffusion Model Policy from Rewards via Q-Score Matching [93.0191910132874]
We present a theoretical framework linking the structure of diffusion model policies to a learned Q-function.
We propose a new policy update method from this theory, which we denote Q-score matching.
arXiv Detail & Related papers (2023-12-18T23:31:01Z)
- Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z)
- Policy Representation via Diffusion Probability Model for Reinforcement Learning [67.56363353547775]
We build a theoretical foundation of policy representation via the diffusion probability model.
We present a convergence guarantee for diffusion policy, which provides a theory to understand the multimodality of diffusion policy.
We propose DIPO, an implementation of model-free online RL with DIffusion POlicy.
arXiv Detail & Related papers (2023-05-22T15:23:41Z)
- Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset.
We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy.
We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
arXiv Detail & Related papers (2022-08-12T09:54:11Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset (see the brief sketch after this list).
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
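For the Implicit Q-Learning entry above, the following is a brief sketch of the expectile-regression value loss that lets the method avoid evaluating out-of-dataset actions; the notation follows the usual presentation of IQL rather than anything stated in this listing.

```latex
% Expectile-regression value loss used by IQL (standard presentation; symbols assumed):
\[
L_{V}(\psi) \;=\; \mathbb{E}_{(s,a)\sim\mathcal{D}}
  \Big[\, L_{2}^{\tau}\big(Q_{\hat\theta}(s,a) - V_{\psi}(s)\big) \Big],
\qquad
L_{2}^{\tau}(u) \;=\; \big|\,\tau - \mathbb{1}\{u < 0\}\,\big|\, u^{2},
\]
% with \tau \in (0.5, 1): larger \tau pushes V toward an upper expectile of Q over
% dataset actions, so training never evaluates Q at actions outside the dataset.
```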