IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion
Policies
- URL: http://arxiv.org/abs/2304.10573v2
- Date: Fri, 19 May 2023 18:31:04 GMT
- Title: IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion
Policies
- Authors: Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub
Grudzien Kuba, Sergey Levine
- Abstract summary: Implicit Q-learning (IQL) trains a Q-function using only dataset actions through a modified Bellman backup.
It is unclear which policy actually attains the values represented by this trained Q-function.
We introduce Implicit Diffusion Q-learning (IDQL), which combines a general IQL critic with diffusion-based policy extraction.
- Score: 72.4573167739712
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective offline RL methods require properly handling out-of-distribution
actions. Implicit Q-learning (IQL) addresses this by training a Q-function
using only dataset actions through a modified Bellman backup. However, it is
unclear which policy actually attains the values represented by this implicitly
trained Q-function. In this paper, we reinterpret IQL as an actor-critic method
by generalizing the critic objective and connecting it to a
behavior-regularized implicit actor. This generalization shows how the induced
actor balances reward maximization and divergence from the behavior policy,
with the specific loss choice determining the nature of this tradeoff. Notably,
this actor can exhibit complex and multimodal characteristics, suggesting
issues with the conditional Gaussian actor fit with advantage weighted
regression (AWR) used in prior methods. Instead, we propose sampling from a
diffusion-parameterized behavior policy and using weights computed from the
critic to importance sample our intended policy. We introduce Implicit Diffusion
Q-learning (IDQL), combining our general IQL critic with the policy extraction
method. IDQL maintains the ease of implementation of IQL while outperforming
prior offline RL methods and demonstrating robustness to hyperparameters. Code
is available at https://github.com/philippe-eecs/IDQL.
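The policy extraction step lends itself to a short sketch. The snippet below is an illustrative reading of critic-weighted resampling, not the released implementation: `behavior_policy` (e.g. a trained diffusion model exposing a `sample` method), `q_net`, and `v_net` are assumed interfaces, and the softmax-over-advantages weighting is one simple choice of critic-derived weights (a hard argmax over candidate Q-values is another).

```python
import torch

@torch.no_grad()
def extract_action(state, behavior_policy, q_net, v_net,
                   num_samples=32, temperature=1.0):
    """Critic-weighted resampling from a (e.g. diffusion) behavior policy.

    Minimal sketch of IDQL-style policy extraction; `behavior_policy`,
    `q_net`, and `v_net` are assumed interfaces, not the paper's code.
    `state` is a 1-D tensor of shape (state_dim,).
    """
    # Draw candidate actions for this state from the behavior policy.
    states = state.unsqueeze(0).expand(num_samples, -1)          # (K, state_dim)
    actions = behavior_policy.sample(states)                     # (K, action_dim)

    # Weight each candidate by its advantage under the IQL critic.
    adv = (q_net(states, actions) - v_net(states)).squeeze(-1)   # (K,)
    weights = torch.softmax(adv / temperature, dim=0)

    # Resample one candidate according to the critic-derived weights;
    # temperature -> 0 approaches a greedy argmax over the candidates.
    idx = torch.multinomial(weights, num_samples=1).item()
    return actions[idx]
```

At evaluation time this replaces the AWR-fit Gaussian actor: multimodality in the behavior distribution is preserved by the sampler, and the critic only reweights in-distribution candidates.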
Related papers
- Diffusion Actor-Critic: Formulating Constrained Policy Iteration as Diffusion Noise Regression for Offline Reinforcement Learning [13.163511229897667]
In offline reinforcement learning (RL), it is necessary to manage out-of-distribution actions to prevent overestimation of value functions.
We propose Diffusion Actor-Critic (DAC) that formulates the Kullback-Leibler (KL) constraint policy iteration as a diffusion noise regression problem.
Our approach is evaluated on the D4RL benchmarks and outperforms the state-of-the-art in almost all environments.
arXiv Detail & Related papers (2024-05-31T00:41:04Z) - AlignIQL: Policy Alignment in Implicit Q-Learning through Constrained Optimization [9.050431569438636]
- AlignIQL: Policy Alignment in Implicit Q-Learning through Constrained Optimization [9.050431569438636]
Implicit Q-learning serves as a strong baseline for offline RL.
We introduce a different way to solve the implicit policy-finding problem (IPF) by formulating this problem as an optimization problem.
Compared with IQL and IDQL, we find our method keeps the simplicity of IQL and solves the implicit policy-finding problem.
arXiv Detail & Related papers (2024-05-28T14:01:03Z) - Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline
Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also out-performs competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z) - Offline RL with No OOD Actions: In-Sample Learning via Implicit Value
Regularization [90.9780151608281]
In-sample learning (IQL) improves the policy by expectile regression using only data samples.
We make a key finding that the in-sample learning paradigm arises under the Implicit Value Regularization (IVR) framework.
We propose two practical algorithms, Sparse $Q$-learning (SQL) and Exponential $Q$-learning (EQL), which adopt the same value regularization used in existing works.
arXiv Detail & Related papers (2023-03-28T08:30:01Z) - Quantile Filtered Imitation Learning [49.11859771578969]
Quantile filtered imitation learning (QFIL) is a policy improvement operator designed for offline reinforcement learning.
We prove that QFIL gives us a safe policy improvement step with function approximation.
We see that QFIL performs well on the D4RL benchmark.
arXiv Detail & Related papers (2021-12-02T03:08:23Z) - Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z) - Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
- Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.