Offline RL with No OOD Actions: In-Sample Learning via Implicit Value
Regularization
- URL: http://arxiv.org/abs/2303.15810v1
- Date: Tue, 28 Mar 2023 08:30:01 GMT
- Title: Offline RL with No OOD Actions: In-Sample Learning via Implicit Value
Regularization
- Authors: Haoran Xu, Li Jiang, Jianxiong Li, Zhuoran Yang, Zhaoran Wang, Victor
Wai Kin Chan, Xianyuan Zhan
- Abstract summary: The in-sample learning paradigm (i.e., IQL) improves the policy by quantile regression using only data samples.
We make a key finding that the in-sample learning paradigm arises under the Implicit Value Regularization (IVR) framework.
We propose two practical algorithms, Sparse $Q$-learning (SQL) and Exponential $Q$-learning (EQL), which adopt the same value regularization used in existing works.
- Score: 90.9780151608281
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Most offline reinforcement learning (RL) methods suffer from the trade-off
between improving the policy to surpass the behavior policy and constraining
the policy to limit the deviation from the behavior policy as computing
$Q$-values using out-of-distribution (OOD) actions will suffer from errors due
to distributional shift. The recently proposed \textit{In-sample Learning}
paradigm (i.e., IQL), which improves the policy by quantile regression using
only data samples, shows great promise because it learns an optimal policy
without querying the value function of any unseen actions. However, it remains
unclear how this type of method handles the distributional shift in learning
the value function. In this work, we make a key finding that the in-sample
learning paradigm arises under the \textit{Implicit Value Regularization} (IVR)
framework. This gives a deeper understanding of why the in-sample learning
paradigm works, i.e., it applies implicit value regularization to the policy.
Based on the IVR framework, we further propose two practical algorithms, Sparse
$Q$-learning (SQL) and Exponential $Q$-learning (EQL), which adopt the same
value regularization used in existing works, but in a complete in-sample
manner. Compared with IQL, we find that our algorithms introduce sparsity in
learning the value function, making them more robust in noisy data regimes. We
also verify the effectiveness of SQL and EQL on D4RL benchmark datasets and
show the benefits of in-sample learning by comparing them with CQL in small
data regimes.
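To make the in-sample idea concrete, here is a minimal sketch of a value update in the style of IQL's expectile regression, in which both Q and V are evaluated only at state-action pairs drawn from the dataset, so no OOD action is ever queried. This is an illustrative sketch under assumptions, not the paper's SQL/EQL objective: the function names, the expectile parameter tau, and the PyTorch module interfaces are all choices made here for exposition.

```python
# Minimal sketch (assumptions: PyTorch, an expectile parameter tau, and user-supplied
# q_target / v_net modules) of in-sample value learning in the style of IQL.
# The point illustrated: V is regressed toward Q using only (s, a) pairs from the
# dataset, so no out-of-distribution action is ever evaluated.
import torch
import torch.nn as nn


def expectile_loss(diff: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff^2 (IQL-style expectile)."""
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()


def in_sample_value_step(v_net: nn.Module,
                         q_target: nn.Module,
                         states: torch.Tensor,
                         actions: torch.Tensor,
                         optimizer: torch.optim.Optimizer,
                         tau: float = 0.7) -> float:
    """One value-function update that queries Q and V only at dataset samples."""
    with torch.no_grad():
        q_sa = q_target(states, actions)  # Q evaluated only at in-dataset actions
    v_s = v_net(states)
    loss = expectile_loss(q_sa - v_s, tau)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

SQL and EQL, as described in the abstract, replace this asymmetric-squared weighting with value regularizers derived from the IVR framework (which introduces sparsity in learning the value function), but the in-sample structure of never querying Q at unseen actions is the same.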
Related papers
- AlignIQL: Policy Alignment in Implicit Q-Learning through Constrained Optimization [9.050431569438636]
Implicit Q-learning serves as a strong baseline for offline RL.
We introduce a different way to solve the implicit policy-finding problem (IPF) by formulating it as a constrained optimization problem.
Compared with IQL and IDQL, we find our method keeps the simplicity of IQL and solves the implicit policy-finding problem.
arXiv Detail & Related papers (2024-05-28T14:01:03Z) - Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline
Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z) - IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion
Policies [72.4573167739712]
Implicit Q-learning (IQL) trains a Q-function using only dataset actions through a modified Bellman backup.
It is unclear which policy actually attains the values represented by this trained Q-function.
We introduce Implicit Diffusion Q-learning (IDQL), combining our general IQL critic with the policy extraction method.
arXiv Detail & Related papers (2023-04-20T18:04:09Z) - Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z) - BRAC+: Improved Behavior Regularized Actor Critic for Offline
Reinforcement Learning [14.432131909590824]
Offline Reinforcement Learning aims to train effective policies using previously collected datasets.
Standard off-policy RL algorithms are prone to overestimations of the values of out-of-distribution (less explored) actions.
We improve the behavior regularized offline reinforcement learning and propose BRAC+.
arXiv Detail & Related papers (2021-10-02T23:55:49Z) - Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
arXiv Detail & Related papers (2020-06-08T17:53:42Z)