Conservative Q-Learning for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2006.04779v3
- Date: Wed, 19 Aug 2020 17:07:05 GMT
- Title: Conservative Q-Learning for Offline Reinforcement Learning
- Authors: Aviral Kumar, Aurick Zhou, George Tucker, Sergey Levine
- Abstract summary: We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
- Score: 106.05582605650932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Effectively leveraging large, previously collected datasets in reinforcement
learning (RL) is a key challenge for large-scale real-world applications.
Offline RL algorithms promise to learn effective policies from
previously-collected, static datasets without further interaction. However, in
practice, offline RL presents a major challenge, and standard off-policy RL
methods can fail due to overestimation of values induced by the distributional
shift between the dataset and the learned policy, especially when training on
complex and multi-modal data distributions. In this paper, we propose
conservative Q-learning (CQL), which aims to address these limitations by
learning a conservative Q-function such that the expected value of a policy
under this Q-function lower-bounds its true value. We theoretically show that
CQL produces a lower bound on the value of the current policy and that it can
be incorporated into a policy learning procedure with theoretical improvement
guarantees. In practice, CQL augments the standard Bellman error objective with
a simple Q-value regularizer which is straightforward to implement on top of
existing deep Q-learning and actor-critic implementations. On both discrete and
continuous control domains, we show that CQL substantially outperforms existing
offline RL methods, often learning policies that attain 2-5 times higher final
return, especially when learning from complex and multi-modal data
distributions.
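To make the "simple Q-value regularizer" concrete, below is a minimal PyTorch sketch of a discrete-action, CQL(H)-style loss added on top of a standard DQN-style Bellman error. The names (q_net, target_q_net, batch) and the fixed trade-off weight alpha are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, batch, alpha=1.0, gamma=0.99):
    """Bellman error plus a CQL(H)-style conservative penalty (discrete actions).

    `batch` is assumed to hold tensors obs [B, ...], actions [B] (long),
    rewards [B], next_obs [B, ...], dones [B]; all names are illustrative.
    """
    obs, actions, rewards, next_obs, dones = batch

    q_values = q_net(obs)                                  # [B, num_actions]
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        next_q = target_q_net(next_obs).max(dim=1).values
        td_target = rewards + gamma * (1.0 - dones) * next_q

    bellman_error = F.mse_loss(q_taken, td_target)

    # Conservative term: push down a soft maximum of Q over all actions
    # (logsumexp) while pushing up Q on actions that appear in the dataset.
    conservative_penalty = (torch.logsumexp(q_values, dim=1) - q_taken).mean()

    return bellman_error + alpha * conservative_penalty
```

Penalizing the soft maximum over all actions while rewarding actions actually present in the dataset is the mechanism the abstract credits for producing conservative, lower-bounded value estimates.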
Related papers
- Strategically Conservative Q-Learning [89.17906766703763]
Offline reinforcement learning (RL) is a compelling paradigm for extending RL's practical utility.
The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions.
We propose a novel framework called Strategically Conservative Q-Learning (SCQ) that distinguishes between OOD data that is easy and hard to estimate.
arXiv Detail & Related papers (2024-06-06T22:09:46Z)
- Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z)
- Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning [68.16998247593209]
The offline reinforcement learning (RL) paradigm provides a recipe for converting static behavior datasets into policies that can outperform the policy that collected the data.
In this paper, we propose an adaptive scheme for action quantization.
We show that several state-of-the-art offline RL methods such as IQL, CQL, and BRAC improve in performance on benchmarks when combined with our proposed discretization scheme.
arXiv Detail & Related papers (2023-10-18T06:07:10Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization; a sketch of this in-sample value learning appears after this list.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- BRAC+: Improved Behavior Regularized Actor Critic for Offline Reinforcement Learning [14.432131909590824]
Offline Reinforcement Learning aims to train effective policies using previously collected datasets.
Standard off-policy RL algorithms are prone to overestimations of the values of out-of-distribution (less explored) actions.
We improve behavior-regularized offline reinforcement learning and propose BRAC+.
arXiv Detail & Related papers (2021-10-02T23:55:49Z)
- Constraints Penalized Q-Learning for Safe Offline Reinforcement Learning [15.841609263723575]
We study the problem of safe offline reinforcement learning (RL).
The goal is to learn a policy that maximizes long-term reward while satisfying safety constraints given only offline data, without further interaction with the environment.
We show that naive approaches that combine techniques from safe RL and offline RL can only learn sub-optimal solutions.
arXiv Detail & Related papers (2021-07-19T16:30:14Z)
- EMaQ: Expected-Max Q-Learning Operator for Simple Yet Effective Offline and Online RL [48.552287941528]
Off-policy reinforcement learning holds the promise of sample-efficient learning of decision-making policies.
In the offline RL setting, standard off-policy RL methods can significantly underperform.
We introduce the Expected-Max Q-Learning operator (EMaQ), which remains closely tied to the resulting practical algorithm.
arXiv Detail & Related papers (2020-07-21T21:13:02Z)
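The Implicit Q-Learning entry above never evaluates actions outside the dataset. The sketch below shows the expectile-based in-sample value losses that achieve this: V regresses toward an upper expectile of dataset Q-values, and Q regresses toward r + gamma * V(s'). Module and batch names (q_net, v_net, target_q_net, batch) and the expectile tau are illustrative assumptions, not the authors' code.

```python
import torch

def expectile_loss(diff, tau=0.7):
    """Asymmetric L2 loss: positive errors weighted by tau, negative by 1 - tau."""
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def iql_value_losses(q_net, v_net, target_q_net, batch, tau=0.7, gamma=0.99):
    """In-sample value learning: every target uses only dataset actions,
    so no out-of-dataset action is ever queried."""
    obs, actions, rewards, next_obs, dones = batch

    # V(s) is regressed toward an upper expectile of Q(s, a) over dataset actions.
    with torch.no_grad():
        q_data = target_q_net(obs, actions)
    v_loss = expectile_loss(q_data - v_net(obs), tau)

    # Q(s, a) is regressed toward r + gamma * V(s'); V stands in for the usual
    # maximization over (possibly unseen) actions.
    with torch.no_grad():
        td_target = rewards + gamma * (1.0 - dones) * v_net(next_obs)
    q_loss = torch.mean((q_net(obs, actions) - td_target).pow(2))

    return v_loss, q_loss
```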
This list is automatically generated from the titles and abstracts of the papers in this site.