Mildly Conservative Q-Learning for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2206.04745v3
- Date: Wed, 21 Feb 2024 05:55:48 GMT
- Title: Mildly Conservative Q-Learning for Offline Reinforcement Learning
- Authors: Jiafei Lyu, Xiaoteng Ma, Xiu Li, Zongqing Lu
- Abstract summary: Offline reinforcement learning (RL) defines the task of learning from a static logged dataset without continually interacting with the environment.
Existing approaches, which penalize unseen actions or regularize toward the behavior policy, are too pessimistic.
We propose Mildly Conservative Q-learning (MCQ), where OOD actions are actively trained by assigning them proper pseudo Q values.
- Score: 63.2183622958666
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline reinforcement learning (RL) defines the task of learning from a
static logged dataset without continually interacting with the environment. The
distribution shift between the learned policy and the behavior policy makes it
necessary for the value function to stay conservative such that
out-of-distribution (OOD) actions will not be severely overestimated. However,
existing approaches, which penalize unseen actions or regularize toward the
behavior policy, are too pessimistic: they suppress the generalization of the
value function and hinder performance improvement. This paper explores mild yet
sufficient conservatism for offline learning that does not harm generalization.
We propose Mildly Conservative Q-learning (MCQ), where OOD
actions are actively trained by assigning them proper pseudo Q values. We
theoretically show that MCQ induces a policy that behaves at least as well as
the behavior policy and no erroneous overestimation will occur for OOD actions.
Experimental results on the D4RL benchmarks demonstrate that MCQ achieves
remarkable performance compared with prior work. Furthermore, MCQ shows
superior generalization ability when transferring from offline to online, and
significantly outperforms baselines. Our code is publicly available at
https://github.com/dmksjfl/MCQ.
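The abstract's central mechanism, actively training OOD actions toward pseudo Q values derived from in-distribution behavior, can be illustrated with a short PyTorch-style sketch. This is a schematic reading of the abstract, not the authors' implementation: the network interfaces, the `behavior_sampler` helper, and the weighting `lam` are all assumptions; the actual code is in the linked repository.

```python
import torch
import torch.nn.functional as F

def mcq_style_loss(q_net, q_target_net, policy, behavior_sampler,
                   batch, gamma=0.99, lam=0.9, num_sampled=10):
    """Schematic critic loss combining a standard TD term with a
    pseudo-target term for OOD actions, in the spirit of the abstract.
    `behavior_sampler(s, n)` is an assumed helper (e.g. a fitted generative
    model of the dataset) returning n candidate in-distribution actions."""
    s, a, r, s2, done = (batch["obs"], batch["act"], batch["rew"],
                         batch["next_obs"], batch["done"])

    # Standard TD loss on in-dataset transitions.
    with torch.no_grad():
        a2 = policy(s2)
        td_target = r + gamma * (1.0 - done) * q_target_net(s2, a2)
    td_loss = F.mse_loss(q_net(s, a), td_target)

    # Pseudo target for OOD actions: the best Q value among actions
    # sampled near the behavior policy's support.
    with torch.no_grad():
        cand = behavior_sampler(s, num_sampled)             # (B, N, act_dim)
        s_rep = s.unsqueeze(1).expand(-1, num_sampled, -1)  # (B, N, obs_dim)
        pseudo_target = q_target_net(s_rep, cand).max(dim=1).values  # (B,)
    a_ood = policy(s)                       # actions proposed by the learned policy
    ood_loss = F.mse_loss(q_net(s, a_ood), pseudo_target)

    # Weighted combination; lam trades off TD accuracy against mild conservatism.
    return lam * td_loss + (1.0 - lam) * ood_loss
```

Because the pseudo target is itself a Q value of an in-distribution action, OOD actions are neither ignored nor crushed toward arbitrarily low values, which is the "mild" part of the conservatism described above.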
Related papers
- Is Value Learning Really the Main Bottleneck in Offline RL? [70.54708989409409]
We show that the choice of a policy extraction algorithm significantly affects the performance and scalability of offline RL.
We propose two simple test-time policy improvement methods and show that these methods lead to better performance.
arXiv Detail & Related papers (2024-06-13T17:07:49Z)
- Strategically Conservative Q-Learning [89.17906766703763]
Offline reinforcement learning (RL) is a compelling paradigm to extend RL's practical utility.
The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions.
We propose a novel framework called Strategically Conservative Q-Learning (SCQ) that distinguishes between OOD data that is easy and hard to estimate.
arXiv Detail & Related papers (2024-06-06T22:09:46Z)
- Confidence-Conditioned Value Functions for Offline Reinforcement Learning [86.59173545987984]
We propose a new form of Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability.
We theoretically show that our learned value functions produce conservative estimates of the true value at any desired confidence.
arXiv Detail & Related papers (2022-12-08T23:56:47Z)
- Curriculum Offline Imitation Learning [72.1015201041391]
Offline reinforcement learning tasks require the agent to learn from a pre-collected dataset with no further interactions with the environment.
We propose Curriculum Offline Imitation Learning (COIL), which utilizes an experience picking strategy to imitate adaptive neighboring policies with higher returns.
On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids merely learning mediocre behavior on mixed datasets but is also competitive with state-of-the-art offline RL methods.
arXiv Detail & Related papers (2021-11-03T08:02:48Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization; a brief sketch of this in-dataset value-learning idea appears after this list.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- BRAC+: Improved Behavior Regularized Actor Critic for Offline Reinforcement Learning [14.432131909590824]
Offline Reinforcement Learning aims to train effective policies using previously collected datasets.
Standard off-policy RL algorithms are prone to overestimations of the values of out-of-distribution (less explored) actions.
We improve behavior-regularized offline reinforcement learning and propose BRAC+.
arXiv Detail & Related papers (2021-10-02T23:55:49Z)
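The Implicit Q-Learning entry above states that the method never evaluates actions outside of the dataset. A common way to achieve this is expectile regression on the value function, sketched below; the function names, batch layout, and hyperparameters are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss pushing V toward an upper expectile of Q."""
    weight = torch.where(diff > 0, tau, 1.0 - tau)
    return (weight * diff.pow(2)).mean()

def iql_style_losses(q_net, q_target_net, v_net, batch, gamma=0.99, tau=0.7):
    """Schematic value losses in the spirit of Implicit Q-Learning: every
    target is built from state-action pairs in the dataset, so no
    out-of-dataset action is ever evaluated."""
    s, a, r, s2, done = (batch["obs"], batch["act"], batch["rew"],
                         batch["next_obs"], batch["done"])

    # V(s) regresses toward an upper expectile of Q(s, a) over dataset actions.
    with torch.no_grad():
        q_sa = q_target_net(s, a)
    v_loss = expectile_loss(q_sa - v_net(s), tau)

    # Q(s, a) regresses toward r + gamma * V(s'), again using only dataset actions.
    with torch.no_grad():
        q_target = r + gamma * (1.0 - done) * v_net(s2)
    q_loss = F.mse_loss(q_net(s, a), q_target)

    return v_loss, q_loss
```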