Strategically Conservative Q-Learning
- URL: http://arxiv.org/abs/2406.04534v1
- Date: Thu, 6 Jun 2024 22:09:46 GMT
- Title: Strategically Conservative Q-Learning
- Authors: Yutaka Shimizu, Joey Hong, Sergey Levine, Masayoshi Tomizuka,
- Abstract summary: offline reinforcement learning (RL) is a compelling paradigm to extend RL's practical utility.
The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions.
We propose a novel framework called Strategically Conservative Q-Learning (SCQ) that distinguishes between OOD data that is easy and hard to estimate.
- Score: 89.17906766703763
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline reinforcement learning (RL) is a compelling paradigm to extend RL's practical utility by leveraging pre-collected, static datasets, thereby avoiding the limitations associated with collecting online interactions. The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions; doing so ineffectively will lead to policies that prefer OOD actions, which can lead to unexpected and potentially catastrophic results. Despite the variety of works proposed to address this issue, they tend to excessively suppress the value function in and around OOD regions, resulting in overly pessimistic value estimates. In this paper, we propose a novel framework called Strategically Conservative Q-Learning (SCQ) that distinguishes between OOD data that is easy and hard to estimate, ultimately resulting in less conservative value estimates. Our approach exploits the inherent strengths of neural networks to interpolate, while carefully navigating their limitations in extrapolation, to obtain pessimistic yet still property calibrated value estimates. Theoretical analysis also shows that the value function learned by SCQ is still conservative, but potentially much less so than that of Conservative Q-learning (CQL). Finally, extensive evaluation on the D4RL benchmark tasks shows our proposed method outperforms state-of-the-art methods. Our code is available through \url{https://github.com/purewater0901/SCQ}.
Related papers
- Conservative State Value Estimation for Offline Reinforcement Learning [36.416504941791224]
Conservative State Value Estimation (CSVE) learns conservative V-function via directly imposing penalty on OOD states.
We develop a practical actor-critic algorithm in which the critic does the conservative value estimation by additionally sampling and penalizing the states empharound the dataset.
We evaluate in classic continual control tasks of D4RL, showing that our method performs better than the conservative Q-function learning methods and is strongly competitive among recent SOTA methods.
arXiv Detail & Related papers (2023-02-14T08:13:55Z) - Mildly Conservative Q-Learning for Offline Reinforcement Learning [63.2183622958666]
offline reinforcement learning (RL) defines the task of learning from a static logged dataset without continually interacting with the environment.
Existing approaches, penalizing the unseen actions or regularizing with the behavior policy, are too pessimistic.
We propose Mildly Conservative Q-learning (MCQ), where OOD actions are actively trained by assigning them proper pseudo Q values.
arXiv Detail & Related papers (2022-06-09T19:44:35Z) - RORL: Robust Offline Reinforcement Learning via Conservative Smoothing [72.8062448549897]
offline reinforcement learning can exploit the massive amount of offline data for complex decision-making tasks.
Current offline RL algorithms are generally designed to be conservative for value estimation and action selection.
We propose Robust Offline Reinforcement Learning (RORL) with a novel conservative smoothing technique.
arXiv Detail & Related papers (2022-06-06T18:07:41Z) - Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement
Learning [125.8224674893018]
Offline Reinforcement Learning (RL) aims to learn policies from previously collected datasets without exploring the environment.
Applying off-policy algorithms to offline RL usually fails due to the extrapolation error caused by the out-of-distribution (OOD) actions.
We propose Pessimistic Bootstrapping for offline RL (PBRL), a purely uncertainty-driven offline algorithm without explicit policy constraints.
arXiv Detail & Related papers (2022-02-23T15:27:16Z) - Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z) - Uncertainty-Based Offline Reinforcement Learning with Diversified
Q-Ensemble [16.92791301062903]
We propose an uncertainty-based offline RL method that takes into account the confidence of the Q-value prediction and does not require any estimation or sampling of the data distribution.
Surprisingly, we find that it is possible to substantially outperform existing offline RL methods on various tasks by simply increasing the number of Q-networks along with the clipped Q-learning.
arXiv Detail & Related papers (2021-10-04T16:40:13Z) - BRAC+: Improved Behavior Regularized Actor Critic for Offline
Reinforcement Learning [14.432131909590824]
Offline Reinforcement Learning aims to train effective policies using previously collected datasets.
Standard off-policy RL algorithms are prone to overestimations of the values of out-of-distribution (less explored) actions.
We improve the behavior regularized offline reinforcement learning and propose BRAC+.
arXiv Detail & Related papers (2021-10-02T23:55:49Z) - Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.