A Perspective of Q-value Estimation on Offline-to-Online Reinforcement
Learning
- URL: http://arxiv.org/abs/2312.07685v1
- Date: Tue, 12 Dec 2023 19:24:35 GMT
- Title: A Perspective of Q-value Estimation on Offline-to-Online Reinforcement
Learning
- Authors: Yinmin Zhang, Jie Liu, Chuming Li, Yazhe Niu, Yaodong Yang, Yu Liu,
Wanli Ouyang
- Abstract summary: Offline-to-online Reinforcement Learning (O2O RL) aims to improve the performance of an offline pretrained policy using only a few online samples.
Most O2O methods focus on the balance between the RL objective and pessimism, or on the utilization of offline and online samples.
- Score: 54.48409201256968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline-to-online Reinforcement Learning (O2O RL) aims to improve the
performance of an offline pretrained policy using only a few online samples. Built
on offline RL algorithms, most O2O methods focus on the balance between the RL
objective and pessimism, or on the utilization of offline and online samples. In
this paper, from a novel perspective, we systematically study the challenges
that remain in O2O RL and identify that the reason behind the slow performance
improvement and the instability of online finetuning lies in the inaccurate
Q-value estimation inherited from offline pretraining. Specifically, we
demonstrate that the estimation bias and the inaccurate rank of Q-values produce
a misleading signal for the policy update, making standard offline RL
algorithms, such as CQL and TD3-BC, ineffective in online finetuning. Based
on this observation, we address the problem of Q-value estimation with two
techniques: (1) perturbed value updates and (2) an increased frequency of Q-value
updates. The first technique smooths out biased Q-value estimates with sharp
peaks, preventing early-stage policy exploitation of sub-optimal actions. The
second alleviates the estimation bias inherited from offline pretraining by
accelerating learning. Extensive experiments on the MuJoCo and Adroit
environments demonstrate that the proposed method, named SO2, significantly
alleviates Q-value estimation issues and consistently improves performance
over state-of-the-art methods by up to 83.1%.
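The abstract does not spell out the implementation, so the following is only a rough sketch of how the two techniques could look in a TD3-style twin-critic setup: clipped Gaussian noise is added to the target action (perturbed value update), and the critic is updated several times per environment step (increased frequency of Q-value updates). The function names, the twin-critic interface, and all hyperparameters (noise_std, noise_clip, num_q_updates) are illustrative assumptions, not the authors' released SO2 code.

```python
# Minimal sketch of the two techniques described in the abstract; interfaces
# and hyperparameters are assumptions for illustration, not the SO2 release.
import torch
import torch.nn.functional as F

def perturbed_target(q_target, policy_target, next_obs, reward, done,
                     gamma=0.99, noise_std=0.2, noise_clip=0.5):
    """Perturbed value update: add clipped noise to the target action so the
    bootstrapped target is smoothed over a neighborhood of actions, flattening
    sharp peaks in the Q-value estimate inherited from offline pretraining."""
    with torch.no_grad():
        next_action = policy_target(next_obs)
        noise = (torch.randn_like(next_action) * noise_std).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-1.0, 1.0)
        q1, q2 = q_target(next_obs, next_action)      # assumed twin target critics
        return reward + gamma * (1.0 - done) * torch.min(q1, q2)

def critic_updates(sample_batch, q_net, q_target, policy_target, q_optim,
                   num_q_updates=5):
    """Increased Q-value update frequency: run several critic updates per
    environment step (update-to-data ratio > 1) so the estimation bias
    inherited from offline pretraining is corrected faster during finetuning."""
    for _ in range(num_q_updates):
        obs, action, reward, next_obs, done = sample_batch()
        y = perturbed_target(q_target, policy_target, next_obs, reward, done)
        q1, q2 = q_net(obs, action)
        loss = F.mse_loss(q1, y) + F.mse_loss(q2, y)
        q_optim.zero_grad()
        loss.backward()
        q_optim.step()
```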
Related papers
- Strategically Conservative Q-Learning [89.17906766703763]
Offline reinforcement learning (RL) is a compelling paradigm to extend RL's practical utility.
The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions.
We propose a novel framework called Strategically Conservative Q-Learning (SCQ) that distinguishes between OOD data that is easy and hard to estimate.
arXiv Detail & Related papers (2024-06-06T22:09:46Z)
- Exclusively Penalized Q-learning for Offline Reinforcement Learning [4.916646834691489]
Constraint-based offline reinforcement learning (RL) imposes policy constraints or penalties on the value function to mitigate overestimation errors caused by distributional shift.
This paper focuses on a limitation of existing offline RL methods with penalized value functions, namely the potential for underestimation caused by unnecessary bias introduced into the value function.
We propose Exclusively Penalized Q-learning (EPQ), which reduces estimation bias in the value function by selectively penalizing states that are prone to inducing estimation errors.
arXiv Detail & Related papers (2024-05-23T01:06:05Z)
- Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of Q-network at training.
For the first time, our theory can reliably decide whether the training will diverge at an early stage.
arXiv Detail & Related papers (2023-10-06T17:57:44Z)
- Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning [125.8224674893018]
Offline Reinforcement Learning (RL) aims to learn policies from previously collected datasets without exploring the environment.
Applying off-policy algorithms to offline RL usually fails due to the extrapolation error caused by the out-of-distribution (OOD) actions.
We propose Pessimistic Bootstrapping for offline RL (PBRL), a purely uncertainty-driven offline algorithm without explicit policy constraints.
arXiv Detail & Related papers (2022-02-23T15:27:16Z)
- Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble [16.92791301062903]
We propose an uncertainty-based offline RL method that takes into account the confidence of the Q-value prediction and does not require any estimation or sampling of the data distribution.
Surprisingly, we find that it is possible to substantially outperform existing offline RL methods on various tasks by simply increasing the number of Q-networks along with the clipped Q-learning (a minimal sketch of this clipped-ensemble target appears after this list).
arXiv Detail & Related papers (2021-10-04T16:40:13Z)
- BRAC+: Improved Behavior Regularized Actor Critic for Offline Reinforcement Learning [14.432131909590824]
Offline Reinforcement Learning aims to train effective policies using previously collected datasets.
Standard off-policy RL algorithms are prone to overestimating the values of out-of-distribution (less explored) actions.
We improve behavior-regularized offline reinforcement learning and propose BRAC+.
arXiv Detail & Related papers (2021-10-02T23:55:49Z)
- Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning [63.53407136812255]
Offline Reinforcement Learning promises to learn effective policies from previously-collected, static datasets without the need for exploration.
Existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states.
We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly.
arXiv Detail & Related papers (2021-05-17T20:16:46Z)
- Cross Learning in Deep Q-Networks [82.20059754270302]
We propose a novel cross Q-learning algorithm aimed at alleviating the well-known overestimation problem in value-based reinforcement learning methods.
Our algorithm builds on double Q-learning by maintaining a set of parallel models and estimating the Q-value based on a randomly selected network.
arXiv Detail & Related papers (2020-09-29T04:58:17Z)
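For the Diversified Q-Ensemble entry above, the summary describes clipped Q-learning with an enlarged set of Q-networks; the sketch below illustrates that target computation under assumed interfaces (a list of target critics and a policy callable) and omits the entropy term and ensemble-diversification details of the actual method.

```python
# Minimal sketch of a clipped Q-learning target over an ensemble of N critics,
# as summarized for the Diversified Q-Ensemble entry; interfaces are assumptions.
import torch

def ensemble_clipped_target(q_targets, policy, next_obs, reward, done, gamma=0.99):
    """Pessimistic Bellman target: the minimum over all N target critics.
    A larger ensemble yields a lower (more conservative) target for uncertain,
    out-of-distribution actions, which is the effect the summary describes."""
    with torch.no_grad():
        next_action = policy(next_obs)                                  # actor's proposal
        qs = torch.stack([q(next_obs, next_action) for q in q_targets], dim=0)
        return reward + gamma * (1.0 - done) * qs.min(dim=0).values     # clip over N nets
```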