Online Target Q-learning with Reverse Experience Replay: Efficiently
finding the Optimal Policy for Linear MDPs
- URL: http://arxiv.org/abs/2110.08440v2
- Date: Tue, 19 Oct 2021 17:35:59 GMT
- Title: Online Target Q-learning with Reverse Experience Replay: Efficiently
finding the Optimal Policy for Linear MDPs
- Authors: Naman Agarwal, Syomantak Chaudhuri, Prateek Jain, Dheeraj Nagaraj,
Praneeth Netrapalli
- Abstract summary: We bridge the gap between practical success of Q-learning and pessimistic theoretical results.
We present novel methods Q-Rex and Q-RexDaRe.
We show that Q-Rex efficiently finds the optimal policy for linear MDPs.
- Score: 50.75812033462294
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Q-learning is a popular Reinforcement Learning (RL) algorithm which is widely
used in practice with function approximation (Mnih et al., 2015). In contrast,
existing theoretical results are pessimistic about Q-learning. For example,
(Baird, 1995) shows that Q-learning does not converge even with linear function
approximation for linear MDPs. Furthermore, even for tabular MDPs with
synchronous updates, Q-learning was shown to have sub-optimal sample complexity
(Li et al., 2021; Azar et al., 2013). The goal of this work is to bridge the gap
between the practical success of Q-learning and the relatively pessimistic
theoretical results. The starting point of our work is the observation that, in
practice, Q-learning is used with two important modifications: (i) training
simultaneously with two networks, called the online network and the target
network (online target learning, or OTL), and (ii) experience replay (ER) (Mnih et
al., 2015). While these have been observed to play a significant role in the
practical success of Q-learning, a thorough theoretical understanding of how
these two modifications improve the convergence behavior of Q-learning has been
missing in the literature. By carefully combining Q-learning with OTL and reverse
experience replay (RER) (a form of experience replay), we present novel methods
Q-Rex and Q-RexDaRe (Q-Rex + data reuse). We show that Q-Rex efficiently finds
the optimal policy for linear MDPs (or more generally for MDPs with zero
inherent Bellman error with linear approximation (ZIBEL)) and provide
non-asymptotic bounds on sample complexity -- the first such result for a
Q-learning method for this class of MDPs under standard assumptions.
Furthermore, we demonstrate that Q-RexDaRe in fact achieves near-optimal sample
complexity in the tabular setting, improving upon the existing results for
vanilla Q-learning.
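To make the two modifications concrete, the following is a minimal sketch of Q-learning with online target learning and reverse experience replay under linear function approximation. The function names, hyperparameters, epsilon-greedy exploration, and the simplified env.step(a) -> (next_state, reward, done) interface are illustrative assumptions, not the paper's exact Q-Rex pseudocode; the essential structure is that bootstrap targets are computed only from the frozen target parameters, and each collected buffer is replayed in reverse temporal order before the target is synced.

```python
# Minimal sketch of Q-learning with an online/target split (OTL) and reverse
# experience replay (RER) under linear function approximation.
# Hyperparameters, exploration scheme, and the environment interface are
# illustrative assumptions, not the paper's exact Q-Rex specification.
import numpy as np


def q_learning_otl_rer(env, phi, num_actions, d, num_phases=100,
                       buffer_size=64, lr=0.1, gamma=0.99, eps=0.1, seed=0):
    """phi(s, a) -> np.ndarray of shape (d,); env.step(a) -> (s_next, r, done)."""
    rng = np.random.default_rng(seed)
    w_online = np.zeros(d)   # updated at every replayed transition
    w_target = np.zeros(d)   # frozen within a phase, used for bootstrap targets
    s = env.reset()
    for _ in range(num_phases):
        # 1) Collect a buffer of consecutive transitions (epsilon-greedy
        #    with respect to the online parameters, as an assumption).
        buffer = []
        for _ in range(buffer_size):
            if rng.random() < eps:
                a = int(rng.integers(num_actions))
            else:
                a = int(np.argmax([phi(s, b) @ w_online
                                   for b in range(num_actions)]))
            s_next, r, done = env.step(a)
            buffer.append((s, a, r, s_next, done))
            s = env.reset() if done else s_next
        # 2) Replay the buffer in REVERSE temporal order; targets are computed
        #    only from the frozen target parameters (online target learning).
        for (si, ai, ri, si_next, di) in reversed(buffer):
            q_next = 0.0 if di else max(phi(si_next, b) @ w_target
                                        for b in range(num_actions))
            td_error = ri + gamma * q_next - phi(si, ai) @ w_online
            w_online = w_online + lr * td_error * phi(si, ai)
        # 3) Sync the target parameters at the end of the phase.
        w_target = w_online.copy()
    return w_online
```

Q-RexDaRe, as described in the abstract, additionally reuses the same data across phases (Q-Rex + data reuse); that variation is not shown here.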
Related papers
- VerifierQ: Enhancing LLM Test Time Compute with Q-Learning-based Verifiers [7.7705926659081275]
VerifierQ is a novel approach that integrates Offline Q-learning into verifier models.
We address three key challenges in applying Q-learning to LLMs.
Our method enables parallel Q-value computation and improves training efficiency.
arXiv Detail & Related papers (2024-10-10T15:43:55Z)
- Q-value Regularized Transformer for Offline Reinforcement Learning [70.13643741130899]
We propose a Q-value regularized Transformer (QT) to enhance the state-of-the-art in offline reinforcement learning (RL).
QT learns an action-value function and integrates a term maximizing action-values into the training loss of Conditional Sequence Modeling (CSM).
Empirical evaluations on D4RL benchmark datasets demonstrate the superiority of QT over traditional DP and CSM methods.
arXiv Detail & Related papers (2024-05-27T12:12:39Z)
- Offline Q-Learning on Diverse Multi-Task Data Both Scales And Generalizes [100.69714600180895]
Offline Q-learning algorithms exhibit strong performance that scales with model capacity.
We train a single policy on 40 games with near-human performance using up to 80-million-parameter networks.
Compared to return-conditioned supervised approaches, offline Q-learning scales similarly with model capacity and has better performance, especially when the dataset is suboptimal.
arXiv Detail & Related papers (2022-11-28T08:56:42Z)
- Sufficient Exploration for Convex Q-learning [10.75319149461189]
This paper builds on Manne's linear programming (LP) formulation of optimal control.
A primal version is called logistic Q-learning, and a dual variant is convex Q-learning.
It is shown that convex Q-learning is successful in cases where standard Q-learning diverges.
arXiv Detail & Related papers (2022-10-17T20:22:12Z)
- Convergence Results For Q-Learning With Experience Replay [51.11953997546418]
We provide a convergence rate guarantee, and discuss how it compares to the convergence of Q-learning depending on important parameters such as the frequency and number of iterations of replay.
We also provide theoretical evidence showing when we might expect this to strictly improve performance, by introducing and analyzing a simple class of MDPs.
arXiv Detail & Related papers (2021-12-08T10:22:49Z)
- Self-correcting Q-Learning [14.178899938667161]
We introduce a new way to address the bias in the form of a "self-correcting algorithm".
Applying this strategy to Q-learning results in Self-correcting Q-learning.
We show theoretically that this new algorithm enjoys the same convergence guarantees as Q-learning while being more accurate.
arXiv Detail & Related papers (2020-12-02T11:36:24Z)
- Cross Learning in Deep Q-Networks [82.20059754270302]
We propose a novel cross Q-learning algorithm aimed at alleviating the well-known overestimation problem in value-based reinforcement learning methods.
Our algorithm builds on double Q-learning by maintaining a set of parallel models and estimating the Q-value based on a randomly selected network (a minimal illustrative sketch follows this entry).
arXiv Detail & Related papers (2020-09-29T04:58:17Z)
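As a companion to the Cross Learning in Deep Q-Networks entry above, here is a tabular sketch of the ensemble idea it describes: several Q-estimates are maintained, and the bootstrap target for the one being updated is taken from a randomly selected member, which tempers the overestimation introduced by the max operator. The tabular setting, ensemble size, and update details are assumptions for illustration, not the paper's exact algorithm.

```python
# Tabular sketch of the ensemble idea behind cross Q-learning: the bootstrap
# target for the Q-table being updated comes from a randomly selected member
# of the ensemble.  Ensemble size and the tabular setting are illustrative
# assumptions, not the paper's exact algorithm.
import numpy as np


def cross_q_update(q_tables, s, a, r, s_next, done,
                   gamma=0.99, lr=0.1, rng=None):
    """q_tables: list of np.ndarray, each of shape (num_states, num_actions)."""
    rng = rng if rng is not None else np.random.default_rng()
    i = int(rng.integers(len(q_tables)))   # table receiving the update
    j = int(rng.integers(len(q_tables)))   # table supplying the target value
    target = r if done else r + gamma * np.max(q_tables[j][s_next])
    q_tables[i][s, a] += lr * (target - q_tables[i][s, a])
```

In a full agent this update would be applied to each observed transition, with the random choice of tables refreshed every time.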
- Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
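For the Conservative Q-Learning entry above, the lower-bound property comes from a conservative regularizer added to the usual Bellman error. The sketch below shows one common instantiation of such a loss (the entropy-regularized CQL(H) form) for a batch of Q-value estimates; the array-based interface and the omission of numerical stabilization are simplifications for illustration, not the paper's exact implementation.

```python
# Sketch of a conservative Q-learning loss in the style of CQL(H): push down a
# soft maximum of the Q-values over all actions, push up the Q-values of the
# dataset actions, and add the usual Bellman error.  The interface and
# constants here are simplified assumptions for illustration.
import numpy as np


def cql_loss(q_values, data_actions, td_targets, alpha=1.0):
    """q_values: (batch, num_actions) Q-estimates at the batch states.
       data_actions: (batch,) integer actions from the offline dataset.
       td_targets: (batch,) bootstrapped Bellman targets."""
    idx = np.arange(len(data_actions))
    q_data = q_values[idx, data_actions]
    # Conservative regularizer: soft maximum over actions minus dataset action.
    soft_max = np.log(np.exp(q_values).sum(axis=1))
    conservative = (soft_max - q_data).mean()
    # Standard TD / Bellman error on the dataset transitions.
    bellman = ((q_data - td_targets) ** 2).mean()
    return alpha * conservative + bellman
```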