Revisiting Peng's Q($\lambda$) for Modern Reinforcement Learning
- URL: http://arxiv.org/abs/2103.00107v1
- Date: Sat, 27 Feb 2021 02:29:01 GMT
- Title: Revisiting Peng's Q($\lambda$) for Modern Reinforcement Learning
- Authors: Tadashi Kozuno, Yunhao Tang, Mark Rowland, Rémi Munos, Steven
Kapturowski, Will Dabney, Michal Valko, David Abel
- Abstract summary: Off-policy multi-step reinforcement learning algorithms fall into two classes: conservative algorithms, which actively cut traces, and non-conservative algorithms, which do not.
Recent studies have shown that non-conservative algorithms empirically outperform conservative ones.
- Score: 69.39357308375212
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Off-policy multi-step reinforcement learning algorithms fall into two
classes, conservative and non-conservative: the former actively cut traces,
whereas the latter do not. Recently, Munos et al. (2016) proved the convergence
of conservative algorithms to an optimal Q-function. In contrast,
non-conservative algorithms are thought to be unsafe and have limited or no
theoretical guarantees. Nonetheless, recent studies have shown that
non-conservative algorithms empirically outperform conservative ones. Motivated
by the empirical results and the lack of theory, we carry out theoretical
analyses of Peng's Q($\lambda$), a representative example of non-conservative
algorithms. We prove that it also converges to an optimal policy provided that
the behavior policy slowly tracks a greedy policy in a way similar to
conservative policy iteration. Such a result has been conjectured to be true
but has not been proven. We also experiment with Peng's Q($\lambda$) in complex
continuous control tasks, confirming that Peng's Q($\lambda$) often outperforms
conservative algorithms despite its simplicity. These results indicate that
Peng's Q($\lambda$), which was thought to be unsafe, is a theoretically-sound
and practically effective algorithm.
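To make the object of study concrete, below is a minimal NumPy sketch of the classic Peng's Q($\lambda$) target: a mixture of uncorrected n-step returns bootstrapped with a greedy value, computed by a backward recursion over a sampled trajectory. The function name, the greedy (max) bootstrap, and the replay-style inputs are illustrative assumptions; the paper's exact operator, and the condition that the behavior policy slowly tracks a greedy policy, are spelled out in the paper itself.

```python
import numpy as np

def pengs_q_lambda_targets(rewards, next_q_values, dones, gamma=0.99, lam=0.7):
    """Sketch of classic Peng's Q(lambda) targets for one sampled trajectory.

    rewards:       shape [T], reward r_t of transition t
    next_q_values: shape [T, A], critic estimates Q(s_{t+1}, .)
    dones:         shape [T], True if s_{t+1} is terminal
    Returns shape [T] targets G_t from the backward recursion
        G_t = r_t + gamma * ((1 - lam) * max_a Q(s_{t+1}, a) + lam * G_{t+1}),
    a mixture of uncorrected n-step returns: traces are never cut for
    off-policy actions, which is what makes the algorithm non-conservative.
    """
    T = len(rewards)
    targets = np.zeros(T)
    greedy_bootstrap = np.max(next_q_values, axis=1)  # max_a Q(s_{t+1}, a)
    g_next = 0.0                                      # G_{t+1} beyond the last step
    for t in reversed(range(T)):
        boot = 0.0 if dones[t] else greedy_bootstrap[t]
        cont = 0.0 if dones[t] else g_next
        g_next = rewards[t] + gamma * ((1.0 - lam) * boot + lam * cont)
        targets[t] = g_next
    return targets
```

With lam = 0 this reduces to the one-step Q-learning target r_t + gamma * max_a Q(s_{t+1}, a); as lam approaches 1 it approaches the full uncorrected return of the behavior trajectory, which is the source of the long-standing safety concerns about non-conservative methods.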
Related papers
- Conservative Exploration for Policy Optimization via Off-Policy Policy
Evaluation [4.837737516460689]
We study the problem of conservative exploration, where the learner must guarantee that its performance is at least as good as that of a baseline policy.
We propose the first conservative provably efficient model-free algorithm for policy optimization in continuous finite-horizon problems.
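For readers unfamiliar with the term, one common formalization of the conservative constraint is sketched below; the exact constants and whether it is enforced per round or in expectation vary across papers, so treat this as an illustrative assumption rather than this paper's precise definition.

```latex
% Conservative exploration constraint (one common form; an illustrative
% assumption, not necessarily this paper's exact definition):
\[
  \sum_{i=1}^{t} J(\pi_i) \;\ge\; (1 - \alpha)\, t\, J(\pi_b)
  \qquad \text{for all rounds } t,
\]
% where $J(\cdot)$ is the expected return, $\pi_b$ is the known baseline policy,
% and $\alpha \in (0,1)$ is the fraction of baseline performance the learner is
% allowed to sacrifice while exploring.
```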
arXiv Detail & Related papers (2023-12-24T10:59:32Z)
- Bridging the Gap between Newton-Raphson Method and Regularized Policy
Iteration [13.166738075816493]
We show that regularized policy iteration is strictly equivalent to the standard Newton-Raphson method when the Bellman equation is smoothed with strongly convex functions.
We prove that regularized policy iteration converges globally and linearly at rate $\gamma$ (the discount factor).
We also show that a modified version of regularized policy iteration, i.e., with finite-step policy evaluation, is equivalent to the inexact Newton method, in which the Newton iteration formula is solved with truncated iterations.
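As a rough illustration of what smoothing the Bellman equation with a strongly convex function buys, the sketch below uses an entropic (log-sum-exp) smoothing; the particular regularizer and temperature are assumptions chosen for illustration, not a restatement of that paper's derivation.

```latex
% Entropy-smoothed (soft) Bellman operator with temperature $\tau > 0$:
\[
  (T_\tau V)(s) \;=\; \tau \log \sum_{a} \exp\!\Big( \tfrac{1}{\tau}\big( r(s,a)
      + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}[V(s')] \big) \Big),
\]
% a differentiable surrogate of the hard max. Writing the fixed point as a
% root-finding problem $F(V) = V - T_\tau V = 0$, a Newton-Raphson step reads
\[
  V_{k+1} \;=\; V_k - \big( \nabla F(V_k) \big)^{-1} F(V_k),
\]
% where $\nabla F(V_k) = I - \gamma P^{\pi_k}$ and $\pi_k$ is the softmax
% (regularized-greedy) policy of $V_k$; the step thus amounts to improving to
% $\pi_k$ and then fully evaluating it, i.e. one round of regularized policy
% iteration.
```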
arXiv Detail & Related papers (2023-10-11T05:55:20Z)
- Low-Switching Policy Gradient with Exploration via Online Sensitivity
Sampling [23.989009116398208]
We design a low-switching sample-efficient policy optimization algorithm, LPO, with general non-linear function approximation.
We show that our algorithm obtains an $\varepsilon$-optimal policy with only $\widetilde{O}(\mathrm{poly}(d)/\varepsilon^3)$ samples.
arXiv Detail & Related papers (2023-06-15T23:51:46Z)
- Revisiting Weighted Strategy for Non-stationary Parametric Bandits [82.1942459195896]
This paper revisits the weighted strategy for non-stationary parametric bandits.
We propose a refined analysis framework, which produces a simpler weight-based algorithm.
Our new framework can be used to improve regret bounds of other parametric bandits.
arXiv Detail & Related papers (2023-03-05T15:11:14Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Continuous-Time Fitted Value Iteration for Robust Policies [93.25997466553929]
Solving the Hamilton-Jacobi-Bellman equation is important in many domains including control, robotics and economics.
We propose continuous fitted value iteration (cFVI) and robust fitted value iteration (rFVI).
These algorithms leverage the non-linear control-affine dynamics and separable state and action reward of many continuous control problems.
arXiv Detail & Related papers (2021-10-05T11:33:37Z)
- Implicitly Regularized RL with Implicit Q-Values [42.87920755961722]
The $Q$-function is a central quantity in many reinforcement learning (RL) algorithms, in which agents act according to a (soft-)greedy policy with respect to it.
We propose to parametrize the $Q$-function implicitly, as the sum of a log-policy and of a value function.
We derive a practical off-policy deep RL algorithm, suitable for large action spaces, and that enforces the softmax relation between the policy and the $Q$-value.
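The parametrization in that summary is easy to state concretely; the snippet below is a small NumPy illustration (the scale alpha, the function name, and the exact form Q = alpha * log pi + V are assumptions consistent with the summary, not necessarily the paper's precise construction), showing that the softmax relation between the policy and the $Q$-values then holds by construction.

```python
import numpy as np

def implicit_q(log_pi, value, alpha=1.0):
    """Illustrative implicit parametrization: Q(s, a) = alpha * log pi(a|s) + V(s).

    log_pi: shape [A], normalized log-probabilities of the policy at state s
    value:  scalar estimate V(s)
    Because V(s) is constant across actions, softmax(Q(s, .) / alpha) recovers
    pi(. | s) exactly, so the softmax relation is enforced structurally rather
    than through an auxiliary loss.
    """
    return alpha * log_pi + value

# Quick check of the softmax relation at a single (random) state.
rng = np.random.default_rng(0)
logits = rng.normal(size=4)
log_pi = logits - np.log(np.exp(logits).sum())       # normalized log-policy
q = implicit_q(log_pi, value=1.3, alpha=0.5)
pi_from_q = np.exp(q / 0.5) / np.exp(q / 0.5).sum()  # softmax of Q / alpha
assert np.allclose(pi_from_q, np.exp(log_pi))
```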
arXiv Detail & Related papers (2021-08-16T12:20:47Z)
- Improved Regret Bound and Experience Replay in Regularized Policy
Iteration [22.621710838468097]
We study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation.
We first show that the regret analysis of the Politex algorithm can be sharpened from $O(T^{3/4})$ to $O(\sqrt{T})$ under nearly identical assumptions.
Our result provides the first high-probability $O(\sqrt{T})$ regret bound for a computationally efficient algorithm in this setting.
arXiv Detail & Related papers (2021-02-25T00:55:07Z)
- Improved Algorithms for Conservative Exploration in Bandits [113.55554483194832]
We study the conservative learning problem in the contextual linear bandit setting and introduce a novel algorithm, the Conservative Constrained LinUCB (CLUCB2).
We derive regret bounds for CLUCB2 that match existing results and empirically show that it outperforms state-of-the-art conservative bandit algorithms in a number of synthetic and real-world problems.
arXiv Detail & Related papers (2020-02-08T19:35:01Z)
- Conservative Exploration in Reinforcement Learning [113.55554483194832]
We introduce the notion of conservative exploration for average reward and finite horizon problems.
We present two optimistic algorithms that guarantee (w.h.p.) that the conservative constraint is never violated during learning.
arXiv Detail & Related papers (2020-02-08T19:09:51Z)