Sample Efficient Reinforcement Learning with REINFORCE
- URL: http://arxiv.org/abs/2010.11364v2
- Date: Thu, 24 Dec 2020 18:22:03 GMT
- Title: Sample Efficient Reinforcement Learning with REINFORCE
- Authors: Junzi Zhang, Jongho Kim, Brendan O'Donoghue, Stephen Boyd
- Abstract summary: We consider classical policy gradient methods and the widely-used REINFORCE estimation procedure.
By controlling the number of "bad" episodes, we establish an anytime sub-linear high probability regret bound as well as almost sure global convergence of the average regret with an asymptotically sub-linear rate.
These provide the first set of global convergence and sample efficiency results for the well-known REINFORCE algorithm and contribute to a better understanding of its performance in practice.
- Score: 10.884278019498588
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy gradient methods are among the most effective methods for large-scale
reinforcement learning, and their empirical success has prompted several works
that develop the foundation of their global convergence theory. However, prior
works have either required exact gradients or state-action visitation measure
based mini-batch stochastic gradients with a diverging batch size, which limit
their applicability in practical scenarios. In this paper, we consider
classical policy gradient methods that compute an approximate gradient with a
single trajectory or a fixed size mini-batch of trajectories under soft-max
parametrization and log-barrier regularization, along with the widely-used
REINFORCE gradient estimation procedure. By controlling the number of "bad"
episodes and resorting to the classical doubling trick, we establish an anytime
sub-linear high probability regret bound as well as almost sure global
convergence of the average regret with an asymptotically sub-linear rate. These
provide the first set of global convergence and sample efficiency results for
the well-known REINFORCE algorithm and contribute to a better understanding of
its performance in practice.
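For intuition, the following is a minimal tabular sketch in NumPy of the single-trajectory REINFORCE estimator under soft-max parametrization, together with a log-barrier regularizer of the kind described above. It is not the authors' implementation: the function names, the discount factor, and the step size and regularization weight in the commented usage line are illustrative assumptions, and the exact weighting of the regularizer in the paper may differ.

```python
import numpy as np

def softmax_policy(theta, s):
    """Soft-max parametrized policy: pi_theta(a|s) proportional to exp(theta[s, a])."""
    z = theta[s] - theta[s].max()      # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def reinforce_gradient(theta, trajectory, gamma=0.99):
    """Single-trajectory REINFORCE estimate of the policy gradient.
    trajectory: list of (state, action, reward) tuples from one episode."""
    # discounted return-to-go G_t at every step, accumulated backwards
    returns, G = [], 0.0
    for _, _, r in reversed(trajectory):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    grad = np.zeros_like(theta)
    n_actions = theta.shape[1]
    for (s, a, _), G_t in zip(trajectory, returns):
        pi = softmax_policy(theta, s)
        # soft-max score function: d log pi(a|s) / d theta[s, :] = e_a - pi(.|s)
        grad[s] += (np.eye(n_actions)[a] - pi) * G_t
    return grad

def log_barrier_gradient(theta, lam):
    """Exact gradient of the log-barrier regularizer
    (lam / (S * A)) * sum_{s, a} log pi_theta(a|s)."""
    S, A = theta.shape
    grad = np.zeros_like(theta)
    for s in range(S):
        pi = softmax_policy(theta, s)
        # d/d theta[s, a'] of sum_a log pi(a|s) equals 1 - A * pi(a'|s)
        grad[s] = lam / (S * A) * (1.0 - A * pi)
    return grad

# One stochastic ascent step on the regularized objective
# (theta, trajectory, eta, and lam are placeholders for this sketch):
# theta += eta * (reinforce_gradient(theta, trajectory) + log_barrier_gradient(theta, lam))
```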
Related papers
- AdaGrad under Anisotropic Smoothness [10.995979046710893]
We propose a novel anisotropic generalized smoothness assumption and provide corresponding analyses of AdaGrad.
It is shown that, under anisotropic smoothness and noise conditions, AdaGrad achieves faster convergence guarantees with better dependence on the problem dimension.
arXiv Detail & Related papers (2024-06-21T15:29:31Z) - Almost sure convergence rates of stochastic gradient methods under gradient domination [2.96614015844317]
Global and local gradient domination properties have shown to be a more realistic replacement of strong convexity.
We prove almost sure convergence rates $f(X_n)-f^*\in o\big(n^{-\frac{1}{4\beta-1}+\epsilon}\big)$ of the last iterate for gradient descent.
We show how to apply our results to the training task in both supervised and reinforcement learning.
arXiv Detail & Related papers (2024-05-22T12:40:57Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}\big(\ln(T)/T^{1-\frac{1}{\alpha}}\big)$.
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - Optimization Landscape of Policy Gradient Methods for Discrete-time
Static Output Feedback [22.21598324895312]
This paper analyzes the optimization landscape inherent to policy gradient methods when applied to static output feedback control.
We derive novel findings regarding convergence (and nearly dimension-free rate) to stationary points for three policy gradient methods.
We provide proof that the vanilla policy gradient method exhibits linear convergence towards local minima when near such minima.
arXiv Detail & Related papers (2023-10-29T14:25:57Z) - Bag of Tricks for Natural Policy Gradient Reinforcement Learning [87.54231228860495]
We have implemented and compared strategies that impact performance in natural policy gradient reinforcement learning.
The proposed collection of strategies for performance optimization can improve results by 86% to 181% across the MuJoCo control benchmark.
arXiv Detail & Related papers (2022-01-22T17:44:19Z) - Beyond Exact Gradients: Convergence of Stochastic Soft-Max Policy Gradient Methods with Entropy Regularization [20.651913793555163]
We revisit the classical entropy regularized policy gradient methods with the soft-max policy parametrization.
We establish a global optimality convergence result and a sample complexity of $\widetilde{\mathcal{O}}(\frac{1}{\epsilon^{2}})$ for the proposed algorithm (the standard form of the entropy-regularized objective behind such methods is sketched after this list).
arXiv Detail & Related papers (2021-10-19T17:21:09Z) - Deep learning: a statistical viewpoint [120.94133818355645]
Deep learning has revealed some major surprises from a theoretical perspective.
In particular, simple gradient methods easily find near-optimal solutions to non-convex training problems.
We conjecture that specific principles underlie these phenomena.
arXiv Detail & Related papers (2021-03-16T16:26:36Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work is on the iteration complexity of zeroth-order (ZO) optimization, which does not require first-order gradient information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of iteration complexity and function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - A Nonparametric Off-Policy Policy Gradient [32.35604597324448]
Reinforcement learning (RL) algorithms still suffer from high sample complexity despite outstanding recent successes.
We build on the general sample efficiency of off-policy algorithms.
We show that our approach has better sample efficiency than state-of-the-art policy gradient methods.
arXiv Detail & Related papers (2020-01-08T10:13:08Z)
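As a point of reference for the entropy-regularized soft-max policy gradient entry above, the objective such methods typically maximize has the standard form below, where $\tau$ is the regularization temperature; the exact weighting used in that paper may differ.

$$
J_\tau(\theta) \;=\; \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^{t}\big(r(s_t, a_t) + \tau\,\mathcal{H}\big(\pi_\theta(\cdot \mid s_t)\big)\big)\Big],
\qquad
\mathcal{H}(p) \;=\; -\sum_{a} p(a)\log p(a).
$$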
This list is automatically generated from the titles and abstracts of the papers in this site.