On the Convergence and Sample Efficiency of Variance-Reduced Policy
Gradient Method
- URL: http://arxiv.org/abs/2102.08607v1
- Date: Wed, 17 Feb 2021 07:06:19 GMT
- Title: On the Convergence and Sample Efficiency of Variance-Reduced Policy
Gradient Method
- Authors: Junyu Zhang, Chengzhuo Ni, Zheng Yu, Csaba Szepesvari, Mengdi Wang
- Abstract summary: Policy gradient gives rise to a rich class of reinforcement learning (RL) methods, for example REINFORCE.
Yet the best known sample complexity result for such methods to find an $\epsilon$-optimal policy is $\mathcal{O}(\epsilon^{-3})$, which is suboptimal.
We study the fundamental convergence properties and sample efficiency of first-order policy optimization methods.
- Score: 38.34416337932712
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Policy gradient gives rise to a rich class of reinforcement learning (RL)
methods, for example REINFORCE. Yet the best known sample complexity result
for such methods to find an $\epsilon$-optimal policy is
$\mathcal{O}(\epsilon^{-3})$, which is suboptimal. In this paper, we study the
fundamental convergence properties and sample efficiency of first-order policy
optimization methods. We focus on a generalized variant of the policy gradient
method, which is able to maximize not only a cumulative sum of rewards but also
a general utility function over a policy's long-term visiting distribution. By
exploiting the problem's hidden convex nature and leveraging techniques from
composition optimization, we propose a Stochastic Incremental Variance-Reduced
Policy Gradient (SIVR-PG) approach that improves a sequence of policies to
provably converge to the global optimal solution and finds an
$\epsilon$-optimal policy using $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples.
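The core mechanism behind such variance-reduced policy gradient estimators can be sketched in a few lines. The sketch below is a minimal illustration of the reference-gradient-plus-correction structure shared by SVRPG-style methods; it is not the paper's SIVR-PG algorithm (which additionally handles general utilities of the occupancy measure via compositional techniques), and the toy bandit environment and all names are assumptions made for the example.

```python
# Minimal sketch of variance-reduced policy gradient on a toy softmax bandit.
# Outer loop: accurate reference gradient on a large batch.
# Inner loop: cheap corrected gradients that reuse the reference estimate.
import numpy as np

rng = np.random.default_rng(0)
K = 5                                   # number of arms (toy environment)
true_means = rng.uniform(0.0, 1.0, K)   # unknown reward means

def sample_reward(a):
    """Bernoulli reward for arm a (toy environment)."""
    return float(rng.random() < true_means[a])

def policy(theta):
    """Softmax policy over arms."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    """Gradient of log pi_theta(a) for the softmax parameterization."""
    g = -policy(theta)
    g[a] += 1.0
    return g

def reinforce_grad(theta, batch):
    """Vanilla REINFORCE estimator on a batch of (action, reward) pairs."""
    return np.mean([r * grad_log_pi(theta, a) for a, r in batch], axis=0)

def corrected_grad(theta, theta_ref, mu_ref, batch):
    """Variance-reduced estimator: current gradient minus importance-weighted
    reference gradient, re-centered by the large-batch reference gradient."""
    pi, pi_ref = policy(theta), policy(theta_ref)
    est = np.zeros_like(theta)
    for a, r in batch:
        w = pi_ref[a] / pi[a]           # importance weight of the sample
        est += r * grad_log_pi(theta, a) - w * r * grad_log_pi(theta_ref, a)
    return est / len(batch) + mu_ref

theta = np.zeros(K)
eta, N_outer, m_inner, B_big, b_small = 0.5, 20, 5, 200, 10
for _ in range(N_outer):
    # Outer loop: expensive, accurate gradient at the reference policy.
    theta_ref = theta.copy()
    big_batch = [(a, sample_reward(a)) for a in rng.choice(K, B_big, p=policy(theta_ref))]
    mu_ref = reinforce_grad(theta_ref, big_batch)
    theta = theta + eta * mu_ref
    # Inner loop: small batches, corrected gradients reuse mu_ref.
    for _ in range(m_inner):
        small_batch = [(a, sample_reward(a)) for a in rng.choice(K, b_small, p=policy(theta))]
        theta = theta + eta * corrected_grad(theta, theta_ref, mu_ref, small_batch)

print("learned policy:", np.round(policy(theta), 3), "best arm:", true_means.argmax())
```

The correction term is an unbiased adjustment: its importance weights transport the reference-policy gradient to the current sampling distribution, so the small-batch estimator inherits the low variance of the large-batch reference gradient between outer updates.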
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), which enables the effective reuse of previously collected samples.
However, IS is typically employed in RL as a passive tool for re-weighting historical samples.
We instead look for the best behavioral policy from which to collect samples in order to reduce the policy gradient variance (a minimal sketch of IS re-weighting appears after this list).
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - Low-Switching Policy Gradient with Exploration via Online Sensitivity
Sampling [23.989009116398208]
We design a low-switching sample-efficient policy optimization algorithm, LPO, with general non-linear function approximation.
We show that our algorithm obtains an $\varepsilon$-optimal policy with only $\widetilde{O}(\frac{\text{poly}(d)}{\varepsilon^{3}})$ samples.
arXiv Detail & Related papers (2023-06-15T23:51:46Z) - Adaptive Policy Learning to Additional Tasks [3.43814540650436]
This paper develops a policy learning method for tuning a pre-trained policy to adapt to additional tasks without altering the original task.
A method named Adaptive Policy Gradient (APG) is proposed in this paper, which combines Bellman's principle of optimality with the policy gradient approach to improve the convergence rate.
arXiv Detail & Related papers (2023-05-24T14:31:11Z) - Sample Complexity of Policy-Based Methods under Off-Policy Sampling and
Linear Function Approximation [8.465228064780748]
Off-policy sampling and linear function approximation are employed for policy evaluation.
Various policy update rules, including natural policy gradient (NPG), are considered for policy update.
We establish for the first time an overall $\mathcal{O}(\epsilon^{-2})$ sample complexity for finding an optimal policy.
arXiv Detail & Related papers (2022-08-05T15:59:05Z) - Policy Gradient Method For Robust Reinforcement Learning [23.62008807533706]
This paper develops the first policy gradient method with global optimality guarantee and complexity analysis for robust reinforcement learning under model mismatch.
We show that the proposed robust policy gradient method converges to the global optimum asymptotically under direct policy parameterization.
We then extend our methodology to the general model-free setting and design a robust actor-critic method with differentiable parametric policy class and value function.
arXiv Detail & Related papers (2022-05-15T17:35:17Z) - Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
We show that the preferability of optimization methods depends critically on whether exact gradients are used.
Second, to explain these findings we introduce the concept of committal rate for policy optimization.
Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z) - Bregman Gradient Policy Optimization [97.73041344738117]
We design a Bregman gradient policy optimization for reinforcement learning based on Bregman divergences and momentum techniques.
VR-BGPO reaches the best known complexity $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point, requiring only one trajectory at each iteration.
arXiv Detail & Related papers (2021-06-23T01:08:54Z) - Near Optimal Policy Optimization via REPS [33.992374484681704]
Relative entropy policy search (REPS) has demonstrated successful policy learning on a number of simulated and real-world robotic domains.
There exist no guarantees on REPS's performance when using gradient-based solvers.
We introduce a technique that uses generative access to the underlying decision process to compute parameter updates that maintain favorable convergence to the optimal regularized policy.
arXiv Detail & Related papers (2021-03-17T16:22:59Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z) - Provably Efficient Exploration in Policy Optimization [117.09887790160406]
This paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO).
OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret.
To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
arXiv Detail & Related papers (2019-12-12T08:40:02Z)
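The importance-sampling re-weighting referenced in the "Policy Gradient with Active Importance Sampling" entry above can be illustrated briefly. The sketch below builds on the helper functions of the earlier bandit example, with all names illustrative rather than taken from that paper: samples collected under a behavioral policy are re-weighted by the likelihood ratio so that the gradient estimate remains unbiased for the target policy. The cited work goes further by optimizing the behavioral policy itself to minimize the variance of this estimator.

```python
# Minimal sketch of an importance-sampling (IS) weighted REINFORCE estimator:
# samples drawn once under a behavioral policy theta_b are re-weighted by the
# likelihood ratio pi_theta(a) / pi_theta_b(a), keeping the estimator unbiased
# for the target policy's gradient. Reuses K, rng, sample_reward, policy and
# grad_log_pi from the earlier sketch; all names are illustrative.
import numpy as np

def is_weighted_grad(theta, theta_b, batch):
    """Off-policy gradient estimate for pi_theta from samples of pi_theta_b."""
    pi, pi_b = policy(theta), policy(theta_b)
    est = np.zeros_like(theta)
    for a, r in batch:
        w = pi[a] / pi_b[a]             # importance weight (likelihood ratio)
        est += w * r * grad_log_pi(theta, a)
    return est / len(batch)

# Collect once under the behavioral policy, then reuse for any target policy.
theta_b = np.zeros(K)                   # uniform behavioral policy
behav_batch = [(a, sample_reward(a)) for a in rng.choice(K, 100, p=policy(theta_b))]
theta_target = rng.normal(0.0, 0.5, K)  # some target policy parameters
print("IS-weighted gradient:", np.round(is_weighted_grad(theta_target, theta_b, behav_batch), 3))
```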