Cautiously Optimistic Policy Optimization and Exploration with Linear Function Approximation
- URL: http://arxiv.org/abs/2103.12923v1
- Date: Wed, 24 Mar 2021 01:42:59 GMT
- Title: Cautiously Optimistic Policy Optimization and Exploration with Linear Function Approximation
- Authors: Andrea Zanette, Ching-An Cheng, Alekh Agarwal
- Abstract summary: Policy optimization methods are popular reinforcement learning algorithms, because their incremental and on-policy nature makes them more stable than the value-based counterparts.
In this paper, we propose a new algorithm, COPOE, that overcomes the sample complexity issue of PCPG while retaining its robustness to model misspecification.
The result is an improvement in sample complexity from $\widetilde{O}(1/\epsilon^{11})$ for PCPG to $\widetilde{O}(1/\epsilon^{3})$ for COPOE, nearly bridging the gap with value-based techniques.
- Score: 48.744735294559824
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy optimization methods are popular reinforcement learning algorithms,
because their incremental and on-policy nature makes them more stable than the
value-based counterparts. However, the same properties also make them slow to
converge and sample inefficient, as the on-policy requirement precludes data
reuse and the incremental updates couple large iteration complexity into the
sample complexity. These characteristics have been observed in experiments as
well as in theory in the recent work of~\citet{agarwal2020pc}, which provides a
policy optimization method PCPG that can robustly find near-optimal policies for
approximately linear Markov decision processes but suffers from an extremely
poor sample complexity compared with value-based techniques.
In this paper, we propose a new algorithm, COPOE, that overcomes the sample
complexity issue of PCPG while retaining its robustness to model
misspecification. Compared with PCPG, COPOE makes several important algorithmic
enhancements, such as enabling data reuse, and uses more refined analysis
techniques, which we expect to be more broadly applicable to designing new
reinforcement learning algorithms. The result is an improvement in sample
complexity from $\widetilde{O}(1/\epsilon^{11})$ for PCPG to
$\widetilde{O}(1/\epsilon^3)$ for COPOE, nearly bridging the gap with
value-based techniques.
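
To make the two ingredients highlighted above concrete, namely optimism for exploration and data reuse across iterations under linear function approximation, the sketch below shows a generic loop in that spirit. It is an illustration only, not the authors' COPOE pseudocode: it regresses Monte Carlo returns on features over all data collected so far, adds an elliptical bonus for optimism, and updates a softmax policy incrementally from the accumulated optimistic critics. The environment interface (env.reset, env.step, env.actions), the feature map phi, and all constants are hypothetical placeholders.

    import numpy as np

    # Illustrative sketch only: generic optimistic policy optimization with data
    # reuse under linear function approximation. NOT the authors' COPOE pseudocode.
    def optimistic_po_sketch(env, phi, d, num_iters=20, episodes_per_iter=10,
                             eta=0.1, beta=1.0, lam=1.0, gamma=0.99):
        # Assumed interfaces: env.reset() -> state, env.step(a) -> (state, reward, done),
        # env.actions(s) -> list of actions, phi(s, a) -> np.ndarray of shape (d,).
        data = []      # replay buffer shared across iterations (data reuse)
        critics = []   # one (w, Sigma_inv) pair per past iteration

        def optimistic_q(s, a):
            # Sum of past linear critics plus elliptical exploration bonuses.
            x = phi(s, a)
            return sum(w @ x + beta * np.sqrt(x @ S_inv @ x) for w, S_inv in critics)

        def policy_probs(s, actions):
            # Softmax ("exponential weights") over the accumulated optimistic values.
            logits = eta * np.array([optimistic_q(s, a) for a in actions])
            p = np.exp(logits - logits.max())
            return p / p.sum()

        for _ in range(num_iters):
            # 1) Collect on-policy episodes and store (feature, return-to-go) pairs.
            for _ in range(episodes_per_iter):
                s, done, feats, rews = env.reset(), False, [], []
                while not done:
                    acts = env.actions(s)
                    a = acts[np.random.choice(len(acts), p=policy_probs(s, acts))]
                    feats.append(phi(s, a))
                    s, r, done = env.step(a)
                    rews.append(r)
                g = 0.0
                for x, r in zip(reversed(feats), reversed(rews)):
                    g = r + gamma * g
                    data.append((x, g))

            # 2) Refit a linear critic by ridge regression on ALL stored data,
            #    and record the covariance used for the exploration bonus.
            X = np.stack([x for x, _ in data])
            y = np.array([g for _, g in data])
            Sigma = X.T @ X + lam * np.eye(d)
            w = np.linalg.solve(Sigma, X.T @ y)
            critics.append((w, np.linalg.inv(Sigma)))

        return policy_probs

In this sketch the softmax over accumulated optimistic critics plays the role of an incremental policy update, and the single growing buffer is what lets every past sample be reused in each regression.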
Related papers
- Efficiently Escaping Saddle Points for Non-Convex Policy Optimization [40.0986936439803]
Policy gradient (PG) is widely used in reinforcement learning due to its scalability and good performance.
We propose a variance-reduced second-order method that uses second-order information in the form of Hessian vector products (HVP) and converges to an approximate second-order stationary point (SOSP) with sample complexity of $\tilde{O}(\epsilon^{-3})$.
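As a concrete illustration of the Hessian-vector-product idea this summary refers to, the snippet below approximates $H(\theta)v$ from two gradient evaluations via central differences, without ever forming the $d \times d$ Hessian; grad_fn is a hypothetical gradient oracle, and the paper's method may compute HVPs differently (e.g. by automatic differentiation).

    import numpy as np

    def hvp(grad_fn, theta, v, eps=1e-5):
        # Central finite-difference approximation of the Hessian-vector product:
        # H(theta) @ v ~= (grad(theta + eps*v) - grad(theta - eps*v)) / (2*eps).
        return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2.0 * eps)

    # Quick check on a quadratic f(theta) = 0.5 * theta^T A theta, whose Hessian is A.
    A = np.array([[2.0, 0.5], [0.5, 1.0]])
    grad_fn = lambda th: A @ th
    print(hvp(grad_fn, np.array([1.0, -1.0]), np.array([0.3, 0.7])))  # ~ A @ v = [0.95, 0.85]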
arXiv Detail & Related papers (2023-11-15T12:36:45Z)
- Low-Switching Policy Gradient with Exploration via Online Sensitivity Sampling [23.989009116398208]
We design a low-switching sample-efficient policy optimization algorithm, LPO, with general non-linear function approximation.
We show that our algorithm obtains an $\varepsilon$-optimal policy with only $\widetilde{O}(\frac{\text{poly}(d)}{\varepsilon^3})$ samples.
arXiv Detail & Related papers (2023-06-15T23:51:46Z)
- Optimistic Natural Policy Gradient: a Simple Efficient Policy Optimization Framework for Online RL [23.957148537567146]
This paper proposes a simple and efficient policy optimization framework, Optimistic NPG, for online RL.
For $d$-dimensional linear MDPs, Optimistic NPG is computationally efficient and learns an $\varepsilon$-optimal policy within $\tilde{O}(d^2/\varepsilon^3)$ samples.
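Schematically, natural policy gradient methods of this optimistic flavour perform a multiplicative-weights update on an optimistic value estimate, e.g. $\pi_{t+1}(a \mid s) \propto \pi_t(a \mid s)\, \exp\big(\eta\, \bar{Q}_t(s,a)\big)$, where $\eta$ is a step size and $\bar{Q}_t$ is an estimate of $Q^{\pi_t}$ inflated by an exploration bonus; these symbols are generic placeholders rather than the paper's exact notation.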
arXiv Detail & Related papers (2023-05-18T15:19:26Z)
- Stochastic Policy Gradient Methods: Improved Sample Complexity for Fisher-non-degenerate Policies [19.779044926914704]
We develop improved global convergence guarantees for a general class of Fisher-non-degenerate parameterized policies.
In this work, we propose a Normalized Policy Gradient method with Implicit Gradient Transport (N-PG-IGT) and derive a $\tilde{\mathcal{O}}(\varepsilon^{-2.5})$ sample complexity for this method.
We further improve this complexity to $\tilde{\mathcal{O}}(\varepsilon^{-2})$ by considering a Hessian-Aided Recursive Policy Gradient method.
arXiv Detail & Related papers (2023-02-03T13:50:23Z)
- Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study optimization of the area under the precision-recall curve (AUPRC), which is widely used for imbalanced tasks.
We develop novel momentum methods with a better iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary solution.
We also design a novel family of adaptive methods with the same complexity of $O(1/\epsilon^4)$, which enjoy faster convergence in practice.
arXiv Detail & Related papers (2021-07-02T16:21:52Z)
- Bregman Gradient Policy Optimization [97.73041344738117]
We design a Bregman gradient policy optimization framework for reinforcement learning based on Bregman divergences and momentum techniques.
The variance-reduced variant, VR-BGPO, reaches the best complexity $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point, requiring only one trajectory at each iteration.
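For intuition, a Bregman (mirror-descent style) policy update of the general kind described here replaces the Euclidean proximal term of a plain gradient step with a Bregman divergence, e.g. $\theta_{t+1} = \arg\max_{\theta} \big\{ \langle \nabla_\theta J(\theta_t), \theta \rangle - \tfrac{1}{\eta} D_\psi(\theta, \theta_t) \big\}$, where $J$ is the expected return, $\psi$ is a mirror map, and $D_\psi$ its Bregman divergence; choosing $\psi(\theta) = \tfrac{1}{2}\|\theta\|_2^2$ recovers a standard policy-gradient step. These symbols are generic and not necessarily the paper's notation.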
arXiv Detail & Related papers (2021-06-23T01:08:54Z)
- Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
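To illustrate the idea of a probabilistic model of the objective and its gradient, the sketch below fits a Gaussian-process surrogate to (policy parameters, return) pairs and ascends the gradient of its posterior mean. This is a generic GP-based scheme under simple assumptions (RBF kernel, zero-mean prior), not the paper's exact algorithm; evaluate_return is a hypothetical rollout routine.

    import numpy as np

    def rbf(X1, X2, ell=0.5):
        # Squared-exponential kernel between two sets of points.
        sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * sq / ell**2)

    def gp_mean_grad(theta, X, y, ell=0.5, noise=1e-2):
        # Gradient of the GP posterior mean m(theta) = k(theta, X) K^{-1} y
        # for an RBF kernel with zero prior mean.
        y = y - y.mean()                               # center observed returns
        K = rbf(X, X, ell) + noise * np.eye(len(X))
        alpha = np.linalg.solve(K, y)                  # K^{-1} y
        k = rbf(theta[None, :], X, ell)[0]
        grads = -(theta[None, :] - X) / ell**2 * k[:, None]   # d k(theta, X_i)/d theta
        return grads.T @ alpha

    def gp_local_search(evaluate_return, theta0, iters=30, pert=0.1, lr=0.05, n_probe=8):
        # Local policy search: probe returns around the current parameters,
        # then step along the surrogate's posterior-mean gradient.
        theta, X, y = theta0.copy(), [], []
        for _ in range(iters):
            for _ in range(n_probe):
                cand = theta + pert * np.random.randn(*theta.shape)
                X.append(cand)
                y.append(evaluate_return(cand))
            theta = theta + lr * gp_mean_grad(theta, np.stack(X), np.array(y))
        return theta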
arXiv Detail & Related papers (2021-06-22T16:07:02Z)
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.