Optimistic Policy Optimization is Provably Efficient in Non-stationary
MDPs
- URL: http://arxiv.org/abs/2110.08984v1
- Date: Mon, 18 Oct 2021 02:33:20 GMT
- Title: Optimistic Policy Optimization is Provably Efficient in Non-stationary
MDPs
- Authors: Han Zhong, Zhuoran Yang, Zhaoran Wang, Csaba Szepesvári
- Abstract summary: We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs).
We propose the $\underline{\text{p}}$eriodically $\underline{\text{r}}$estarted $\underline{\text{o}}$ptimistic $\underline{\text{p}}$olicy $\underline{\text{o}}$ptimization algorithm (PROPO).
- Score: 45.6318149525364
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study episodic reinforcement learning (RL) in non-stationary linear kernel
Markov decision processes (MDPs). In this setting, both the reward function and
the transition kernel are linear with respect to the given feature maps and are
allowed to vary over time, as long as their respective parameter variations do
not exceed certain variation budgets. We propose the
$\underline{\text{p}}$eriodically $\underline{\text{r}}$estarted
$\underline{\text{o}}$ptimistic $\underline{\text{p}}$olicy
$\underline{\text{o}}$ptimization algorithm (PROPO), which is an optimistic
policy optimization algorithm with linear function approximation. PROPO
features two mechanisms: sliding-window-based policy evaluation and
periodic-restart-based policy improvement, which are tailored for policy
optimization in a non-stationary environment. In addition, we propose a
value-iteration algorithm that relies only on the sliding-window technique. We
establish dynamic regret upper bounds for the proposed methods and a matching minimax
lower bound which shows the (near-) optimality of the proposed methods. To our
best knowledge, PROPO is the first provably efficient policy optimization
algorithm that handles non-stationarity.
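For intuition, the following is a minimal, illustrative Python sketch of the two mechanisms named in the abstract; it is not the paper's algorithm or pseudocode. `SlidingWindowEvaluator`, `feature_map`, `window_size`, `restart_period`, and the toy environment interface (`env.reset()` / `env.step()` returning `(next_state, reward, done)`) are all assumed names introduced here for illustration.

```python
from collections import deque
import numpy as np

# Illustrative sketch of the two PROPO-style mechanisms (not the paper's pseudocode):
#  1) sliding-window-based policy evaluation: fit a regularized least-squares
#     linear value estimate from only the most recent `window_size` transitions,
#     so data from a drifted MDP is eventually forgotten;
#  2) periodic-restart-based policy improvement: reset the policy every
#     `restart_period` episodes before resuming soft policy updates.

class SlidingWindowEvaluator:
    """Regularized least-squares value estimate over a sliding window (assumed helper)."""

    def __init__(self, dim, window_size, reg=1.0):
        self.window = deque(maxlen=window_size)  # keeps only recent (phi, target) pairs
        self.dim = dim
        self.reg = reg

    def add(self, phi, target):
        self.window.append((phi, target))

    def weights(self):
        # Solve (sum phi phi^T + reg * I) w = sum phi * target over the window.
        A = self.reg * np.eye(self.dim)
        b = np.zeros(self.dim)
        for phi, target in self.window:
            A += np.outer(phi, phi)
            b += phi * target
        return np.linalg.solve(A, b)


def sliding_window_restart_loop(env, feature_map, num_actions, dim,
                                num_episodes=200, window_size=50,
                                restart_period=25, step_size=0.1):
    """Outer loop sketch: evaluate on a window, improve softly, restart periodically."""
    evaluator = SlidingWindowEvaluator(dim, window_size)
    logits = np.zeros(num_actions)          # state-independent policy, for brevity only

    for episode in range(num_episodes):
        if episode % restart_period == 0:   # periodic restart of the policy
            logits = np.zeros(num_actions)

        state = env.reset()
        done = False
        while not done:
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            action = np.random.choice(num_actions, p=probs)
            next_state, reward, done = env.step(action)

            # Windowed policy evaluation (regression target kept to the reward for simplicity).
            evaluator.add(feature_map(state, action), reward)
            w = evaluator.weights()

            # Soft policy improvement using the windowed value estimate.
            q_estimates = np.array([feature_map(state, a) @ w for a in range(num_actions)])
            logits += step_size * q_estimates
            state = next_state
    return logits
```

The point of the sketch: the window discards transitions generated under parameters that may have drifted beyond the variation budget, while the periodic restart keeps the policy from anchoring on advantages estimated under an outdated model.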
Related papers
- Optimal Strong Regret and Violation in Constrained MDPs via Policy Optimization [37.24692425018]
We study online learning in constrained MDPs (CMDPs).
Our algorithm implements a primal-dual scheme that employs a state-of-the-art policy optimization approach for adversarial MDPs.
arXiv Detail & Related papers (2024-10-03T07:54:04Z) - Low-Switching Policy Gradient with Exploration via Online Sensitivity
Sampling [23.989009116398208]
We design a low-switching sample-efficient policy optimization algorithm, LPO, with general non-linear function approximation.
We show that our algorithm obtains an $\varepsilon$-optimal policy with only $\widetilde{O}(\frac{\text{poly}(d)}{\varepsilon^3})$ samples.
arXiv Detail & Related papers (2023-06-15T23:51:46Z) - Optimistic Natural Policy Gradient: a Simple Efficient Policy
Optimization Framework for Online RL [23.957148537567146]
This paper proposes a simple efficient policy optimization framework -- Optimistic NPG for online RL.
For $d$-dimensional linear MDPs, Optimistic NPG is computationally efficient, and learns an $\varepsilon$-optimal policy within $\tilde{O}(d^2/\varepsilon^3)$ samples.
arXiv Detail & Related papers (2023-05-18T15:19:26Z) - Policy Gradient Algorithms Implicitly Optimize by Continuation [7.351769270728942]
We argue that exploration in policy-gradient algorithms consists in a continuation of the return of the policy at hand, and that policies should be history-dependent rather than designed to maximize the return.
arXiv Detail & Related papers (2023-05-11T14:50:20Z) - Proximal Point Imitation Learning [48.50107891696562]
We develop new algorithms with rigorous efficiency guarantees for infinite horizon imitation learning.
We leverage classical tools from optimization, in particular, the proximal-point method (PPM) and dual smoothing.
We achieve convincing empirical performance for both linear and neural network function approximation.
arXiv Detail & Related papers (2022-09-22T12:40:21Z) - Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
We show that the preferability of optimization methods depends critically on whether exact gradients are used.
Second, to explain these findings we introduce the concept of committal rate for policy optimization.
Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z) - Bregman Gradient Policy Optimization [97.73041344738117]
We design a Bregman gradient policy optimization for reinforcement learning based on Bregman divergences and momentum techniques.
VR-BGPO reaches the best complexity $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point, requiring only one trajectory at each iteration (a minimal mirror-descent-style sketch of the Bregman update idea follows this list).
arXiv Detail & Related papers (2021-06-23T01:08:54Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take time exponential in $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z) - Provably Efficient Exploration in Policy Optimization [117.09887790160406]
This paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO).
OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret.
To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
arXiv Detail & Related papers (2019-12-12T08:40:02Z)
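As flagged in the Bregman Gradient Policy Optimization entry above, the central object there is a Bregman (mirror-descent-style) policy update. Below is a minimal sketch assuming the KL / negative-entropy Bregman divergence over a softmax policy on a single state; it illustrates the general idea only and is not the paper's BGPO/VR-BGPO algorithm. The names `probs`, `grad`, and `step_size` are illustrative.

```python
import numpy as np

def bregman_policy_update(probs, grad, step_size=0.1):
    """One mirror-descent policy step with a KL / negative-entropy Bregman divergence.

    Maximizing  <grad, p> - (1/step_size) * KL(p || probs)  over the probability
    simplex has the closed-form multiplicative-weights solution below.
    """
    new_probs = probs * np.exp(step_size * grad)
    return new_probs / new_probs.sum()

# Usage: push probability mass toward the action with the larger estimated value.
probs = np.array([0.5, 0.5])
grad = np.array([1.0, 0.2])   # e.g. estimated action values / policy gradient
print(bregman_policy_update(probs, grad))
```

With a squared-Euclidean distance in place of the KL term, the same step reduces to a projected gradient update, which is the sense in which a Bregman formulation generalizes the standard policy-gradient step.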