Optimistic Policy Optimization is Provably Efficient in Non-stationary
MDPs
- URL: http://arxiv.org/abs/2110.08984v1
- Date: Mon, 18 Oct 2021 02:33:20 GMT
- Title: Optimistic Policy Optimization is Provably Efficient in Non-stationary
MDPs
- Authors: Han Zhong, Zhuoran Yang, Zhaoran Wang Csaba Szepesv\'ari
- Abstract summary: We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs)
We propose the $underlinetextp$eriodically $underlinetextr$estarted $underlinetexto$ptimistic $underlinetextp$olicy $underlinetexto$ptimization algorithm (PROPO)
- Score: 45.6318149525364
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study episodic reinforcement learning (RL) in non-stationary linear kernel
Markov decision processes (MDPs). In this setting, both the reward function and
the transition kernel are linear with respect to the given feature maps and are
allowed to vary over time, as long as their respective parameter variations do
not exceed certain variation budgets. We propose the
$\underline{\text{p}}$eriodically $\underline{\text{r}}$estarted
$\underline{\text{o}}$ptimistic $\underline{\text{p}}$olicy
$\underline{\text{o}}$ptimization algorithm (PROPO), which is an optimistic
policy optimization algorithm with linear function approximation. PROPO
features two mechanisms: sliding-window-based policy evaluation and
periodic-restart-based policy improvement, which are tailored for policy
optimization in a non-stationary environment. In addition, only utilizing the
technique of sliding window, we propose a value-iteration algorithm. We
establish dynamic upper bounds for the proposed methods and a matching minimax
lower bound which shows the (near-) optimality of the proposed methods. To our
best knowledge, PROPO is the first provably efficient policy optimization
algorithm that handles non-stationarity.
Related papers
- Optimal Strong Regret and Violation in Constrained MDPs via Policy Optimization [37.24692425018]
We study online learning in emphconstrained MDPs (CMDPs)
Our algorithm implements a primal-dual scheme that employs a state-of-the-art policy optimization approach for adversarial MDPs.
arXiv Detail & Related papers (2024-10-03T07:54:04Z) - Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO derived based on the optimal solution of problem leads to a compromised mean-seeking approximation of the optimal solution in practice.
We propose efficient exact optimization (EXO) of the alignment objective.
arXiv Detail & Related papers (2024-02-01T18:51:54Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted Markov decision process (constrained MDP)
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - Low-Switching Policy Gradient with Exploration via Online Sensitivity
Sampling [23.989009116398208]
We design a low-switching sample-efficient policy optimization algorithm, LPO, with general non-linear function approximation.
We show that, our algorithm obtains an $varepsilon$-optimal policy with only $widetildeO(fractextpoly(d)varepsilon3)$ samples.
arXiv Detail & Related papers (2023-06-15T23:51:46Z) - Optimistic Natural Policy Gradient: a Simple Efficient Policy
Optimization Framework for Online RL [23.957148537567146]
This paper proposes a simple efficient policy optimization framework -- Optimistic NPG for online RL.
For $d$-dimensional linear MDPs, Optimistic NPG is computationally efficient, and learns an $varepsilon$-optimal policy within $tildeO(d2/varepsilon3)$ samples.
arXiv Detail & Related papers (2023-05-18T15:19:26Z) - Policy Gradient Algorithms Implicitly Optimize by Continuation [7.351769270728942]
We argue that exploration in policy-gradient algorithms consists in a continuation of the return of the policy at hand, and that policies should be history-dependent rather than to maximize the return.
arXiv Detail & Related papers (2023-05-11T14:50:20Z) - Proximal Point Imitation Learning [48.50107891696562]
We develop new algorithms with rigorous efficiency guarantees for infinite horizon imitation learning.
We leverage classical tools from optimization, in particular, the proximal-point method (PPM) and dual smoothing.
We achieve convincing empirical performance for both linear and neural network function approximation.
arXiv Detail & Related papers (2022-09-22T12:40:21Z) - Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
We show that the preferability of optimization methods depends critically on whether exact gradients are used.
Second, to explain these findings we introduce the concept of committal rate for policy optimization.
Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z) - Bregman Gradient Policy Optimization [97.73041344738117]
We design a Bregman gradient policy optimization for reinforcement learning based on Bregman divergences and momentum techniques.
VR-BGPO reaches the best complexity $tilde(epsilon-3)$ for finding an $epsilon$stationary point only requiring one trajectory at each iteration.
arXiv Detail & Related papers (2021-06-23T01:08:54Z) - Near Optimal Policy Optimization via REPS [33.992374484681704]
emphrelative entropy policy search (REPS) has demonstrated successful policy learning on a number of simulated and real-world robotic domains.
There exist no guarantees on REPS's performance when using gradient-based solvers.
We introduce a technique that uses emphgenerative access to the underlying decision process to compute parameter updates that maintain favorable convergence to the optimal regularized policy.
arXiv Detail & Related papers (2021-03-17T16:22:59Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $mathcalS|$ and $frac11-gamma$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z) - Fast Global Convergence of Natural Policy Gradient Methods with Entropy
Regularization [44.24881971917951]
Natural policy gradient (NPG) methods are among the most widely used policy optimization algorithms.
We develop convergence guarantees for entropy-regularized NPG methods under softmax parameterization.
Our results accommodate a wide range of learning rates, and shed light upon the role of entropy regularization in enabling fast convergence.
arXiv Detail & Related papers (2020-07-13T17:58:41Z) - Proximal Gradient Algorithm with Momentum and Flexible Parameter Restart
for Nonconvex Optimization [73.38702974136102]
Various types of parameter restart schemes have been proposed for accelerated algorithms to facilitate their practical convergence in rates.
In this paper, we propose an algorithm for solving nonsmooth problems.
arXiv Detail & Related papers (2020-02-26T16:06:27Z) - Provably Efficient Exploration in Policy Optimization [117.09887790160406]
This paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO)
OPPO achieves $tildeO(sqrtd2 H3 T )$ regret.
To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
arXiv Detail & Related papers (2019-12-12T08:40:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.