Rate-Optimal Policy Optimization for Linear Markov Decision Processes
- URL: http://arxiv.org/abs/2308.14642v3
- Date: Thu, 16 May 2024 11:37:06 GMT
- Title: Rate-Optimal Policy Optimization for Linear Markov Decision Processes
- Authors: Uri Sherman, Alon Cohen, Tomer Koren, Yishay Mansour
- Abstract summary: We obtain rate-optimal $\widetilde O(\sqrt K)$ regret, where $K$ denotes the number of episodes.
Our work is the first to establish the optimal (w.r.t. $K$) rate of convergence in the stochastic setting with bandit feedback using a policy optimization based approach.
It is also the first to establish the optimal (w.r.t. $K$) rate in the adversarial setup with full information feedback, for which no algorithm with an optimal rate guarantee was previously known.
- Score: 65.5958446762678
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study regret minimization in online episodic linear Markov Decision Processes, and obtain rate-optimal $\widetilde O (\sqrt K)$ regret where $K$ denotes the number of episodes. Our work is the first to establish the optimal (w.r.t.~$K$) rate of convergence in the stochastic setting with bandit feedback using a policy optimization based approach, and the first to establish the optimal (w.r.t.~$K$) rate in the adversarial setup with full information feedback, for which no algorithm with an optimal rate guarantee is currently known.
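For orientation, the standard objects behind these guarantees can be sketched in conventional notation (an illustrative sketch, not the paper's own definitions). Over $K$ episodes the learner plays policies $\pi_1,\dots,\pi_K$ and its regret is $\mathrm{Regret}(K) = \sum_{k=1}^{K} \big( V_1^{\star}(s_1^k) - V_1^{\pi_k}(s_1^k) \big)$, the cumulative gap to the optimal value from each episode's initial state. A linear MDP assumes a known $d$-dimensional feature map $\phi(s,a)$ such that $P_h(s' \mid s,a) = \langle \phi(s,a), \mu_h(s') \rangle$ and $r_h(s,a) = \langle \phi(s,a), \theta_h \rangle$; a rate-optimal guarantee then reads $\mathrm{Regret}(K) = \widetilde O(\sqrt K)$, with the dependence on the horizon $H$ and dimension $d$ absorbed into the $\widetilde O(\cdot)$ notation.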
Related papers
- Warm-up Free Policy Optimization: Improved Regret in Linear Markov Decision Processes [12.76843681997386]
Policy Optimization (PO) methods are among the most popular Reinforcement Learning (RL) algorithms in practice.
This paper proposes a PO-based algorithm with rate-optimal regret guarantees under the linear Markov Decision Process (MDP) model.
Our algorithm achieves regret with improved dependence on the other parameters of the problem.
arXiv Detail & Related papers (2024-07-03T12:36:24Z) - Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs
with Short Burn-In Time [13.545356254920584]
We introduce a model-free algorithm that employs variance reduction and a novel technique that switches the execution policy in a slow-yet-adaptive manner.
This is the first regret-optimal model-free algorithm in the discounted setting, with the additional benefit of a low burn-in time.
arXiv Detail & Related papers (2023-05-24T20:22:43Z) - Autoregressive Bandits [58.46584210388307]
We propose a novel online learning setting, Autoregressive Bandits, in which the observed reward is governed by an autoregressive process of order $k$.
We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed.
We then devise a new optimistic regret minimization algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $\widetilde{\mathcal{O}}\left(\frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^{2}}\right)$.
arXiv Detail & Related papers (2022-12-12T21:37:36Z) - Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
We show that the preferability of optimization methods depends critically on whether exact gradients are used.
Second, to explain these findings we introduce the concept of committal rate for policy optimization.
Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z) - Lexicographic Optimisation of Conditional Value at Risk and Expected
Value for Risk-Averse Planning in MDPs [4.87191262649216]
Planning in Markov decision processes (MDPs) typically optimises the expected cost.
An alternative approach is to find a policy which optimises a risk-averse objective such as conditional value at risk (CVaR).
We formulate the lexicographic optimisation problem of minimising the expected cost subject to the constraint that the CVaR of the total cost is optimal.
arXiv Detail & Related papers (2021-10-25T09:16:50Z) - Optimistic Policy Optimization is Provably Efficient in Non-stationary
MDPs [45.6318149525364]
We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs).
We propose the periodically restarted optimistic policy optimization algorithm (PROPO).
arXiv Detail & Related papers (2021-10-18T02:33:20Z) - Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study optimization of areas under precision-recall curves (AUPRC), which is widely used for imbalanced tasks.
We develop novel momentum methods with an improved iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary solution.
We also design a novel family of adaptive methods with the same $O(1/\epsilon^4)$ complexity, which enjoy faster convergence in practice.
arXiv Detail & Related papers (2021-07-02T16:21:52Z) - On the Optimality of Batch Policy Optimization Algorithms [106.89498352537682]
Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment.
We show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral.
We introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction.
arXiv Detail & Related papers (2021-04-06T05:23:20Z) - Optimistic Policy Optimization with Bandit Feedback [70.75568142146493]
We propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $\tilde O(\sqrt{S^2 A H^4 K})$ regret for stochastic rewards.
To the best of our knowledge, the two results are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback (a generic sketch of the optimistic policy-optimization template appears after this list).
arXiv Detail & Related papers (2020-02-19T15:41:18Z)