Rate-Optimal Policy Optimization for Linear Markov Decision Processes
- URL: http://arxiv.org/abs/2308.14642v3
- Date: Thu, 16 May 2024 11:37:06 GMT
- Title: Rate-Optimal Policy Optimization for Linear Markov Decision Processes
- Authors: Uri Sherman, Alon Cohen, Tomer Koren, Yishay Mansour
- Abstract summary: We obtain rate-optimal $\widetilde O(\sqrt K)$ regret, where $K$ denotes the number of episodes.
Our work is the first to establish the optimal (w.r.t.~$K$) rate of convergence in the stochastic setting with bandit feedback using a policy optimization approach, and the first to do so in the adversarial setup with full-information feedback, where no algorithm with an optimal rate guarantee was previously known.
- Score: 65.5958446762678
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study regret minimization in online episodic linear Markov Decision Processes, and obtain rate-optimal $\widetilde O (\sqrt K)$ regret where $K$ denotes the number of episodes. Our work is the first to establish the optimal (w.r.t.~$K$) rate of convergence in the stochastic setting with bandit feedback using a policy optimization based approach, and the first to establish the optimal (w.r.t.~$K$) rate in the adversarial setup with full information feedback, for which no algorithm with an optimal rate guarantee is currently known.
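For reference, a standard formalization of the regret notion used above (a sketch under the usual episodic conventions; $s_1^k$, $\pi_k$, and $V^\star$ denote the episode-$k$ initial state, the learner's policy, and the optimal value function, and are our notation, not the paper's):
```latex
% Regret over K episodes: cumulative value gap between the optimal
% policy and the learner's policy \pi_k. A rate-optimal bound scales
% as \widetilde O(\sqrt K) in K, up to polylogarithmic factors.
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \left( V^{\star}(s_1^{k}) - V^{\pi_k}(s_1^{k}) \right)
```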
Related papers
- Optimal Rates for Robust Stochastic Convex Optimization [12.620782629498812]
We develop novel algorithms that achieve minimax-optimal excess risk under the $\epsilon$-contamination model.
Our algorithms do not require restrictive assumptions, and can handle nonsmooth but Lipschitz population loss functions.
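As background (a standard definition, not quoted from this abstract), the $\epsilon$-contamination model assumes each observation comes from a mixture of the true distribution and an arbitrary outlier distribution:
```latex
% Huber \epsilon-contamination: a sample is drawn from the true
% distribution P with probability 1 - \epsilon, and from an arbitrary
% (possibly adversarial) distribution Q with probability \epsilon.
X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} (1 - \epsilon)\, P + \epsilon\, Q
```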
arXiv Detail & Related papers (2024-12-15T00:52:08Z)
- Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-In Time [13.545356254920584]
We introduce a model-free algorithm that employs variance reduction and a novel technique that switches the execution policy in a slow-yet-adaptive manner.
This is the first regret-optimal model-free algorithm in the discounted setting, with the additional benefit of a low burn-in time.
arXiv Detail & Related papers (2023-05-24T20:22:43Z)
- Autoregressive Bandits [58.46584210388307]
We propose a novel online learning setting, Autoregressive Bandits, in which the observed reward is governed by an autoregressive process of order $k$.
We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed.
We then devise a new optimistic regret minimization algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $\widetilde{\mathcal{O}}\left(\frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^{2}}\right)$, where $T$ is the optimization horizon, $n$ the number of actions, and $\Gamma < 1$ a stability index of the process.
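To make the reward model concrete, here is a minimal Python sketch (illustrative only, not the paper's code; the coefficient values and function names are assumptions) of a reward signal following an order-$k$ autoregressive process whose coefficients depend on the chosen action:
```python
import numpy as np

def ar_reward_process(actions, coeffs, noise_std=0.1, seed=0):
    """Simulate rewards following an action-dependent AR(k) process.

    actions: sequence of action indices, length T
    coeffs: array of shape (n_actions, k+1); coeffs[a] = (g_0, ..., g_k),
            so r_t = g_0(a_t) + sum_i g_i(a_t) * r_{t-i} + noise
    """
    rng = np.random.default_rng(seed)
    k = coeffs.shape[1] - 1
    history = np.zeros(k)          # last k rewards, most recent first
    rewards = []
    for a in actions:
        g = coeffs[a]
        r = g[0] + g[1:] @ history + rng.normal(0.0, noise_std)
        rewards.append(r)
        history = np.concatenate(([r], history[:-1]))  # shift the window
    return np.array(rewards)

# Example: 2 actions, AR(2) dynamics. Keeping each action's AR
# coefficients summing below 1 mirrors the Gamma < 1 stability condition.
coeffs = np.array([[0.5, 0.3, 0.1],
                   [0.8, 0.2, 0.1]])
rewards = ar_reward_process(actions=[0, 1, 1, 0] * 25, coeffs=coeffs)
print(rewards[:5])
```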
arXiv Detail & Related papers (2022-12-12T21:37:36Z)
- Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
First, we show that the preferability of optimization methods depends critically on whether exact or stochastic gradients are used.
Second, to explain these findings we introduce the concept of committal rate for policy optimization.
Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z)
- Lexicographic Optimisation of Conditional Value at Risk and Expected Value for Risk-Averse Planning in MDPs [4.87191262649216]
Planning in Markov decision processes (MDPs) typically optimises the expected cost.
An alternative approach is to find a policy which optimises a risk-averse objective such as conditional value at risk (CVaR).
We formulate the lexicographic optimisation problem of minimising the expected cost subject to the constraint that the CVaR of the total cost is optimal.
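For context, a standard definition of CVaR for a cost $X$ at tail level $\alpha$ (the Rockafellar--Uryasev variational form, not quoted from the abstract):
```latex
% CVaR at tail level \alpha \in (0,1]: roughly, the expected cost over
% the worst \alpha-fraction of outcomes; the minimizing z is the
% value-at-risk (VaR) at the same level.
\mathrm{CVaR}_{\alpha}(X) \;=\; \inf_{z \in \mathbb{R}}
  \left\{ z + \tfrac{1}{\alpha}\, \mathbb{E}\big[(X - z)^{+}\big] \right\}
```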
arXiv Detail & Related papers (2021-10-25T09:16:50Z)
- Optimistic Policy Optimization is Provably Efficient in Non-stationary MDPs [113.8752163061151]
We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs).
We propose the Periodically Restarted Optimistic Policy Optimization algorithm (PROPO).
PROPO features two mechanisms: sliding-window-based policy evaluation and periodic-restart-based policy improvement.
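The two mechanisms can be pictured with a short Python sketch (purely illustrative control flow; the helper functions and the window/restart lengths are hypothetical stand-ins, not PROPO itself):
```python
import random

def run_episode(policy):
    # Hypothetical stand-in for one episode of environment interaction.
    return [random.random() for _ in range(10)]

def evaluate_policy(episodes):
    # Sliding-window evaluation: average return over recent episodes only.
    return sum(sum(e) for e in episodes) / len(episodes)

def improve_policy(policy, value_estimate):
    # Stand-in for an optimistic policy improvement step.
    return value_estimate

def propo_style_loop(K=1000, window=50, restart_every=200):
    """Illustrative control flow only: sliding-window policy evaluation
    plus periodic-restart policy improvement, as described for PROPO."""
    data, policy = [], None
    for k in range(K):
        if k % restart_every == 0:
            data, policy = [], None          # periodic restart: discard stale data
        data.append(run_episode(policy))
        value = evaluate_policy(data[-window:])  # evaluate on a sliding window
        policy = improve_policy(policy, value)
    return policy

print(propo_style_loop())
```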
arXiv Detail & Related papers (2021-10-18T02:33:20Z)
- On the Optimality of Batch Policy Optimization Algorithms [106.89498352537682]
Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment.
We show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral.
We introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction.
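One way to formalize "confidence-adjusted index algorithm" (our reading of the abstract, not a quote): score each candidate by its value estimate plus a signed confidence width, where the sign of the adjustment makes the rule optimistic, pessimistic, or neutral:
```latex
% \hat{v}_i: empirical value estimate for candidate i; c_i: its
% confidence width. \beta > 0 gives an optimistic rule, \beta < 0
% a pessimistic one, \beta = 0 a neutral one; the paper's claim is
% that any such choice is minimax optimal in batch policy optimization.
\mathrm{index}(i) \;=\; \hat{v}_i + \beta\, c_i, \qquad
\hat{\pi} \;=\; \operatorname*{arg\,max}_{i}\ \mathrm{index}(i)
```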
arXiv Detail & Related papers (2021-04-06T05:23:20Z)
- Optimistic Policy Optimization with Bandit Feedback [70.75568142146493]
We propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $\tilde O(\sqrt{S^2 A H^4 K})$ regret for stochastic rewards.
To the best of our knowledge, these are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.
arXiv Detail & Related papers (2020-02-19T15:41:18Z)