Related papers: Achieving Tractable Minimax Optimal Regret in Average Reward MDPs

Achieving Tractable Minimax Optimal Regret in Average Reward MDPs

URL: http://arxiv.org/abs/2406.01234v1
Date: Mon, 3 Jun 2024 11:53:44 GMT
Title: Achieving Tractable Minimax Optimal Regret in Average Reward MDPs
Authors: Victor Boone, Zihan Zhang,
Abstract summary: We present the first tractable algorithm with minimax optimal regret of $widetildemathrmO(sqrtmathrmsp(h*) S A T)$. Remarkably, our algorithm does not require prior information on $mathrmsp(h*)$.
Score: 19.663336027878408
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In recent years, significant attention has been directed towards learning average-reward Markov Decision Processes (MDPs). However, existing algorithms either suffer from sub-optimal regret guarantees or computational inefficiencies. In this paper, we present the first tractable algorithm with minimax optimal regret of $\widetilde{\mathrm{O}}(\sqrt{\mathrm{sp}(h^*) S A T})$, where $\mathrm{sp}(h^*)$ is the span of the optimal bias function $h^*$, $S \times A$ is the size of the state-action space and $T$ the number of learning steps. Remarkably, our algorithm does not require prior information on $\mathrm{sp}(h^*)$. Our algorithm relies on a novel subroutine, Projected Mitigated Extended Value Iteration (PMEVI), to compute bias-constrained optimal policies efficiently. This subroutine can be applied to various previous algorithms to improve regret bounds.

Related papers

Linear Bandits on Ellipsoids: Minimax Optimal Algorithms [5.678465386088928]
We consider computationally linear bandits where the set of actions is an ellipsoid. We provide the first known minimax optimal algorithm for this problem. A run requires only time $O(dT + d2 log(T/d) + d3)$ and memory $O(d2)$.
arXiv Detail & Related papers (2025-02-24T14:12:31Z)
Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation [1.8416014644193066]
We propose an algorithm for learning linear Markov decision processes (MDPs) and linear mixture MDPs under the Bellman optimality condition. Our algorithm for linear MDPs achieves the best-known regret upper bound of $widetildemathcalO(d3/2mathrmsp(v*)sqrtT)$ over $T$ time steps. For linear mixture MDPs, our algorithm attains a regret bound of $widetildemathcalO(dcdotmathrm
arXiv Detail & Related papers (2024-09-16T23:13:42Z)
Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path [80.60592344361073]
We study the Shortest Path (SSP) problem with a linear mixture transition kernel. An agent repeatedly interacts with a environment and seeks to reach certain goal state while minimizing the cumulative cost. Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z)
A Doubly Robust Approach to Sparse Reinforcement Learning [19.68978899041642]
We propose a new regret algorithm for episodic sparse linear Markov decision process (SMDP) The proposed algorithm is $tildeO(sigma-1_min s_star H sqrtN)$, where $sigma_min$ denotes the restrictive minimum eigenvalue of the average Gram matrix of feature vectors.
arXiv Detail & Related papers (2023-10-23T18:52:17Z)
Efficient Rate Optimal Regret for Adversarial Contextual MDPs Using Online Function Approximation [47.18926328995424]
We present the OMG-CMDP! algorithm for regret minimization in adversarial Contextual MDPs. Our algorithm is efficient (assuming efficient online regression oracles) and simple and robust to approximation errors.
arXiv Detail & Related papers (2023-03-02T18:27:00Z)
Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
We propose a provably efficient reinforcement learning algorithm (both computationally and statistically) with general value function approximations. Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting.
arXiv Detail & Related papers (2023-02-22T20:21:25Z)
Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes. This paper provides two algorithms that improve the regret to $tildemathcal O(K2/3)$ in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z)
Learning Infinite-Horizon Average-Reward Markov Decision Processes with Constraints [39.715977181666766]
We study regret for infinite-horizon average-reward Markov Decision Processes (MDPs) under cost constraints. Our algorithm ensures $widetildeO(sqrtT)$ regret and constant constraint violation for ergodic MDPs. These are the first set of provable algorithms for weakly communicating MDPs with cost constraints.
arXiv Detail & Related papers (2022-01-31T23:52:34Z)
Nearly Optimal Policy Optimization with Stable at Any Time Guarantee [53.155554415415445]
Policy-based method in citetshani 2020optimistic is only $tildeO(sqrtSAH3K + sqrtAH4K)$ where $S$ is the number of states, $A$ is the number of actions, $H$ is the horizon, and $K$ is the number of episodes, and there is a $sqrtSH$ gap compared with the information theoretic lower bound $tildeOmega(sqrtSAH
arXiv Detail & Related papers (2021-12-21T01:54:17Z)
Gaussian Process Bandit Optimization with Few Batches [49.896920704012395]
We introduce a batch algorithm inspired by finite-arm bandit algorithms. We show that it achieves the cumulative regret upper bound $Oast(sqrtTgamma_T)$ using $O(loglog T)$ batches within time horizon $T$. In addition, we propose a modified version of our algorithm, and characterize how the regret is impacted by the number of batches.
arXiv Detail & Related papers (2021-10-15T00:54:04Z)
Randomized Exploration for Reinforcement Learning with General Value Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm. Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises. We complement the theory with an empirical evaluation across known difficult exploration tasks.
arXiv Detail & Related papers (2021-06-15T02:23:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.