Randomized Exploration is Near-Optimal for Tabular MDP
- URL: http://arxiv.org/abs/2102.09703v1
- Date: Fri, 19 Feb 2021 01:42:50 GMT
- Title: Randomized Exploration is Near-Optimal for Tabular MDP
- Authors: Zhihan Xiong, Ruoqi Shen, Simon S. Du
- Abstract summary: We study exploration using randomized value functions in Thompson Sampling (TS)-like algorithms in reinforcement learning.
We show that when we use 1) a single random seed in each episode, and 2) a Bernstein-type magnitude of noise, we obtain a worst-case $\widetilde{O}\left(H\sqrt{SAT}\right)$ regret bound for episodic time-inhomogeneous Markov Decision Processes.
- Score: 45.16374124699648
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study exploration using randomized value functions in Thompson Sampling
(TS)-like algorithms in reinforcement learning. This type of algorithm enjoys
appealing empirical performance. We show that when we use 1) a single random
seed in each episode, and 2) a Bernstein-type magnitude of noise, we obtain a
worst-case $\widetilde{O}\left(H\sqrt{SAT}\right)$ regret bound for episodic
time-inhomogeneous Markov Decision Process where $S$ is the size of state
space, $A$ is the size of action space, $H$ is the planning horizon and $T$ is
the number of interactions. This bound polynomially improves all existing
bounds for TS-like algorithms based on randomized value functions, and for the
first time, matches the $\Omega\left(H\sqrt{SAT}\right)$ lower bound up to
logarithmic factors. Our result highlights that randomized exploration can be
near-optimal, which was previously only achieved by optimistic algorithms.
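
A minimal Python sketch of the mechanism described in the abstract (backward value iteration on the empirical model, where a single random seed drawn once per episode scales a Bernstein-type, variance-dependent noise magnitude) is given below. This is an illustrative reconstruction rather than the paper's pseudocode; the function names, the exact bonus constants, and the clipping to $[0, H]$ are assumptions.

```python
import numpy as np

def bernstein_bonus(var_hat, n, H):
    """Assumed Bernstein-type noise magnitude: scales with the empirical
    variance of the next-step value plus a lower-order H/n term."""
    n = np.maximum(n, 1)
    return np.sqrt(np.maximum(var_hat, 0.0) / n) + H / n

def randomized_value_iteration(P_hat, R_hat, N, H, rng):
    """One episode of randomized planning: a single scalar seed perturbs
    every backed-up Q-value through a state-action-dependent scale.

    P_hat[h]: (S, A, S) empirical transitions; R_hat[h]: (S, A) empirical
    rewards; N[h]: (S, A) visit counts at step h. All names are illustrative.
    """
    S, A, _ = P_hat[0].shape
    Q = np.zeros((H + 1, S, A))
    V = np.zeros((H + 1, S))
    xi = rng.standard_normal()  # the single random seed shared by the whole episode
    for h in reversed(range(H)):
        ev = P_hat[h] @ V[h + 1]                      # (S, A) expected next-step value
        var_hat = P_hat[h] @ V[h + 1] ** 2 - ev ** 2  # empirical variance of next-step value
        noise = xi * bernstein_bonus(var_hat, N[h], H)
        Q[h] = np.clip(R_hat[h] + ev + noise, 0.0, H)
        V[h] = Q[h].max(axis=1)
    return Q

# Toy usage with random empirical statistics; in each episode one would redraw
# the seed, recompute Q, and act greedily with respect to the perturbed values.
S, A, H = 5, 3, 4
rng = np.random.default_rng(0)
P_hat = [rng.dirichlet(np.ones(S), size=(S, A)) for _ in range(H)]
R_hat = [rng.uniform(0.0, 1.0, size=(S, A)) for _ in range(H)]
N = [np.full((S, A), 10) for _ in range(H)]
Q = randomized_value_iteration(P_hat, R_hat, N, H, rng)
```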
Related papers
- Achieving Tractable Minimax Optimal Regret in Average Reward MDPs [19.663336027878408]
We present the first tractable algorithm with minimax optimal regret of $\widetilde{\mathrm{O}}(\sqrt{\mathrm{sp}(h^*) S A T})$.
Remarkably, our algorithm does not require prior information on $\mathrm{sp}(h^*)$.
arXiv Detail & Related papers (2024-06-03T11:53:44Z)
- Randomized Exploration for Reinforcement Learning with Multinomial Logistic Function Approximation [8.274693573069442]
We study reinforcement learning with multinomial logistic (MNL) function approximation.
We propose provably efficient algorithms with randomized exploration having frequentist regret guarantees.
Numerical experiments demonstrate the superior performance of the proposed algorithms.
arXiv Detail & Related papers (2024-05-30T15:39:19Z)
- Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z)
- Near-Optimal Bounds for Learning Gaussian Halfspaces with Random Classification Noise [50.64137465792738]
We show that any efficient SQ algorithm for the problem requires sample complexity at least $\Omega(d^{1/2}/(\max\{p, \epsilon\})^2)$.
Our lower bound suggests that this quadratic dependence on $1/\epsilon$ is inherent for efficient algorithms.
arXiv Detail & Related papers (2023-07-13T18:59:28Z)
- Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes [21.77276136591518]
We develop provably efficient model-free reinforcement learning (RL) algorithms for Markov Decision Processes (MDPs).
In the simulator setting, we propose a model-free RL algorithm that finds an $\epsilon$-optimal policy using $\widetilde{O}\left(\frac{SA\,\mathrm{sp}(h^*)}{\epsilon^2}+\frac{S^2A\,\mathrm{sp}(h^*)}{\epsilon^2}\right)$ samples.
arXiv Detail & Related papers (2023-06-28T17:43:19Z)
- Nearly Optimal Policy Optimization with Stable at Any Time Guarantee [53.155554415415445]
The regret bound of the policy-based method in Shani et al. (2020) is only $\tilde{O}(\sqrt{S^2AH^4K})$, where $S$ is the number of states, $A$ is the number of actions, $H$ is the horizon, and $K$ is the number of episodes; this leaves a $\sqrt{SH}$ gap compared with the information-theoretic lower bound $\tilde{\Omega}(\sqrt{SAH^3K})$.
arXiv Detail & Related papers (2021-12-21T01:54:17Z)
- Randomized Exploration for Reinforcement Learning with General Value Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises (see the sketch after this list).
We complement the theory with an empirical evaluation across known difficult exploration tasks.
arXiv Detail & Related papers (2021-06-15T02:23:07Z)
- Streaming Complexity of SVMs [110.63976030971106]
We study the space complexity of solving the bias-regularized SVM problem in the streaming model.
We show that for both problems, for dimensions of $\frac{1}{\lambda\epsilon}$, one can obtain streaming algorithms with space polynomially smaller than $\frac{1}{\lambda\epsilon}$.
arXiv Detail & Related papers (2020-07-07T17:10:00Z)
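
For the "Randomized Exploration for Reinforcement Learning with General Value Function Approximation" entry above, the data-perturbation idea can be sketched as follows, using linear features purely for concreteness. The function name `perturbed_lsvi`, the noise scale `sigma`, and the ridge parameter `lam` are illustrative assumptions, not that paper's algorithm.

```python
import numpy as np

def perturbed_lsvi(features, rewards, next_features, H, sigma, lam, rng):
    """Least-squares value iteration where each regression target is perturbed
    with i.i.d. Gaussian noise before the fit (an RLSVI-style sketch).

    features[h]: (n, d) features of the visited (s, a) pairs at step h;
    next_features[h]: (n, A, d) features of the next state for every action;
    rewards[h]: (n,) observed rewards. All names are illustrative.
    """
    d = features[0].shape[1]
    w = np.zeros((H + 1, d))  # w[H] stays zero: no value beyond the horizon
    for h in reversed(range(H)):
        v_next = (next_features[h] @ w[h + 1]).max(axis=-1)             # greedy next-step values
        targets = rewards[h] + v_next
        targets = targets + sigma * rng.standard_normal(targets.shape)  # perturb the training data
        X = features[h]
        w[h] = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ targets)
    return w

# Toy usage with synthetic data; redrawing the noise each episode yields a new
# randomized value function to act greedily against.
n, A, d, H = 100, 3, 4, 5
rng = np.random.default_rng(0)
features = [rng.normal(size=(n, d)) for _ in range(H)]
next_features = [rng.normal(size=(n, A, d)) for _ in range(H)]
rewards = [rng.uniform(size=n) for _ in range(H)]
w = perturbed_lsvi(features, rewards, next_features, H, sigma=1.0, lam=1.0, rng=rng)
```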
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.