Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive
Multi-Step Bootstrap
- URL: http://arxiv.org/abs/2102.04692v1
- Date: Tue, 9 Feb 2021 07:46:34 GMT
- Title: Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive
Multi-Step Bootstrap
- Authors: Haike Xu, Tengyu Ma, Simon S. Du
- Abstract summary: This paper presents a new model-free algorithm for episodic finite-horizon Markov Decision Processes (MDP), Adaptive Multi-step Bootstrap (AMB).
We show AMB achieves a gap-dependent regret bound that only scales with the sum of the inverse of the sub-optimality gaps.
We also show AMB suffers an additional $\frac{|Z_{mul}|}{\Delta_{min}}$ regret, where $Z_{mul}$ is the set of state-action pairs $(s,a)$'s satisfying $a$ is a non-unique optimal action for $s$.
- Score: 84.66885506098724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a new model-free algorithm for episodic finite-horizon
Markov Decision Processes (MDP), Adaptive Multi-step Bootstrap (AMB), which
enjoys a stronger gap-dependent regret bound. The first innovation is to
estimate the optimal $Q$-function by combining an optimistic bootstrap with an
adaptive multi-step Monte Carlo rollout. The second innovation is to select the
action with the largest confidence interval length among admissible actions
that are not dominated by any other actions. We show when each state has a
unique optimal action, AMB achieves a gap-dependent regret bound that only
scales with the sum of the inverse of the sub-optimality gaps. In contrast,
Simchowitz and Jamieson (2019) showed all upper-confidence-bound (UCB)
algorithms suffer an additional $\Omega\left(\frac{S}{\Delta_{min}}\right)$
regret due to over-exploration where $\Delta_{min}$ is the minimum
sub-optimality gap and $S$ is the number of states. We further show that for
general MDPs, AMB suffers an additional $\frac{|Z_{mul}|}{\Delta_{min}}$
regret, where $Z_{mul}$ is the set of state-action pairs $(s,a)$'s satisfying
$a$ is a non-unique optimal action for $s$. We complement our upper bound with
a lower bound showing the dependency on $\frac{|Z_{mul}|}{\Delta_{min}}$ is
unavoidable for any consistent algorithm. This lower bound also implies a
separation between reinforcement learning and contextual bandits.
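To make the second innovation concrete, here is a minimal sketch of the described action-selection rule, assuming the algorithm maintains per-action optimistic (upper) and pessimistic (lower) $Q$-estimates at the current state; the function name, the array representation, and the example values are hypothetical illustrations, not the authors' implementation:

```python
import numpy as np

def select_action(q_upper, q_lower):
    """Sketch of the action-selection rule described in the abstract:
    keep only actions not dominated by any other action, then pick the
    one with the largest confidence-interval length."""
    # An action is dominated if its upper confidence bound falls below
    # the best lower confidence bound over all actions.
    best_lower = np.max(q_lower)
    admissible = np.where(q_upper >= best_lower)[0]
    # Among admissible actions, choose the widest confidence interval
    # (most uncertain), rather than the largest upper bound as a plain
    # UCB rule would.
    widths = q_upper[admissible] - q_lower[admissible]
    return admissible[np.argmax(widths)]

# Hypothetical confidence bounds for a state with four actions.
q_upper = np.array([1.00, 0.90, 0.55, 0.95])
q_lower = np.array([0.60, 0.50, 0.10, 0.20])
print(select_action(q_upper, q_lower))
# -> 3: action 2 is dominated; action 3 has the widest interval among the rest.
```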
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound on the cost function or an upper bound on the expected length of the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Sharper Model-free Reinforcement Learning for Average-reward Markov
Decision Processes [21.77276136591518]
We develop provably efficient model-free reinforcement learning (RL) algorithms for Markov Decision Processes (MDPs)
In the simulator setting, we propose a model-free RL algorithm that finds an $\epsilon$-optimal policy using $\widetilde{O}\left(\frac{SA\,\mathrm{sp}(h^*)}{\epsilon^2}+\frac{S^2A\,\mathrm{sp}(h^*)}{\epsilon^2}\right)$ samples.
arXiv Detail & Related papers (2023-06-28T17:43:19Z) - Adversarial Online Multi-Task Reinforcement Learning [12.421997449847153]
We consider the adversarial online multi-task reinforcement learning setting.
In each of $K$ episodes the learner is given an unknown task taken from a finite set of $M$ unknown finite-horizon MDP models.
The learner's objective is to minimize its regret with respect to the optimal policy for each task.
arXiv Detail & Related papers (2023-01-11T02:18:26Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs)
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes (MDPs) in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs [99.59319332864129]
We show that UCBVI-$\gamma$ achieves an $\widetilde{O}\big(\sqrt{SAT}/(1-\gamma)^{1.5}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $\gamma$ is the discount factor and $T$ is the number of steps.
In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $\widetilde{\Omega}\big(\sqrt{SAT}/(1-\gamma)^{1.5}\big)$.
arXiv Detail & Related papers (2020-10-01T17:57:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.