Reaching Goals is Hard: Settling the Sample Complexity of the Stochastic
Shortest Path
- URL: http://arxiv.org/abs/2210.04946v1
- Date: Mon, 10 Oct 2022 18:34:32 GMT
- Title: Reaching Goals is Hard: Settling the Sample Complexity of the Stochastic
Shortest Path
- Authors: Liyu Chen, Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric
- Abstract summary: We study the sample complexity of learning an $\epsilon$-optimal policy in the Stochastic Shortest Path (SSP) problem.
We derive sample complexity bounds when the learner has access to a generative model.
We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states $B_{\star}$, on which any algorithm requires at least $\Omega(SAB_{\star}^3/(c_{\min}\epsilon^2))$ samples to return an $\epsilon$-optimal policy.
- Score: 106.37656068276902
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the sample complexity of learning an $\epsilon$-optimal policy in
the Stochastic Shortest Path (SSP) problem. We first derive sample complexity
bounds when the learner has access to a generative model. We show that there
exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost
$c_{\min}$, and maximum expected cost of the optimal policy over all states
$B_{\star}$, where any algorithm requires at least
$\Omega(SAB_{\star}^3/(c_{\min}\epsilon^2))$ samples to return an
$\epsilon$-optimal policy with high probability. Surprisingly, this implies
that whenever $c_{\min}=0$ an SSP problem may not be learnable, thus revealing
that learning in SSPs is strictly harder than in the finite-horizon and
discounted settings. We complement this result with lower bounds when prior
knowledge of the hitting time of the optimal policy is available and when we
restrict optimality by competing against policies with bounded hitting time.
Finally, we design an algorithm with matching upper bounds in these cases. This
settles the sample complexity of learning $\epsilon$-optimal policies in SSP
with generative models.
We also initiate the study of learning $\epsilon$-optimal policies without
access to a generative model (i.e., the so-called best-policy identification
problem), and show that sample-efficient learning is impossible in general. On
the other hand, efficient learning can be made possible if we assume the agent
can directly reach the goal state from any state by paying a fixed cost. We
then establish the first upper and lower bounds under this assumption.
Finally, using similar analytic tools, we prove that horizon-free regret is
impossible in SSPs under general costs, resolving an open problem in
(Tarbouriech et al., 2021c).
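As a rough, self-contained illustration of how the generative-model lower bound above scales, here is a minimal sketch; the function name and the parameter values are hypothetical, and constants hidden by the $\Omega(\cdot)$ notation are ignored.
```python
# Sketch: order of magnitude of the worst-case sample requirement
# Omega(S * A * B_star^3 / (c_min * eps^2)) stated in the abstract.
# Constants inside Omega(.) are ignored; all numbers below are hypothetical.

def ssp_sample_lower_bound(S: int, A: int, B_star: float, c_min: float, eps: float) -> float:
    """Order of samples needed to return an eps-optimal SSP policy (worst case)."""
    if c_min <= 0:
        # As c_min -> 0 the bound diverges: such an SSP instance may not be learnable.
        return float("inf")
    return S * A * B_star ** 3 / (c_min * eps ** 2)

# Shrinking the minimum cost inflates the requirement without bound.
for c_min in (1.0, 0.1, 0.01, 0.0):
    print(f"c_min={c_min}: ~{ssp_sample_lower_bound(10, 5, 3.0, c_min, 0.1):.3g} samples")
```
To make the second part of the abstract concrete, the following is a minimal value-iteration sketch on a known tabular SSP in which an extra action moves directly to the goal at a fixed cost, mirroring the reachability assumption under which the paper studies best-policy identification; the random model, the reset cost, and the tolerance are illustrative assumptions, not the paper's algorithm.
```python
import numpy as np

# Minimal SSP value iteration on a known tabular model with a "reach-goal"
# action of fixed cost available in every state (illustrative assumption only).
S, A = 4, 2                                       # non-goal states, regular actions
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S + 1), size=(S, A))    # P[s, a] over the S states plus the goal
c = rng.uniform(0.1, 1.0, size=(S, A))            # per-step costs, bounded away from 0
reset_cost = 2.0                                  # fixed cost of jumping straight to the goal

V = np.zeros(S)                                   # expected cost-to-goal; the goal itself costs 0
for _ in range(10_000):
    Q = c + P[:, :, :S] @ V                       # the goal column contributes 0 to the backup
    V_new = np.minimum(Q.min(axis=1), reset_cost) # best regular action vs. direct reset
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print("Expected cost-to-goal per state:", V)      # each value is at most reset_cost
```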
Related papers
- Finding good policies in average-reward Markov Decision Processes without prior knowledge [19.89784209009327]
We revisit the identification of an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDPs).
By relying on a diameter estimation procedure, we propose the first algorithm for $(\varepsilon,\delta)$-PAC policy identification.
arXiv Detail & Related papers (2024-05-27T12:24:14Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound on the cost function or an upper bound on the expected length of the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Sharper Model-free Reinforcement Learning for Average-reward Markov
Decision Processes [21.77276136591518]
We develop provably efficient model-free reinforcement learning (RL) algorithms for average-reward Markov Decision Processes (MDPs).
In the simulator setting, we propose a model-free RL algorithm that finds an $\epsilon$-optimal policy using $\widetilde{O}\left(\frac{SA\,\mathrm{sp}(h^*)}{\epsilon^2}+\frac{S^2A\,\mathrm{sp}(h^*)}{\epsilon^2}\right)$ samples.
arXiv Detail & Related papers (2023-06-28T17:43:19Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes (MDPs) in the fixed confidence setting under a generative model.
The lower bound, expressed as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement
Learning with Latent Low-Rank Structure [9.759209713196718]
We consider a class of MDPs for which the associated optimal $Q^*$ function is low rank, where the latent features are unknown.
We show that under stronger low rank structural assumptions, given access to a generative model, Low Rank Monte Carlo Policy Iteration (LR-MCPI) and Low Rank Empirical Value Iteration (LR-EVI) achieve the desired sample complexity of $\tilde{O}\left((|S|+|A|)\,\mathrm{poly}(d,H)/\epsilon^2\right)$ for a rank-$d$ setting.
arXiv Detail & Related papers (2022-06-07T20:39:51Z) - Stochastic Shortest Path: Minimax, Parameter-Free and Towards
Horizon-Free Regret [144.6358229217845]
We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state.
We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus.
We prove that EB-SSP achieves the minimax regret rate $\widetilde{O}(B_{\star}\sqrt{SAK})$, where $K$ is the number of episodes, $S$ is the number of states, and $A$ is the number of actions.
arXiv Detail & Related papers (2021-04-22T17:20:48Z) - Improved Sample Complexity for Incremental Autonomous Exploration in
MDPs [132.88757893161699]
We learn the set of $\epsilon$-optimal goal-conditioned policies attaining all states that are incrementally reachable within $L$ steps.
DisCo is the first algorithm that can return an $\epsilon/c_{\min}$-optimal policy for any cost-sensitive shortest-path problem.
arXiv Detail & Related papers (2020-12-29T14:06:09Z) - Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample
Complexity [59.34067736545355]
Given an MDP with $S$ states, $A$ actions, the discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy.
For small enough $\epsilon$, we show an algorithm with improved sample complexity.
arXiv Detail & Related papers (2020-06-06T13:34:41Z)