Stochastic Shortest Path: Minimax, Parameter-Free and Towards
Horizon-Free Regret
- URL: http://arxiv.org/abs/2104.11186v1
- Date: Thu, 22 Apr 2021 17:20:48 GMT
- Title: Stochastic Shortest Path: Minimax, Parameter-Free and Towards
Horizon-Free Regret
- Authors: Jean Tarbouriech, Runlong Zhou, Simon S. Du, Matteo Pirotta, Michal
Valko, Alessandro Lazaric
- Abstract summary: We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state.
We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus.
We prove that EB-SSP achieves the minimax regret rate $\widetilde{O}(B_{\star} \sqrt{S A K})$, where $K$ is the number of episodes, $S$ is the number of states, and $A$ is the number of actions.
- Score: 144.6358229217845
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of learning in the stochastic shortest path (SSP)
setting, where an agent seeks to minimize the expected cost accumulated before
reaching a goal state. We design a novel model-based algorithm EB-SSP that
carefully skews the empirical transitions and perturbs the empirical costs with
an exploration bonus to guarantee both optimism and convergence of the
associated value iteration scheme. We prove that EB-SSP achieves the minimax
regret rate $\widetilde{O}(B_{\star} \sqrt{S A K})$, where $K$ is the number of
episodes, $S$ is the number of states, $A$ is the number of actions and
$B_{\star}$ bounds the expected cumulative cost of the optimal policy from any
state, thus closing the gap with the lower bound. Interestingly, EB-SSP obtains
this result while being parameter-free, i.e., it does not require any prior
knowledge of $B_{\star}$, nor of $T_{\star}$ which bounds the expected
time-to-goal of the optimal policy from any state. Furthermore, we illustrate
various cases (e.g., positive costs, or general costs when an order-accurate
estimate of $T_{\star}$ is available) where the regret only contains a
logarithmic dependence on $T_{\star}$, thus yielding the first horizon-free
regret bound beyond the finite-horizon MDP setting.
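The abstract's two ingredients, skewing the empirical transitions so that value iteration converges and perturbing the empirical costs with an exploration bonus so the model stays optimistic, can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration for the tabular case, not the paper's EB-SSP pseudocode: the skew coefficient, the Hoeffding-style bonus, the handling of unvisited pairs, and all function names are assumptions.

```python
import numpy as np

def skewed_transitions(counts, n_visits, goal, eta=1e-2):
    """Empirical transition estimates, slightly skewed toward the goal.

    counts:   (S+1, A, S+1) transition counts; state `goal` is absorbing.
    n_visits: (S+1, A) visit counts per state-action pair.
    Leaking a vanishing amount of probability mass to the goal makes every
    policy proper in the model, so value iteration converges.
    """
    n = np.maximum(n_visits, 1)
    p_hat = counts / n[:, :, None]
    # Hypothetical choice: send never-visited pairs straight to the goal.
    unvisited = n_visits == 0
    p_hat[unvisited] = 0.0
    p_hat[unvisited, goal] = 1.0
    alpha = eta / n                       # skew shrinks as data accumulates
    p_tilde = (1.0 - alpha[:, :, None]) * p_hat
    p_tilde[:, :, goal] += alpha          # leak probability mass to the goal
    return p_tilde

def perturbed_costs(c_hat, n_visits, B=1.0, delta=0.1):
    """Empirical costs minus a Hoeffding-style exploration bonus, clipped
    at zero so costs stay nonnegative; this keeps the model optimistic."""
    bonus = B * np.sqrt(np.log(1.0 / delta) / np.maximum(n_visits, 1))
    return np.maximum(c_hat - bonus, 0.0)
```

Running value iteration on the pair (skewed transitions, perturbed costs) then inherits the two properties the abstract highlights: the goal-directed skew makes every policy proper in the model, so the iteration converges, while the clipped cost bonus keeps the value estimates optimistic.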
Related papers
- Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism [1.4999444543328293]
We propose a model-based algorithm based on novel cost and reward function estimators.
Our algorithm achieves a regret upper bound of $\widetilde{\mathcal{O}}((\bar{C} - \bar{C}_b)^{-1} H^{2.5} S \sqrt{A K})$ where $\bar{C}$ is the cost budget for an episode.
arXiv Detail & Related papers (2024-10-14T04:51:06Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound on the cost function or an upper bound on the expected length of the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Reaching Goals is Hard: Settling the Sample Complexity of the Stochastic
Shortest Path [106.37656068276902]
We study the sample complexity of learning an $\epsilon$-optimal policy in the Stochastic Shortest Path (SSP) problem.
We derive complexity bounds when the learner has access to a generative model.
We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states $B_{\star}$.
arXiv Detail & Related papers (2022-10-10T18:34:32Z) - Minimax Regret for Stochastic Shortest Path [63.45407095296692]
We study the Stochastic Shortest Path (SSP) problem in which an agent has to reach a goal state in minimum total expected cost.
We show that the minimax regret for this setting is $\widetilde{O}(B_{\star} \sqrt{|S| |A| K})$ where $B_{\star}$ is a bound on the expected cost of the optimal policy from any state.
Our algorithm runs in polynomial time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
arXiv Detail & Related papers (2021-03-24T10:11:49Z) - Minimax Regret for Stochastic Shortest Path with Adversarial Costs and
Known Transition [37.6975819766632]
We study the stochastic shortest path problem with adversarial costs and known transition.
We show that the minimax regret is $\widetilde{O}(\sqrt{D T_{\star} K})$ and $\widetilde{O}(\sqrt{D T_{\star} S A K})$ for the full-information setting and the bandit feedback setting, respectively.
arXiv Detail & Related papers (2020-12-07T20:55:28Z) - Stochastic Shortest Path with Adversarially Changing Costs [57.90236104782219]
Stochastic shortest path (SSP) is a well-known problem in planning and control.
We present the adversarial SSP model that also accounts for adversarial changes in the costs over time.
We are the first to consider this natural setting of adversarial SSP and obtain sub-linear regret for it.
arXiv Detail & Related papers (2020-06-20T12:10:35Z) - Near-optimal Regret Bounds for Stochastic Shortest Path [63.029132134792555]
Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost.
We show that any learning algorithm must have at least $\Omega(B_{\star} \sqrt{|S| |A| K})$ regret in the worst case.
arXiv Detail & Related papers (2020-02-23T09:10:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.