Near-optimal Regret Bounds for Stochastic Shortest Path
- URL: http://arxiv.org/abs/2002.09869v1
- Date: Sun, 23 Feb 2020 09:10:14 GMT
- Title: Near-optimal Regret Bounds for Stochastic Shortest Path
- Authors: Alon Cohen, Haim Kaplan, Yishay Mansour and Aviv Rosenberg
- Abstract summary: shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost.
We show that any learning algorithm must have at least $Omega(B_star sqrt|S| |A| K)$ regret in the worst case.
- Score: 63.029132134792555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic shortest path (SSP) is a well-known problem in planning and
control, in which an agent has to reach a goal state in minimum total expected
cost. In the learning formulation of the problem, the agent is unaware of the
environment dynamics (i.e., the transition function) and has to repeatedly play
for a given number of episodes while reasoning about the problem's optimal
solution. Unlike other well-studied models in reinforcement learning (RL), the
length of an episode is not predetermined (or bounded) and is influenced by the
agent's actions. Recently, Tarbouriech et al. (2019) studied this problem in
the context of regret minimization and provided an algorithm whose regret bound
is inversely proportional to the square root of the minimum instantaneous cost.
In this work we remove this dependence on the minimum cost---we give an
algorithm that guarantees a regret bound of $\widetilde{O}(B_\star |S|
\sqrt{|A| K})$, where $B_\star$ is an upper bound on the expected cost of the
optimal policy, $S$ is the set of states, $A$ is the set of actions and $K$ is
the number of episodes. We additionally show that any learning algorithm must
have at least $\Omega(B_\star \sqrt{|S| |A| K})$ regret in the worst case.
Related papers
- Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with a environment and seeks to reach certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $tau$.
We show the minimax CVaR regret rate is $Omega(sqrttau-1AK)$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $widetilde O(tau-1sqrtSAK)$ under a continuity assumption and in general attains a near
arXiv Detail & Related papers (2023-02-07T02:22:31Z) - Minimal Expected Regret in Linear Quadratic Control [79.81807680370677]
We devise an online learning algorithm and provide guarantees on its expected regret.
This regret at time $T$ is upper bounded (i) by $widetildeO((d_u+d_x)sqrtd_xT)$ when $A$ and $B$ are unknown.
arXiv Detail & Related papers (2021-09-29T14:07:21Z) - Stochastic Shortest Path: Minimax, Parameter-Free and Towards
Horizon-Free Regret [144.6358229217845]
We study the problem of learning in the shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state.
We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus.
We prove that EB-SSP achieves the minimax regret rate $widetildeO(B_star sqrtS A K)$, where $K$ is the number of episodes, $S$ is the number of states, $A$
arXiv Detail & Related papers (2021-04-22T17:20:48Z) - Minimax Regret for Stochastic Shortest Path [63.45407095296692]
We study the Shortest Path (SSP) problem in which an agent has to reach a goal state in minimum total expected cost.
We show that the minimax regret for this setting is $widetilde O(B_star sqrt|S| |A| K)$ where $B_star$ is a bound on the expected cost of the optimal policy from any state.
Our algorithm runs in-time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
arXiv Detail & Related papers (2021-03-24T10:11:49Z) - Finding the Stochastic Shortest Path with Low Regret: The Adversarial
Cost and Unknown Transition Case [29.99619764839178]
We make significant progress toward the shortest path problem with adversarial costs and unknown transition.
Specifically, we develop algorithms that achieve $widetildeO(sqrtS3A2DT_star K)$ regret for the full-information setting.
We are also the first to consider the most challenging combination: bandit feedback with adversarial costs and unknown transition.
arXiv Detail & Related papers (2021-02-10T06:33:04Z) - Minimax Regret for Stochastic Shortest Path with Adversarial Costs and
Known Transition [37.6975819766632]
We study the shortest path problem with adversarial costs and known transition.
We show that the minimax regret is $widetildeO(sqrtDTstar K)$ and $widetildeO(sqrtDTstar SA K)$ for the full-information setting and the bandit feedback setting.
arXiv Detail & Related papers (2020-12-07T20:55:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.