Minimax Regret for Stochastic Shortest Path
- URL: http://arxiv.org/abs/2103.13056v1
- Date: Wed, 24 Mar 2021 10:11:49 GMT
- Title: Minimax Regret for Stochastic Shortest Path
- Authors: Alon Cohen, Yonathan Efroni, Yishay Mansour and Aviv Rosenberg
- Abstract summary: We study the Stochastic Shortest Path (SSP) problem in which an agent has to reach a goal state in minimum total expected cost.
We show that the minimax regret for this setting is $\widetilde O(B_\star \sqrt{|S| |A| K})$, where $B_\star$ is a bound on the expected cost of the optimal policy from any state.
Our algorithm runs in polynomial time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
- Score: 63.45407095296692
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the Stochastic Shortest Path (SSP) problem in which an agent has to
reach a goal state in minimum total expected cost. In the learning formulation
of the problem, the agent has no prior knowledge about the costs and dynamics
of the model. She repeatedly interacts with the model for $K$ episodes, and has
to learn to approximate the optimal policy as closely as possible. In this work
we show that the minimax regret for this setting is $\widetilde O(B_\star
\sqrt{|S| |A| K})$ where $B_\star$ is a bound on the expected cost of the
optimal policy from any state, $S$ is the state space, and $A$ is the action
space. This matches the lower bound of Rosenberg et al. (2020) up to
logarithmic factors, and improves their regret bound by a factor of
$\sqrt{|S|}$. Our algorithm runs in polynomial-time per episode, and is based
on a novel reduction to reinforcement learning in finite-horizon MDPs. To that
end, we provide an algorithm for the finite-horizon setting whose leading term
in the regret depends only logarithmically on the horizon, yielding the same
regret guarantees for SSP.
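To make the guarantee concrete, here is the standard SSP regret formalization this literature uses (a sketch following the usual setup; the notation $J^\pi$ and $s_{\mathrm{init}}$ is ours, not quoted from the paper). Writing $J^\pi(s)$ for the expected total cost of policy $\pi$ from state $s$, the regret over $K$ episodes is
$$ R_K \;=\; \sum_{k=1}^{K} \sum_{i=1}^{I_k} c_i^k \;-\; K \cdot \min_{\pi} J^{\pi}(s_{\mathrm{init}}), $$
where $I_k$ is the number of steps episode $k$ takes to reach the goal and $c_i^k$ is the cost incurred at its $i$-th step. With $B_\star \ge \max_s J^{\pi^\star}(s)$, the paper's upper bound $R_K = \widetilde O(B_\star \sqrt{|S| |A| K})$ matches the $\Omega(B_\star \sqrt{|S| |A| K})$ lower bound of Rosenberg et al. (2020) up to logarithmic factors.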
Related papers
- Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound on the cost function or an upper bound on the expected length of the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z)
- Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost.
We extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z)
- Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1} A K})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde O(\tau^{-1} \sqrt{S A K})$ under a continuity assumption, and in general attains a near-optimal regret.
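For context, the standard Rockafellar–Uryasev definition of CVaR, stated here for a random cost $X$ (this is the textbook definition, not quoted from the abstract; the paper's exact sign convention for rewards may differ):
$$ \mathrm{CVaR}_\tau(X) \;=\; \inf_{\rho \in \mathbb{R}} \left\{ \rho + \frac{1}{\tau}\, \mathbb{E}\big[(X - \rho)^+\big] \right\}, $$
which recovers the expected cost at $\tau = 1$ and approaches the worst-case cost as $\tau \to 0$.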
arXiv Detail & Related papers (2023-02-07T02:22:31Z)
- Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary Environments [40.027926921772355]
We study dynamic regret for goal-oriented reinforcement learning in non-stationary environments.
The different roles of the cost variation $\Delta_c$ and the transition variation $\Delta_P$ in the lower bound inspire the design of algorithms that estimate costs and transitions separately.
arXiv Detail & Related papers (2022-05-25T20:29:01Z)
- Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret [144.6358229217845]
We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state.
We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus.
We prove that EB-SSP achieves the minimax regret rate $\widetilde O(B_\star \sqrt{S A K})$, where $K$ is the number of episodes, $S$ is the number of states, and $A$ is the number of actions.
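The "skew and perturb" idea lends itself to a compact sketch. The Python below is an illustrative reconstruction under simplified assumptions (a Hoeffding-style cost bonus and a uniform skew of transition mass toward the goal); the actual EB-SSP bonus and skewing are more refined than this.
```python
import numpy as np

def optimistic_model(counts, cost_sums, trans_counts, B_star, delta=0.1):
    """Build bonus-perturbed costs and goal-skewed transitions (sketch only).

    counts:       n[s, a]      -- visit counts per state-action pair
    cost_sums:    C[s, a]      -- accumulated observed costs
    trans_counts: m[s, a, s']  -- observed transition counts (last index = goal)
    """
    n = np.maximum(counts, 1)
    c_hat = cost_sums / n                    # empirical mean costs
    p_hat = trans_counts / n[..., None]      # empirical transition kernel

    # Hoeffding-style exploration bonus: shrinks as (s, a) is visited more.
    bonus = B_star * np.sqrt(np.log(counts.size / delta) / n)
    c_opt = np.maximum(c_hat - bonus, 0.0)   # optimistic (lowered) costs

    # Skew each empirical transition slightly toward the goal state, so every
    # policy reaches the goal under the model and value iteration stays finite.
    eta = 1.0 / n[..., None]
    p_skew = (1.0 - eta) * p_hat
    p_skew[..., -1] += eta.squeeze(-1)       # extra mass on the goal state
    return c_opt, p_skew
```
Planning the greedy policy in the resulting optimistic SSP model then yields optimism: the optimistic value underestimates the true optimal cost while the bonus decays with visitation.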
arXiv Detail & Related papers (2021-04-22T17:20:48Z)
- Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition [37.6975819766632]
We study the stochastic shortest path problem with adversarial costs and a known transition function.
We show that the minimax regret is $\widetilde O(\sqrt{D T_\star K})$ for the full-information setting and $\widetilde O(\sqrt{D T_\star S A K})$ for the bandit feedback setting, where $D$ is the diameter and $T_\star$ is the expected hitting time of the optimal policy.
arXiv Detail & Related papers (2020-12-07T20:55:28Z)
- Near-optimal Regret Bounds for Stochastic Shortest Path [63.029132134792555]
Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost.
We show that any learning algorithm must have at least $\Omega(B_\star \sqrt{|S| |A| K})$ regret in the worst case.
arXiv Detail & Related papers (2020-02-23T09:10:14Z)