Minimax Regret for Stochastic Shortest Path with Adversarial Costs and
Known Transition
- URL: http://arxiv.org/abs/2012.04053v2
- Date: Wed, 10 Feb 2021 06:04:44 GMT
- Title: Minimax Regret for Stochastic Shortest Path with Adversarial Costs and
Known Transition
- Authors: Liyu Chen, Haipeng Luo, Chen-Yu Wei
- Abstract summary: We study the stochastic shortest path problem with adversarial costs and known transition.
We show that the minimax regret is $\widetilde{O}(\sqrt{D T^\star K})$ and $\widetilde{O}(\sqrt{D T^\star S A K})$ for the full-information setting and the bandit feedback setting, respectively.
- Score: 37.6975819766632
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the stochastic shortest path problem with adversarial costs and
known transition, and show that the minimax regret is
$\widetilde{O}(\sqrt{DT^\star K})$ and $\widetilde{O}(\sqrt{DT^\star SA K})$
for the full-information setting and the bandit feedback setting respectively,
where $D$ is the diameter, $T^\star$ is the expected hitting time of the
optimal policy, $S$ is the number of states, $A$ is the number of actions, and
$K$ is the number of episodes. Our results significantly improve upon the
existing work of (Rosenberg and Mansour, 2020) which only considers the
full-information setting and achieves suboptimal regret. Our work is also the
first to consider bandit feedback with adversarial costs.
Our algorithms are built on top of the Online Mirror Descent framework with a
variety of new techniques that might be of independent interest, including an
improved multi-scale expert algorithm, a reduction from general stochastic
shortest path to a special loop-free case, a skewed occupancy measure space,
and a
novel correction term added to the cost estimators. Interestingly, the last two
elements reduce the variance of the learner via positive bias and the variance
of the optimal policy via negative bias respectively, and having them
simultaneously is critical for obtaining the optimal high-probability bound in
the bandit feedback setting.
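To make the Online Mirror Descent framework concrete, below is a minimal, self-contained sketch, assuming a toy layered SSP and full-information feedback; it is not the paper's algorithm. OMD with an entropy regularizer over a finite policy set reduces to exponential weights, so the sketch simply runs that update over all deterministic policies of the toy instance (the toy MDP, the learning rate, and every name here are illustrative assumptions):

```python
"""Illustrative sketch only -- not the paper's algorithm.

Online Mirror Descent (OMD) with an entropy regularizer is exactly the
exponential-weights update, so we run it over the (small) set of
deterministic policies of a toy loop-free SSP with known transition and
full-information adversarial costs.  The paper instead runs OMD efficiently
over a (skewed) occupancy-measure space with corrected cost estimators;
enumerating policies as done here is exponential in general.
"""
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy layered (hence loop-free) SSP: LAYERS steps, STATES states per layer.
LAYERS, STATES, ACTIONS = 3, 2, 2
# Known transition: P[h, s, a] is a distribution over next-layer states.
P = rng.dirichlet(np.ones(STATES), size=(LAYERS, STATES, ACTIONS))

def expected_cost(policy, cost):
    """Expected cumulative cost of a deterministic policy under cost[h, s, a]."""
    mu = np.zeros(STATES)
    mu[0] = 1.0                        # episodes start in state 0 of layer 0
    total = 0.0
    for h in range(LAYERS):
        nxt = np.zeros(STATES)
        for s in range(STATES):
            a = policy[h, s]
            total += mu[s] * cost[h, s, a]
            nxt += mu[s] * P[h, s, a]  # push occupancy to the next layer
        mu = nxt
    return total

# Enumerate all deterministic policies (one action per layer-state pair).
policies = [np.array(p).reshape(LAYERS, STATES)
            for p in itertools.product(range(ACTIONS), repeat=LAYERS * STATES)]

eta = 0.5                              # OMD learning rate (untuned)
w = np.full(len(policies), 1.0 / len(policies))   # uniform prior

K = 200
cum_learner, cum_each = 0.0, np.zeros(len(policies))
for k in range(K):
    # Stand-in for an adversary: fresh arbitrary costs every episode.
    cost_k = rng.random((LAYERS, STATES, ACTIONS))
    losses = np.array([expected_cost(p, cost_k) for p in policies])
    cum_learner += w @ losses          # learner plays the mixture policy
    cum_each += losses
    # OMD step with entropy mirror map == multiplicative-weights update.
    w = w * np.exp(-eta * losses)
    w /= w.sum()

print(f"regret vs. best fixed policy: {cum_learner - cum_each.min():.3f}")
```

Under bandit feedback the losses above would have to be estimated from a single trajectory, which is where the abstract's last two elements come in: the skewed occupancy measure space injects positive bias to control the learner's variance, while the correction term injects negative bias to control the optimal policy's variance.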
Related papers
- Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound on the per-iteration cost function or an upper bound on the expected length of the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1} A K})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{S A K})$ under a continuity assumption and in general attains a near-optimal regret.
arXiv Detail & Related papers (2023-02-07T02:22:31Z) - Stochastic Shortest Path: Minimax, Parameter-Free and Towards
Horizon-Free Regret [144.6358229217845]
We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state.
We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus.
We prove that EB-SSP achieves the minimax regret rate $\widetilde{O}(B_\star \sqrt{S A K})$, where $K$ is the number of episodes, $S$ is the number of states, and $A$ is the number of actions.
arXiv Detail & Related papers (2021-04-22T17:20:48Z) - Minimax Regret for Stochastic Shortest Path [63.45407095296692]
We study the Stochastic Shortest Path (SSP) problem in which an agent has to reach a goal state in minimum total expected cost.
We show that the minimax regret for this setting is $\widetilde{O}(B_\star \sqrt{|S| |A| K})$, where $B_\star$ is a bound on the expected cost of the optimal policy from any state.
Our algorithm runs in polynomial time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
arXiv Detail & Related papers (2021-03-24T10:11:49Z) - Finding the Stochastic Shortest Path with Low Regret: The Adversarial
Cost and Unknown Transition Case [29.99619764839178]
We make significant progress toward the stochastic shortest path problem with adversarial costs and unknown transition.
Specifically, we develop algorithms that achieve $\widetilde{O}(\sqrt{S^3 A^2 D T_\star K})$ regret for the full-information setting.
We are also the first to consider the most challenging combination: bandit feedback with adversarial costs and unknown transition.
arXiv Detail & Related papers (2021-02-10T06:33:04Z) - Near-optimal Regret Bounds for Stochastic Shortest Path [63.029132134792555]
Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost.
We show that any learning algorithm must have at least $\Omega(B_\star \sqrt{|S| |A| K})$ regret in the worst case.
arXiv Detail & Related papers (2020-02-23T09:10:14Z)
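For orientation, the regret notion compared across these entries is the episodic one; for adversarial costs (as in the paper above) it is standardly written as follows, stated here for convenience rather than quoted from any single abstract:

```latex
% K-episode regret against the best fixed proper (goal-reaching) policy,
% where J^{\pi}(c_k) denotes the expected cumulative cost of policy \pi
% under the episode-k cost function c_k:
R_K = \sum_{k=1}^{K} J^{\pi_k}(c_k) \;-\; \min_{\pi} \sum_{k=1}^{K} J^{\pi}(c_k).
```

For the stochastic-cost entries, $c_k$ is a fixed cost function and the comparator becomes $K$ times the optimal expected cost.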