Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path
- URL: http://arxiv.org/abs/2402.08998v1
- Date: Wed, 14 Feb 2024 07:52:00 GMT
- Title: Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path
- Authors: Qiwei Di, Jiafan He, Dongruo Zhou, Quanquan Gu
- Abstract summary: We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with a stochastic environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the cost function or an upper bound on the expected length of the optimal policy.
- Score: 80.60592344361073
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the Stochastic Shortest Path (SSP) problem with a linear mixture
transition kernel, where an agent repeatedly interacts with a stochastic
environment and seeks to reach a certain goal state while minimizing the
cumulative cost. Existing works often assume a strictly positive lower bound of
the cost function or an upper bound of the expected length for the optimal
policy. In this paper, we propose a new algorithm to eliminate these
restrictive assumptions. Our algorithm is based on extended value iteration
with a fine-grained variance-aware confidence set, where the variance is
estimated recursively from high-order moments. Our algorithm achieves an
$\tilde{\mathcal O}(dB_*\sqrt{K})$ regret bound, where $d$ is the dimension of
the feature mapping in the linear transition kernel, $B_*$ is the upper bound
of the total cumulative cost for the optimal policy, and $K$ is the number of
episodes. Our regret upper bound matches the $\Omega(dB_*\sqrt{K})$ lower bound
of linear mixture SSPs in Min et al. (2022), which suggests that our algorithm
is nearly minimax optimal.
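To make the abstract's recipe concrete, here is a minimal, hypothetical Python sketch of the variance-weighted ridge regression that typically underlies variance-aware confidence sets for linear mixture models, assuming the standard setup where $\mathbb{E}[V(s')\mid s,a] = \langle \phi_V(s,a), \theta^*\rangle$. This is not the paper's implementation: the class `WeightedRidge`, the feature `phi_v`, the weight `sigma2_bar`, and the radius `beta` are illustrative names, and the recursive high-order-moment variance estimation is abstracted into the `sigma2_bar` argument.

```python
import numpy as np

class WeightedRidge:
    """Illustrative variance-weighted ridge regression (not the paper's code)."""

    def __init__(self, d: int, lam: float = 1.0):
        self.Sigma = lam * np.eye(d)  # regularized Gram matrix
        self.b = np.zeros(d)          # accumulated weighted targets

    def update(self, phi_v: np.ndarray, target: float, sigma2_bar: float):
        # Down-weight each sample by its estimated variance, so low-noise
        # transitions shrink the confidence set faster than noisy ones.
        w = 1.0 / sigma2_bar
        self.Sigma += w * np.outer(phi_v, phi_v)
        self.b += w * target * phi_v

    def estimate(self) -> np.ndarray:
        # Weighted ridge estimate: theta_hat = Sigma^{-1} b.
        return np.linalg.solve(self.Sigma, self.b)

    def bonus(self, phi_v: np.ndarray, beta: float) -> float:
        # Elliptical optimism bonus: beta * ||phi_v||_{Sigma^{-1}}.
        return beta * float(np.sqrt(phi_v @ np.linalg.solve(self.Sigma, phi_v)))

# Toy check on synthetic data with heteroscedastic noise.
rng = np.random.default_rng(0)
d = 4
theta_star = rng.normal(size=d) / np.sqrt(d)
model = WeightedRidge(d)
for _ in range(2000):
    phi_v = rng.normal(size=d)
    sd = rng.uniform(0.1, 1.0)  # per-sample noise level
    target = phi_v @ theta_star + sd * rng.normal()
    model.update(phi_v, target, sigma2_bar=sd**2)
print(np.linalg.norm(model.estimate() - theta_star))  # small estimation error
```

Intuitively, weighting each sample by the inverse of its estimated variance tightens the confidence set on low-noise transitions, which is what lets an analysis of this flavor avoid assuming a strictly positive cost lower bound $c_{\min}$.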
Related papers
- Learning Infinite-Horizon Average-Reward Linear Mixture MDPs of Bounded Span [16.49229317664822]
This paper proposes a computationally tractable algorithm for learning infinite-horizon average-reward linear mixture Markov decision processes (MDPs).
Our algorithm for linear mixture MDPs achieves a nearly minimax optimal regret upper bound of $\widetilde{\mathcal{O}}(d\sqrt{\mathrm{sp}(v^*)T})$ over $T$ time steps.
arXiv Detail & Related papers (2024-10-19T05:45:50Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov decision processes (MDPs) in the fixed-confidence setting under a generative model.
The lower bound, expressed as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Learning Stochastic Shortest Path with Linear Function Approximation [74.08819218747341]
We study the stochastic shortest path (SSP) problem in reinforcement learning with linear function approximation, where the transition kernel is represented as a linear mixture of unknown models.
We propose a novel algorithm for learning the linear mixture SSP, which can attain a $\tilde{O}(dB_*^{1.5}\sqrt{K}/c_{\min})$ regret.
arXiv Detail & Related papers (2021-10-25T08:34:00Z) - Regret Bounds for Stochastic Shortest Path Problems with Linear Function
Approximation [29.549633678730054]
We propose two algorithms for episodic stochastic shortest path problems with linear function approximation.
The first is computationally expensive but provably obtains $\tilde{O}(\sqrt{B_*^3 d^3 K/c_{\min}})$ regret, where $B_*$ is an upper bound on the optimal cost-to-go function.
The second is computationally efficient in practice, and we conjecture that it obtains the same regret bound.
arXiv Detail & Related papers (2021-05-04T16:05:08Z) - Stochastic Shortest Path: Minimax, Parameter-Free and Towards
Horizon-Free Regret [144.6358229217845]
We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state.
We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus.
We prove that EB-SSP achieves the minimax regret rate $\widetilde{O}(B_* \sqrt{SAK})$, where $K$ is the number of episodes, $S$ is the number of states, and $A$ is the number of actions.
arXiv Detail & Related papers (2021-04-22T17:20:48Z) - Minimax Regret for Stochastic Shortest Path [63.45407095296692]
We study the Stochastic Shortest Path (SSP) problem in which an agent has to reach a goal state in minimum total expected cost.
We show that the minimax regret for this setting is $\widetilde{O}(B_* \sqrt{|S||A|K})$ where $B_*$ is a bound on the expected cost of the optimal policy from any state.
Our algorithm runs in polynomial time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
arXiv Detail & Related papers (2021-03-24T10:11:49Z) - Minimax Regret for Stochastic Shortest Path with Adversarial Costs and
Known Transition [37.6975819766632]
We study the stochastic shortest path problem with adversarial costs and known transition.
We show that the minimax regret is $\widetilde{O}(\sqrt{DT_\star K})$ and $\widetilde{O}(\sqrt{DT_\star SAK})$ for the full-information setting and the bandit feedback setting, respectively.
arXiv Detail & Related papers (2020-12-07T20:55:28Z) - Near-optimal Regret Bounds for Stochastic Shortest Path [63.029132134792555]
Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost.
We show that any learning algorithm must have at least $\Omega(B_* \sqrt{|S||A|K})$ regret in the worst case.
arXiv Detail & Related papers (2020-02-23T09:10:14Z)