Regret Bounds for Stochastic Shortest Path Problems with Linear Function
Approximation
- URL: http://arxiv.org/abs/2105.01593v1
- Date: Tue, 4 May 2021 16:05:08 GMT
- Title: Regret Bounds for Stochastic Shortest Path Problems with Linear Function
Approximation
- Authors: Daniel Vial, Advait Parulekar, Sanjay Shakkottai, R. Srikant
- Abstract summary: We propose two algorithms for episodic shortest path problems with linear function approximation.
The first is computationally expensive but provably obtains $\tilde{O}(\sqrt{B_\star^3 d^3 K/c_{min}})$ regret, where $B_\star$ is an upper bound on the optimal cost-to-go function.
The second is computationally efficient in practice, and we conjecture that it obtains the same regret bound.
- Score: 29.549633678730054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose two algorithms for episodic stochastic shortest path problems with
linear function approximation. The first is computationally expensive but
provably obtains $\tilde{O} (\sqrt{B_\star^3 d^3 K/c_{min}} )$ regret, where
$B_\star$ is a (known) upper bound on the optimal cost-to-go function, $d$ is
the feature dimension, $K$ is the number of episodes, and $c_{min}$ is the
minimal cost of non-goal state-action pairs (assumed to be positive). The
second is computationally efficient in practice, and we conjecture that it
obtains the same regret bound. Both algorithms are based on an optimistic
least-squares version of value iteration analogous to the finite-horizon
backward induction approach from Jin et al. 2020. To the best of our knowledge,
these are the first regret bounds for stochastic shortest path that are
independent of the size of the state and action spaces.
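The abstract describes an optimistic least-squares variant of value iteration in the spirit of LSVI-UCB (Jin et al. 2020). As a rough illustration only, a minimal sketch of one such backup might look like the following; the function names, the bonus scale `beta`, and the clipping to `[0, B_star]` are illustrative assumptions, not the paper's actual algorithm:

```python
import numpy as np

def lsvi_backup(phi, costs, next_values, lam=1.0):
    """One least-squares value-iteration backup (illustrative sketch).

    phi:         (n, d) feature matrix of visited state-action pairs
    costs:       (n,) observed one-step costs
    next_values: (n,) current value estimates at the observed next states
    Returns (w, Lambda): ridge-regression weights and the Gram matrix.
    """
    n, d = phi.shape
    Lambda = lam * np.eye(d) + phi.T @ phi        # regularized Gram matrix
    targets = costs + next_values                 # empirical Bellman targets
    w = np.linalg.solve(Lambda, phi.T @ targets)  # ridge regression solution
    return w, Lambda

def optimistic_q(w, Lambda, phi_sa, beta=1.0, B_star=1.0):
    """Optimistic Q-value for a cost-minimization SSP (illustrative).

    Since costs are minimized, optimism subtracts the elliptical
    confidence bonus; the estimate is clipped to [0, B_star].
    """
    bonus = beta * np.sqrt(phi_sa @ np.linalg.solve(Lambda, phi_sa))
    return float(np.clip(phi_sa @ w - bonus, 0.0, B_star))
```

Repeating the backup over value estimates, and acting greedily with respect to the optimistic Q-values, gives the general shape of optimistic least-squares value iteration; the paper's algorithms additionally handle the goal-reaching (indefinite-horizon) structure of SSP.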
Related papers
- Achieving Tractable Minimax Optimal Regret in Average Reward MDPs [19.663336027878408]
We present the first tractable algorithm with minimax optimal regret of $\widetilde{\mathrm{O}}(\sqrt{\mathrm{sp}(h^*) S A T})$.
Remarkably, our algorithm does not require prior information on $\mathrm{sp}(h^*)$.
arXiv Detail & Related papers (2024-06-03T11:53:44Z) - Provably Efficient Reinforcement Learning with Multinomial Logit Function Approximation [67.8414514524356]
We study a new class of MDPs that employs multinomial logit (MNL) function approximation to ensure valid probability distributions over the state space.
Introducing the non-linear function raises significant challenges in both computational and statistical efficiency.
We propose an algorithm that achieves the same regret with only $\mathcal{O}(1)$ cost.
arXiv Detail & Related papers (2024-05-27T11:31:54Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve the regret to $\tilde{\mathcal{O}}(K^{2/3})$ in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z) - Improved No-Regret Algorithms for Stochastic Shortest Path with Linear
MDP [31.62899359543925]
We introduce two new no-regret algorithms for the stochastic shortest path (SSP) problem with a linear MDP.
Our first algorithm is computationally efficient and achieves a regret bound $\widetilde{O}\left(\sqrt{d^3 B_\star^2 T_\star K}\right)$.
Our second algorithm is computationally inefficient but achieves the first "horizon-free" regret bound $\widetilde{O}(d^{3.5} B_\star \sqrt{K})$ with no dependency on $T_\star$.
arXiv Detail & Related papers (2021-12-18T06:47:31Z) - Learning Stochastic Shortest Path with Linear Function Approximation [74.08819218747341]
We study the stochastic shortest path (SSP) problem in reinforcement learning with linear function approximation, where the transition kernel is represented as a linear mixture of unknown models.
We propose a novel algorithm for learning the linear mixture SSP, which can attain a $\tilde{O}(d B_\star^{1.5} \sqrt{K/c_{min}})$ regret.
arXiv Detail & Related papers (2021-10-25T08:34:00Z) - Minimax Regret for Stochastic Shortest Path [63.45407095296692]
We study the Stochastic Shortest Path (SSP) problem in which an agent has to reach a goal state in minimum total expected cost.
We show that the minimax regret for this setting is $\widetilde{O}(B_\star \sqrt{|S| |A| K})$, where $B_\star$ is a bound on the expected cost of the optimal policy from any state.
Our algorithm runs in polynomial time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
arXiv Detail & Related papers (2021-03-24T10:11:49Z) - Near-optimal Regret Bounds for Stochastic Shortest Path [63.029132134792555]
Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost.
We show that any learning algorithm must have at least $\Omega(B_\star \sqrt{|S| |A| K})$ regret in the worst case.
arXiv Detail & Related papers (2020-02-23T09:10:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences of its use.