Finding the Stochastic Shortest Path with Low Regret: The Adversarial
Cost and Unknown Transition Case
- URL: http://arxiv.org/abs/2102.05284v1
- Date: Wed, 10 Feb 2021 06:33:04 GMT
- Title: Finding the Stochastic Shortest Path with Low Regret: The Adversarial
Cost and Unknown Transition Case
- Authors: Liyu Chen and Haipeng Luo
- Abstract summary: We make significant progress toward the stochastic shortest path problem with adversarial costs and unknown transition.
Specifically, we develop algorithms that achieve $\widetilde{O}(\sqrt{S^2ADT_\star K})$ regret for the full-information setting and $\widetilde{O}(\sqrt{S^3A^2DT_\star K})$ regret for the bandit feedback setting.
We are also the first to consider the most challenging combination: bandit feedback with adversarial costs and unknown transition.
- Score: 29.99619764839178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We make significant progress toward the stochastic shortest path problem with
adversarial costs and unknown transition. Specifically, we develop algorithms
that achieve $\widetilde{O}(\sqrt{S^2ADT_\star K})$ regret for the
full-information setting and $\widetilde{O}(\sqrt{S^3A^2DT_\star K})$ regret
for the bandit feedback setting, where $D$ is the diameter, $T_\star$ is the
expected hitting time of the optimal policy, $S$ is the number of states, $A$
is the number of actions, and $K$ is the number of episodes. Our work strictly
improves (Rosenberg and Mansour, 2020) in the full information setting, extends
(Chen et al., 2020) from known transition to unknown transition, and is also
the first to consider the most challenging combination: bandit feedback with
adversarial costs and unknown transition. To remedy the gap between our upper
bounds and the current best lower bounds constructed via a stochastically
oblivious adversary, we also propose algorithms with near-optimal regret for
this special case.
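As a point of reference, the quantity being bounded is the adversarial-cost SSP regret; in a common formulation (the paper's exact definition may differ in minor details),
\[
R_K \;=\; \sum_{k=1}^{K} \sum_{i=1}^{I_k} c_k\big(s_i^k, a_i^k\big) \;-\; \min_{\pi} \sum_{k=1}^{K} J_k^{\pi},
\]
where episode $k$ lasts $I_k$ steps until the goal is reached, $c_k$ is the cost function chosen by the adversary for episode $k$, $J_k^{\pi}$ is the expected cost under $c_k$ of a proper policy $\pi$ (one that reaches the goal with probability 1), and the minimum is over proper policies; $T_\star$ is the expected hitting time of this best policy in hindsight.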
Related papers
- Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit
Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary
Environments [40.027926921772355]
We study dynamic regret for goal-oriented reinforcement learning.
The different roles of $\Delta_c$ and $\Delta_P$ in this lower bound inspire us to design algorithms that estimate costs and transitions separately.
arXiv Detail & Related papers (2022-05-25T20:29:01Z) - Corralling a Larger Band of Bandits: A Case Study on Switching Regret
for Linear Bandits [99.86860277006318]
We consider the problem of combining and learning over a set of adversarial algorithms with the goal of adaptively tracking the best one on the fly.
The CORRAL algorithm of Agarwal et al. achieves this goal with a regret overhead of order $\widetilde{O}(\sqrt{MT})$, where $M$ is the number of base algorithms and $T$ is the time horizon.
Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only logarithmic dependence on $M$, as long…
arXiv Detail & Related papers (2022-02-12T21:55:44Z) - The best of both worlds: stochastic and adversarial episodic MDPs with
unknown transition [49.78053380710322]
We consider the best-of-both-worlds problem for learning an episodic Markov Decision Process through $T$ episodes.
Recent work by [Jin and Luo, 2020] achieves this goal when the fixed transition is known.
In this work, we resolve this open problem by using the same Follow-the-Regularized-Leader (FTRL) framework together with a set of new techniques.
arXiv Detail & Related papers (2021-06-08T05:46:35Z) - Minimax Regret for Stochastic Shortest Path [63.45407095296692]
We study the Stochastic Shortest Path (SSP) problem, in which an agent has to reach a goal state at minimum total expected cost.
We show that the minimax regret for this setting is $\widetilde{O}(B_\star \sqrt{|S||A|K})$, where $B_\star$ is a bound on the expected cost of the optimal policy from any state.
Our algorithm runs in polynomial time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
arXiv Detail & Related papers (2021-03-24T10:11:49Z) - Minimax Regret for Stochastic Shortest Path with Adversarial Costs and
Known Transition [37.6975819766632]
We study the stochastic shortest path problem with adversarial costs and known transition.
We show that the minimax regret is $\widetilde{O}(\sqrt{DT_\star K})$ and $\widetilde{O}(\sqrt{DT_\star SA K})$ for the full-information setting and the bandit feedback setting, respectively.
arXiv Detail & Related papers (2020-12-07T20:55:28Z) - Simultaneously Learning Stochastic and Adversarial Episodic MDPs with
Known Transition [38.28925339231888]
We develop the first algorithm with a "best-of-both-worlds" guarantee.
It achieves $\mathcal{O}(\log T)$ regret when the losses are stochastic.
More generally, it achieves $\tilde{\mathcal{O}}(\sqrt{C})$ regret in an intermediate setting where the losses are corrupted by a total amount of $C$.
arXiv Detail & Related papers (2020-06-10T01:59:34Z) - Near-optimal Regret Bounds for Stochastic Shortest Path [63.029132134792555]
Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state at minimum total expected cost.
We show that any learning algorithm must have at least $\Omega(B_\star \sqrt{|S||A|K})$ regret in the worst case.
arXiv Detail & Related papers (2020-02-23T09:10:14Z)
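To make the interaction model concrete, below is a minimal Python sketch of the episodic SSP protocol that the paper above and its related works study; all names (transition, adversary_costs, policy) and the toy dynamics are illustrative assumptions, not any paper's actual algorithm or API.

import random

# Illustrative tabular SSP interaction loop (all names here are assumptions;
# this sketches the protocol studied above, not any paper's algorithm).

S, A, GOAL = 5, 2, 4           # states 0..4, actions 0..1, goal state 4
K = 100                        # number of episodes

def transition(s, a):
    """Dynamics unknown to the learner: a random walk drifting toward the goal."""
    return min(s + 1, GOAL) if random.random() < 0.7 else max(s - 1, 0)

def adversary_costs(k):
    """Cost function c_k in [0, 1], chosen adversarially before episode k."""
    return [[random.random() for _ in range(A)] for _ in range(S)]

def policy(s):
    """Placeholder learner; a real algorithm updates this from feedback."""
    return random.randrange(A)

total_cost = 0.0
for k in range(K):
    c_k = adversary_costs(k)   # fixed before the episode begins
    s = 0                      # initial state
    while s != GOAL:           # an episode ends only when the goal is reached
        a = policy(s)
        total_cost += c_k[s][a]
        # Bandit feedback: only c_k[s][a] at visited pairs is observed.
        # Full information: the entire table c_k is revealed after the episode.
        s = transition(s, a)
print(f"Total cost over {K} episodes: {total_cost:.1f}")

The regret of a learner in this protocol is its total cost minus that of the best proper policy in hindsight, as in the display above; the full-information and bandit bounds in the abstract differ exactly in which of the two feedback models the learner receives.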