Online Learning for Stochastic Shortest Path Model via Posterior
Sampling
- URL: http://arxiv.org/abs/2106.05335v1
- Date: Wed, 9 Jun 2021 18:46:39 GMT
- Title: Online Learning for Stochastic Shortest Path Model via Posterior
Sampling
- Authors: Mehdi Jafarnia-Jahromi, Liyu Chen, Rahul Jain, Haipeng Luo
- Abstract summary: PSRL-SSP is a posterior sampling-based reinforcement learning algorithm for the Stochastic Shortest Path (SSP) problem.
It is the first such posterior sampling algorithm and numerically outperforms previously proposed optimism-based algorithms.
- Score: 29.289190242826688
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the problem of online reinforcement learning for the Stochastic
Shortest Path (SSP) problem modeled as an unknown MDP with an absorbing state.
We propose PSRL-SSP, a simple posterior sampling-based reinforcement learning
algorithm for the SSP problem. The algorithm operates in epochs. At the
beginning of each epoch, a sample is drawn from the posterior distribution on
the unknown model dynamics, and the optimal policy with respect to the drawn
sample is followed during that epoch. An epoch completes if either the number
of visits to the goal state in the current epoch exceeds that of the previous
epoch, or the number of visits to any of the state-action pairs is doubled. We
establish a Bayesian regret bound of $O(B_\star S\sqrt{AK})$, where $B_\star$
is an upper bound on the expected cost of the optimal policy, $S$ is the size
of the state space, $A$ is the size of the action space, and $K$ is the number
of episodes. The algorithm only requires the knowledge of the prior
distribution, and has no hyper-parameters to tune. It is the first such
posterior sampling algorithm and numerically outperforms previously proposed
optimism-based algorithms.
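
To make the epoch structure concrete, here is a minimal Python sketch of the loop described in the abstract. The Dirichlet prior, the value-iteration planner, the environment interface `step_fn`, and the toy example are illustrative assumptions made for this summary, not the authors' implementation; only the posterior-sampling step and the two epoch-stopping rules follow the abstract.

```python
import numpy as np


class DirichletPrior:
    """Independent Dirichlet prior over next-state distributions, one per (s, a)."""

    def __init__(self, n_states, n_actions, concentration=0.1):
        self.alpha = np.full((n_states, n_actions, n_states), concentration)

    def sample(self):
        P = np.empty_like(self.alpha)
        for s in range(self.alpha.shape[0]):
            for a in range(self.alpha.shape[1]):
                P[s, a] = np.random.dirichlet(self.alpha[s, a])
        return P

    def update(self, s, a, s_next):
        self.alpha[s, a, s_next] += 1.0


def solve_ssp(P, costs, goal, n_iters=500):
    """Value iteration for the sampled SSP model; the goal state is absorbing and cost-free."""
    V = np.zeros(P.shape[0])
    for _ in range(n_iters):
        Q = costs + P @ V            # Q[s, a] = c(s, a) + sum_s' P(s' | s, a) V(s')
        V = Q.min(axis=1)
        V[goal] = 0.0
    return Q.argmin(axis=1)          # greedy (deterministic) policy for the sampled model


def psrl_ssp(step_fn, costs, goal, start, num_episodes, prior):
    """Epoch-based posterior sampling for SSP, following the stopping rules in the abstract."""
    counts = np.zeros(costs.shape)   # lifetime visit counts N(s, a)
    prev_goal_visits = 0
    episode, s = 0, start
    while episode < num_episodes:
        # Start of an epoch: draw one model from the posterior and plan against it.
        policy = solve_ssp(prior.sample(), costs, goal)
        counts_at_start = counts.copy()
        goal_visits = 0
        while True:
            a = policy[s]
            s_next = step_fn(s, a)
            prior.update(s, a, s_next)
            counts[s, a] += 1
            if s_next == goal:
                goal_visits += 1
                episode += 1
                s = start            # each episode restarts from the initial state
            else:
                s = s_next
            # Epoch ends when goal visits exceed last epoch's, or when some
            # state-action count doubles (the usual max(1, N) doubling convention).
            doubled = np.any(counts - counts_at_start >= np.maximum(counts_at_start, 1))
            if goal_visits > prev_goal_visits or doubled or episode >= num_episodes:
                break
        prev_goal_visits = goal_visits
    return policy


# Toy usage (hypothetical 3-state chain with absorbing goal state 2 and two actions):
P_true = np.array([[[0.6, 0.4, 0.0], [0.1, 0.7, 0.2]],
                   [[0.3, 0.5, 0.2], [0.0, 0.4, 0.6]],
                   [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
costs = np.array([[1.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
step = lambda s, a: np.random.choice(3, p=P_true[s, a])
policy = psrl_ssp(step, costs, goal=2, start=0, num_episodes=50,
                  prior=DirichletPrior(3, 2))
```

Consistent with the abstract, the only input beyond the environment is the prior distribution; the planner's iteration count and the prior's concentration parameter are artifacts of this sketch rather than tuning knobs of the algorithm itself.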
Related papers
- Finite-Sample Analysis of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning [0.0]
We prove a novel result on the convergence rate of the Monte Carlo Exploring Starts algorithm.
We show that the algorithm returns an optimal policy after $\tilde{O}(SAK^3\log^3\frac{1}{\delta})$ sampled episodes.
arXiv Detail & Related papers (2024-10-03T21:11:29Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound on the cost function or an upper bound on the expected length of the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space [0.0]
We study the problem of optimal control of a family of discrete-time countable state-space Markov Decision Processes.
We propose an algorithm based on Thompson sampling with dynamically-sized episodes.
We show that our algorithm can be applied to develop approximately optimal control algorithms.
arXiv Detail & Related papers (2023-06-05T03:57:16Z) - Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL).
We instead sample the Q function directly from its posterior distribution using Langevin Monte Carlo (a generic sketch of this sampling step is given after this list).
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
arXiv Detail & Related papers (2023-05-29T17:11:28Z) - Reaching Goals is Hard: Settling the Sample Complexity of the Stochastic
Shortest Path [106.37656068276902]
We study the sample complexity of learning an $\epsilon$-optimal policy in the Stochastic Shortest Path (SSP) problem.
We derive complexity bounds when the learner has access to a generative model.
We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states $B_\star$.
arXiv Detail & Related papers (2022-10-10T18:34:32Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov
Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that the separation between reward-free and reward-aware RL does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z) - Stochastic Shortest Path: Minimax, Parameter-Free and Towards
Horizon-Free Regret [144.6358229217845]
We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state.
We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus.
We prove that EB-SSP achieves the minimax regret rate $\widetilde{O}(B_\star \sqrt{SAK})$, where $K$ is the number of episodes, $S$ is the number of states, and $A$ is the number of actions.
arXiv Detail & Related papers (2021-04-22T17:20:48Z) - Model-based Reinforcement Learning for Continuous Control with Posterior
Sampling [10.91557009257615]
We study model-based posterior sampling for reinforcement learning (PSRL) in continuous state-action spaces.
We present MPC-PSRL, a model-based posterior sampling algorithm with model predictive control for action selection.
arXiv Detail & Related papers (2020-11-20T21:00:31Z) - Breaking the Sample Size Barrier in Model-Based Reinforcement Learning
with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator).
We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
arXiv Detail & Related papers (2020-05-26T17:53:18Z)
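
The Langevin Monte Carlo entry above refers to drawing Q-function samples directly from a posterior. The following is a generic, hypothetical sketch of that sampling step only: the function names, the Gaussian toy posterior, and the step size are assumptions made for illustration, not the paper's algorithm. It shows how noisy gradient steps on a negative log-posterior yield an approximate posterior sample, which can then drive Thompson-sampling-style action selection.

```python
import numpy as np

def langevin_sample(grad_neg_log_post, theta0, step=1e-3, n_steps=2000, rng=None):
    """Unadjusted Langevin algorithm: gradient descent on the negative log-posterior
    plus Gaussian noise; the final iterate is an approximate posterior sample."""
    rng = np.random.default_rng() if rng is None else rng
    theta = theta0.astype(float).copy()
    for _ in range(n_steps):
        theta = (theta
                 - step * grad_neg_log_post(theta)
                 + np.sqrt(2.0 * step) * rng.normal(size=theta.shape))
    return theta

# Toy check: Gaussian posterior N(mu, sigma^2 I) over linear Q-function weights,
# whose negative log-density has gradient (theta - mu) / sigma^2.
mu, sigma = np.array([1.0, -0.5]), 0.3
q_weights = langevin_sample(lambda t: (t - mu) / sigma**2, np.zeros(2))
# Acting greedily with respect to the sampled Q-weights reproduces the exploration
# behaviour of Thompson sampling without maintaining an explicit posterior over models.
```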