Optimistic Posterior Sampling for Reinforcement Learning with Few
Samples and Tight Guarantees
- URL: http://arxiv.org/abs/2209.14414v1
- Date: Wed, 28 Sep 2022 20:49:34 GMT
- Title: Optimistic Posterior Sampling for Reinforcement Learning with Few
Samples and Tight Guarantees
- Authors: Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines,
Remi Munos, Alexey Naumov, Mark Rowland, Michal Valko, Pierre Menard
- Abstract summary: We propose an optimistic posterior sampling algorithm for reinforcement learning (OPSRL).
We guarantee a high-probability regret bound of order at most $\widetilde{\mathcal{O}}(\sqrt{H^3SAT})$ ignoring $\text{poly}\log(HSAT)$ terms.
Our bound matches the lower bound of order $\Omega(\sqrt{H^3SAT})$, thereby answering the open problems raised by Agrawal and Jia.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider reinforcement learning in an environment modeled by an episodic,
finite, stage-dependent Markov decision process of horizon $H$ with $S$ states,
and $A$ actions. The performance of an agent is measured by the regret after
interacting with the environment for $T$ episodes. We propose an optimistic
posterior sampling algorithm for reinforcement learning (OPSRL), a simple
variant of posterior sampling that only needs a number of posterior samples
logarithmic in $H$, $S$, $A$, and $T$ per state-action pair. For OPSRL we
guarantee a high-probability regret bound of order at most
$\widetilde{\mathcal{O}}(\sqrt{H^3SAT})$ ignoring $\text{poly}\log(HSAT)$
terms. The key novel technical ingredient is a new sharp anti-concentration
inequality for linear forms which may be of independent interest. Specifically,
we extend the normal approximation-based lower bound for Beta distributions by
Alfers and Dinges [1984] to Dirichlet distributions. Our bound matches the
lower bound of order $\Omega(\sqrt{H^3SAT})$, thereby answering the open
problems raised by Agrawal and Jia [2017b] for the episodic setting.
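To make the sampling scheme concrete, here is a minimal sketch of how an optimistic posterior-sampling loop of this kind might look for a tabular, stage-dependent MDP: each episode draws a small number of transition models from a per-(h, s, a) Dirichlet posterior and keeps, for every state-action pair, the most optimistic Q-value among them. This is not the authors' OPSRL pseudocode; the `env.reset()`/`env.step()` interface over integer states, the empirical-mean reward estimate, the uniform prior, and the clipping of Q-values to the remaining horizon are illustrative assumptions made for this sketch.

```python
import numpy as np

def optimistic_posterior_sampling(env, S, A, H, T, n_samples, seed=0):
    """Hypothetical optimistic posterior-sampling loop (illustration only)."""
    rng = np.random.default_rng(seed)
    counts = np.ones((H, S, A, S))      # Dirichlet(1, ..., 1) prior over transitions
    reward_sum = np.zeros((H, S, A))    # running sums for empirical mean rewards
    visits = np.zeros((H, S, A))

    for _ in range(T):
        # Backward induction: at each stage keep, per (s, a), the largest
        # Q-value over n_samples transition models drawn from the posterior.
        q_opt = np.zeros((H + 1, S, A))
        for h in reversed(range(H)):
            v_next = q_opt[h + 1].max(axis=1)                 # optimistic V_{h+1}
            r_hat = reward_sum[h] / np.maximum(visits[h], 1)  # empirical rewards
            best = np.full((S, A), -np.inf)
            for _ in range(n_samples):
                g = rng.gamma(counts[h])                      # (S, A, S) Gamma draws
                p = g / g.sum(axis=-1, keepdims=True)         # Dirichlet sample per (s, a)
                best = np.maximum(best, r_hat + p @ v_next)   # optimistic backup
            q_opt[h] = np.minimum(best, H - h)                # assumes rewards in [0, 1]

        # Run one episode greedily w.r.t. the optimistic Q-values, update posteriors.
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(q_opt[h, s]))
            s_next, r, _ = env.step(a)
            counts[h, s, a, s_next] += 1
            reward_sum[h, s, a] += r
            visits[h, s, a] += 1
            s = s_next
    return q_opt
```

In this sketch, `n_samples` would be set to a value logarithmic in $H$, $S$, $A$, and $T$, for example `int(np.ceil(np.log(H * S * A * T)))`, mirroring the per-state-action sample budget described in the abstract.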
Related papers
- Fast Rates for Bandit PAC Multiclass Classification [73.17969992976501]
We study multiclass PAC learning with bandit feedback, where inputs are classified into one of $K$ possible labels and feedback is limited to whether or not the predicted labels are correct.
Our main contribution is in designing a novel learning algorithm for the agnostic $(\varepsilon,\delta)$-PAC version of the problem.
arXiv Detail & Related papers (2024-06-18T08:54:04Z) - Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost.
We extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z) - The Sample Complexity of Approximate Rejection Sampling with
Applications to Smoothed Online Learning [29.44582058149344]
We show that the optimal total variation distance as a function of $n$ is given by $\tilde{\Theta}(\frac{D}{f'(n)})$ over the class of all pairs $\nu,\mu$ with a bounded $f$-divergence.
We then consider an application in the seemingly very different field of smoothed online learning.
arXiv Detail & Related papers (2023-02-09T14:20:14Z) - Bridging Distributional and Risk-sensitive Reinforcement Learning with
Provable Regret Bounds [24.571530193140916]
We consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of return.
We propose two novel DRL algorithms that implement optimism through two different schemes, namely a model-free one and a model-based one.
We prove that they both attain a $\tilde{\mathcal{O}}(\frac{\exp(|\beta| H)-1}{|\beta| H}H\sqrt{S^2AK})$ regret upper bound, where $S$, $A$, $K$, and $H$ represent the number of states, actions, episodes, and the horizon, respectively.
arXiv Detail & Related papers (2022-10-25T14:30:48Z) - Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - Minimax Regret for Stochastic Shortest Path [63.45407095296692]
We study the Stochastic Shortest Path (SSP) problem, in which an agent has to reach a goal state with minimum total expected cost.
We show that the minimax regret for this setting is $\widetilde{O}(B_\star \sqrt{|S||A|K})$ where $B_\star$ is a bound on the expected cost of the optimal policy from any state.
Our algorithm runs in polynomial time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
arXiv Detail & Related papers (2021-03-24T10:11:49Z)