Variational Bayesian Optimistic Sampling
- URL: http://arxiv.org/abs/2110.15688v1
- Date: Fri, 29 Oct 2021 11:28:29 GMT
- Title: Variational Bayesian Optimistic Sampling
- Authors: Brendan O'Donoghue and Tor Lattimore
- Abstract summary: We consider online sequential decision problems where an agent must balance exploration and exploitation.
We derive a set of Bayesian `optimistic' policies which, in the multi-armed bandit case, includes the Thompson sampling policy.
We show that any algorithm producing policies in the optimistic set enjoys $\tilde O(\sqrt{AT})$ Bayesian regret for a problem with $A$ actions after $T$ rounds.
- Score: 43.130465039780084
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider online sequential decision problems where an agent must balance
exploration and exploitation. We derive a set of Bayesian `optimistic' policies
which, in the stochastic multi-armed bandit case, includes the Thompson
sampling policy. We provide a new analysis showing that any algorithm producing
policies in the optimistic set enjoys $\tilde O(\sqrt{AT})$ Bayesian regret for
a problem with $A$ actions after $T$ rounds. We extend the regret analysis for
optimistic policies to bilinear saddle-point problems which include zero-sum
matrix games and constrained bandits as special cases. In this case we show
that Thompson sampling can produce policies outside of the optimistic set and
suffer linear regret in some instances. Finding a policy inside the optimistic
set amounts to solving a convex optimization problem and we call the resulting
algorithm `variational Bayesian optimistic sampling' (VBOS). The procedure
works for any posteriors, i.e., it does not require the posterior to have any
special properties, such as log-concavity, unimodality, or smoothness. The
variational view of the problem has many useful properties, including the
ability to tune the exploration-exploitation tradeoff, add regularization,
incorporate constraints, and linearly parameterize the policy.
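Since the abstract states that, in the multi-armed bandit case, the Thompson sampling policy belongs to the optimistic set, a small sketch of Thompson sampling for a Gaussian bandit may help make the setting concrete. This is a minimal illustration, not the paper's VBOS procedure; the prior, noise model, horizon, and Monte Carlo policy estimate are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem: A arms with Gaussian rewards of unit variance (assumed, not from the paper).
A, T = 5, 2000
true_means = rng.normal(0.0, 1.0, size=A)

# Gaussian posterior over each arm's mean: N(mu[a], 1/prec[a]), starting from a N(0, 1) prior.
mu = np.zeros(A)
prec = np.ones(A)

regret = 0.0
for t in range(T):
    # Thompson sampling: draw a mean vector from the posterior and play its argmax.
    theta = rng.normal(mu, 1.0 / np.sqrt(prec))
    a = int(np.argmax(theta))

    r = true_means[a] + rng.normal()            # observe a noisy reward
    regret += true_means.max() - true_means[a]

    # Conjugate Gaussian update for the played arm (known unit noise variance).
    prec[a] += 1.0
    mu[a] += (r - mu[a]) / prec[a]

# The Thompson sampling *policy* at the final posterior assigns each arm its posterior
# probability of being optimal; estimate it by Monte Carlo.
samples = rng.normal(mu, 1.0 / np.sqrt(prec), size=(10000, A))
ts_policy = np.bincount(samples.argmax(axis=1), minlength=A) / 10000.0

print("cumulative regret on this sampled instance:", round(regret, 2))
print("Thompson sampling policy (posterior prob. each arm is optimal):", ts_policy.round(3))
```

Per the abstract, any algorithm whose per-round policy stays inside the optimistic set enjoys $\tilde O(\sqrt{AT})$ Bayesian regret in this setting; VBOS selects such a policy by solving a convex optimization problem over the posterior rather than by sampling, which is what makes regularization, constraints, and linear parameterization easy to add.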
Related papers
- Thompson Exploration with Best Challenger Rule in Best Arm
Identification [66.33448474838342]
We study the fixed-confidence best arm identification problem in the bandit framework.
We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
arXiv Detail & Related papers (2023-10-01T01:37:02Z) - Low-Switching Policy Gradient with Exploration via Online Sensitivity
Sampling [23.989009116398208]
We design a low-switching sample-efficient policy optimization algorithm, LPO, with general non-linear function approximation.
We show that our algorithm obtains an $\varepsilon$-optimal policy with only $\widetilde{O}\!\left(\frac{\mathrm{poly}(d)}{\varepsilon^{3}}\right)$ samples.
arXiv Detail & Related papers (2023-06-15T23:51:46Z) - Autoregressive Bandits [58.46584210388307]
We propose a novel online learning setting, Autoregressive Bandits, in which the observed reward is governed by an autoregressive process of order $k$.
We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed.
We then devise a new optimistic regret minimization algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $\widetilde{\mathcal{O}}\!\left(\frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^{2}}\right)$.
arXiv Detail & Related papers (2022-12-12T21:37:36Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Finite-Sample Analysis of Off-Policy Natural Actor-Critic Algorithm [4.932130498861987]
We provide finite-sample convergence guarantees for an off-policy variant of the natural actor-critic (NAC) algorithm based on Importance Sampling.
We show that the algorithm converges to a global optimal policy with a sample complexity of $\mathcal{O}(\epsilon^{-3}\log^{2}(1/\epsilon))$ under an appropriate choice of stepsizes.
arXiv Detail & Related papers (2021-02-18T13:22:59Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Optimistic Policy Optimization with Bandit Feedback [70.75568142146493]
We propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $\tilde O(\sqrt{S^2 A H^4 K})$ regret for stochastic rewards.
To the best of our knowledge, the two results are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.
arXiv Detail & Related papers (2020-02-19T15:41:18Z) - Differentiable Bandit Exploration [38.81737411000074]
We learn such policies for an unknown distribution $\mathcal{P}$ using samples from $\mathcal{P}$.
Our approach is a form of meta-learning and exploits properties of $\mathcal{P}$ without making strong assumptions about its form.
arXiv Detail & Related papers (2020-02-17T05:07:35Z)
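The last entry above describes learning exploration policies by gradient methods from samples of an unknown problem distribution $\mathcal{P}$. As a rough, hypothetical illustration of that general idea (not the paper's actual algorithm), the sketch below tunes the temperature of a softmax bandit policy with score-function (REINFORCE-style) gradient estimates over Bernoulli instances drawn from an assumed Beta prior; every distribution and hyperparameter here is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
A, T = 5, 200          # arms and rounds per sampled bandit instance (illustrative)
n_iters, batch = 300, 32
lr = 0.05
log_temp = 0.0         # learn the log-temperature of a softmax-over-empirical-means policy

def run_episode(temp, means):
    """Play T rounds of the softmax policy on one Bernoulli bandit.
    Returns total reward and d(sum of log pi(a_t))/d(log_temp) for the score-function estimator."""
    counts = np.ones(A)                # pseudo-count prior on each arm
    sums = np.full(A, 0.5)
    total, score = 0.0, 0.0
    for _ in range(T):
        m = sums / counts                          # empirical means
        logits = m / temp
        p = np.exp(logits - logits.max())
        p /= p.sum()
        a = rng.choice(A, p=p)
        r = float(rng.random() < means[a])
        total += r
        # d log pi(a) / d log(temp) for a softmax over m / temp:
        score += (p @ m - m[a]) / temp
        counts[a] += 1.0
        sums[a] += r
    return total, score

for _ in range(n_iters):
    temp = float(np.exp(log_temp))
    returns, scores = [], []
    for _ in range(batch):
        means = rng.beta(1.0, 1.0, size=A)         # an instance sampled from the assumed prior
        ret, sc = run_episode(temp, means)
        returns.append(ret)
        scores.append(sc)
    returns, scores = np.array(returns), np.array(scores)
    grad = np.mean((returns - returns.mean()) * scores)   # baselined REINFORCE gradient
    log_temp += lr * grad                                  # ascend expected reward over instances

print("learned softmax temperature:", round(float(np.exp(log_temp)), 3))
```

The design point is the same one the entry gestures at: because the policy is differentiable in its exploration parameter, the exploration-exploitation trade-off can be tuned directly against the distribution of problems the agent expects to face.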