Maillard Sampling: Boltzmann Exploration Done Optimally
- URL: http://arxiv.org/abs/2111.03290v1
- Date: Fri, 5 Nov 2021 06:50:22 GMT
- Title: Maillard Sampling: Boltzmann Exploration Done Optimally
- Authors: Jie Bian, Kwang-Sung Jun
- Abstract summary: The PhD thesis of Maillard (2013) presents a randomized algorithm for the $K$-armed bandit problem.
Maillard sampling (MS) computes the probability of choosing each arm in a closed form.
We propose a variant of MS called MS$^+$ that improves its minimax bound to $\sqrt{KT\log{K}}$ without losing asymptotic optimality.
- Score: 11.282341369957216
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The PhD thesis of Maillard (2013) presents a randomized algorithm for the
$K$-armed bandit problem. This less-known algorithm, which we call Maillard
sampling (MS), computes the probability of choosing each arm in a closed form,
which is useful for counterfactual evaluation from bandit-logged data but was
lacking from Thompson sampling, a widely-adopted bandit algorithm in the
industry. Motivated by such merit, we revisit MS and perform an improved
analysis to show that it achieves both asymptotic optimality and a
$\sqrt{KT\log{T}}$ minimax regret bound, where $T$ is the time horizon, which
matches the standard asymptotically optimal UCB's performance. We then propose
a variant of MS called MS$^+$ that improves its minimax bound to
$\sqrt{KT\log{K}}$ without losing the asymptotic optimality. MS$^+$ can also be
tuned to be aggressive (i.e., less exploration) without losing theoretical
guarantees, a unique feature unavailable from existing bandit algorithms. Our
numerical evaluation shows the effectiveness of MS$^+$.
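The practical appeal the abstract highlights is that MS gives the action probabilities in closed form, so the propensity of every logged action is known exactly and counterfactual (off-policy) value estimates can be computed by inverse-propensity scoring. The snippet below is a minimal Python sketch of that workflow on a toy Gaussian bandit; the exp(-N_a * gap_a^2 / 2) weighting is the Maillard-sampling rule for 1-sub-Gaussian rewards as we understand it, and the reward model, function names, and IPS estimator are illustrative assumptions rather than the paper's exact MS$^+$ algorithm or experimental setup.

```python
import numpy as np

def maillard_probs(counts, means):
    """Closed-form arm probabilities. The exp(-N_a * gap_a^2 / 2) weighting is
    the Maillard-sampling rule for 1-sub-Gaussian rewards as we understand it;
    check the paper for the exact form and the MS^+ refinements."""
    gaps = means.max() - means                     # empirical sub-optimality gaps
    weights = np.exp(-counts * gaps ** 2 / 2.0)    # Boltzmann-style weights
    return weights / weights.sum()

def run_ms(true_means, T, seed=0):
    """Run MS on a toy Gaussian bandit, logging (arm, reward, propensity)."""
    rng = np.random.default_rng(seed)
    K = len(true_means)
    counts, sums, log = np.zeros(K), np.zeros(K), []
    for t in range(T):
        if t < K:
            probs = np.eye(K)[t]                   # pull each arm once to initialize
        else:
            probs = maillard_probs(counts, sums / counts)
        arm = rng.choice(K, p=probs)
        reward = rng.normal(true_means[arm], 1.0)  # unit-variance rewards (illustrative)
        counts[arm] += 1
        sums[arm] += reward
        log.append((arm, reward, probs[arm]))      # propensity is known exactly
    return log

def ips_value(log, target_probs):
    """Inverse-propensity estimate of a target policy's mean reward from MS logs."""
    return float(np.mean([r * target_probs[a] / p for a, r, p in log]))

log = run_ms(true_means=[0.0, 0.2, 0.5], T=2000)
print(ips_value(log, target_probs=np.array([0.0, 0.0, 1.0])))  # value of always playing arm 2
```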
Related papers
- Minimum Empirical Divergence for Sub-Gaussian Linear Bandits [10.750348548547704]
LinMED is a randomized algorithm that admits a closed-form computation of the arm sampling probabilities.
Our empirical study shows that LinMED is competitive with state-of-the-art algorithms.
arXiv Detail & Related papers (2024-10-31T21:54:44Z)
- Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z)
- Gaussian Process Bandit Optimization with Few Batches [49.896920704012395]
We introduce a batch algorithm inspired by finite-arm bandit algorithms.
We show that it achieves the cumulative regret upper bound $O^\ast(\sqrt{T\gamma_T})$ using $O(\log\log T)$ batches within time horizon $T$.
In addition, we propose a modified version of our algorithm, and characterize how the regret is impacted by the number of batches.
arXiv Detail & Related papers (2021-10-15T00:54:04Z)
- Thompson Sampling for Unimodal Bandits [21.514495320038712]
We propose a Thompson Sampling algorithm for unimodal bandits, where the expected reward is unimodal over the partially ordered arms.
For Gaussian rewards, the regret of our algorithm is $\mathcal{O}(\log T)$, which is far better than standard Thompson Sampling algorithms.
arXiv Detail & Related papers (2021-06-15T14:40:34Z)
- Multinomial Logit Contextual Bandits: Provable Optimality and Practicality [15.533842336139063]
We consider a sequential assortment selection problem where the user choice is given by a multinomial logit (MNL) choice model whose parameters are unknown.
We propose upper confidence bound based algorithms for this MNL contextual bandit.
We show that a simple variant of the algorithm achieves the optimal regret for a broad class of important applications.
arXiv Detail & Related papers (2021-03-25T15:42:25Z)
- Provably Breaking the Quadratic Error Compounding Barrier in Imitation Learning, Optimally [58.463668865380946]
We study the statistical limits of Imitation Learning in episodic Markov Decision Processes (MDPs) with a state space $\mathcal{S}$.
We establish an upper bound $O(|\mathcal{S}|H^{3/2}/N)$ for the suboptimality using the Mimic-MD algorithm in Rajaraman et al. (2020).
We show the minimax suboptimality grows as $\Omega(H^{3/2}/N)$ when $|\mathcal{S}|\geq 3$, while the unknown-transition setting suffers from a larger sharp rate.
arXiv Detail & Related papers (2021-02-25T15:50:19Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
- MOTS: Minimax Optimal Thompson Sampling [89.2370817955411]
It has remained an open problem whether Thompson sampling can match the minimax lower bound $\Omega(\sqrt{KT})$ for $K$-armed bandit problems.
We propose a variant of Thompson sampling called MOTS that adaptively clips the sampling instance of the chosen arm at each time step.
We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound $O(\sqrt{KT})$ for finite time horizon $T$, as well as the optimal regret bound for Gaussian rewards when $T$ approaches infinity (a rough sketch of the clipping idea appears after this list).
arXiv Detail & Related papers (2020-03-03T21:24:39Z)
- Model Selection in Contextual Stochastic Bandit Problems [51.94632035240787]
We develop a meta-algorithm that selects between base algorithms.
We show through a lower bound that even when one of the base algorithms has $O(\sqrt{T})$ regret, in general it is impossible to get better than $\Omega(\sqrt{T})$ regret.
arXiv Detail & Related papers (2020-03-03T18:46:34Z)
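As referenced in the MOTS entry above, that summary only sketches the mechanism: Thompson samples are capped so that no arm's sample can drift far above an upper-confidence-style threshold. The snippet below is a rough Python sketch of that clipping idea; the Gaussian posterior scaling and the cap formula are illustrative assumptions, not the exact MOTS schedule from the paper.

```python
import numpy as np

def clipped_ts_step(counts, sums, T, rng, alpha=4.0):
    """One round of clipped Thompson sampling (a sketch of the MOTS idea):
    draw a Gaussian sample per arm, then cap it at a UCB-style threshold.
    The cap below is an illustrative assumption, not the paper's exact formula."""
    K = len(counts)
    n = np.maximum(counts, 1.0)                      # avoid division by zero before initialization
    means = sums / n
    samples = rng.normal(means, 1.0 / np.sqrt(n))    # Thompson samples from a Gaussian posterior
    caps = means + np.sqrt(alpha / n * np.maximum(0.0, np.log(T / (K * n))))
    return int(np.argmax(np.minimum(samples, caps))) # play the arm with the largest clipped sample
```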
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.