Related papers: Thompson Sampling for Unimodal Bandits

Thompson Sampling for Unimodal Bandits

URL: http://arxiv.org/abs/2106.08187v2
Date: Wed, 16 Jun 2021 04:51:04 GMT
Title: Thompson Sampling for Unimodal Bandits
Authors: Long Yang, Zhao Li, Zehong Hu, Shasha Ruan, Shijian Li, Gang Pan, Hongyang Chen
Abstract summary: We propose a Thompson Sampling algorithm for emphunimodal bandits, where the expected reward is unimodal over the partially ordered arms. For Gaussian rewards, the regret of our algorithm is $mathcalO(log T)$, which is far better than standard Thompson Sampling algorithms.
Score: 21.514495320038712
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we propose a Thompson Sampling algorithm for \emph{unimodal} bandits, where the expected reward is unimodal over the partially ordered arms. To exploit the unimodal structure better, at each step, instead of exploration from the entire decision space, our algorithm makes decision according to posterior distribution only in the neighborhood of the arm that has the highest empirical mean estimate. We theoretically prove that, for Bernoulli rewards, the regret of our algorithm reaches the lower bound of unimodal bandits, thus it is asymptotically optimal. For Gaussian rewards, the regret of our algorithm is $\mathcal{O}(\log T)$, which is far better than standard Thompson Sampling algorithms. Extensive experiments demonstrate the effectiveness of the proposed algorithm on both synthetic data sets and the real-world applications.

Related papers

Langevin Monte Carlo for Contextual Bandits [72.00524614312002]
Langevin Monte Carlo Thompson Sampling (LMC-TS) is proposed to directly sample from the posterior distribution in contextual bandits. We prove that the proposed algorithm achieves the same sublinear regret bound as the best Thompson sampling algorithms for a special case of contextual bandits.
arXiv Detail & Related papers (2022-06-22T17:58:23Z)
Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits [88.21288104408556]
We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution is from a one-dimensional exponential family. We propose a Thompson sampling, termed Expulli, which uses a novel sampling distribution to avoid the under-estimation of the optimal arm.
arXiv Detail & Related papers (2022-06-07T18:08:21Z)
Mean-based Best Arm Identification in Stochastic Bandits under Reward Contamination [80.53485617514707]
This paper proposes two algorithms, a gap-based algorithm and one based on the successive elimination, for best arm identification in sub-Gaussian bandits. Specifically, for the gap-based algorithm, the sample complexity is optimal up to constant factors, while for the successive elimination, it is optimal up to logarithmic factors.
arXiv Detail & Related papers (2021-11-14T21:49:58Z)
Maillard Sampling: Boltzmann Exploration Done Optimally [11.282341369957216]
This thesis presents a randomized algorithm for the $K$-armed bandit problem. Maillard sampling (MS) computes the probability of choosing each arm in a closed form. We propose a variant of MS called MS$+$ that improves its minimax bound to $sqrtKTlogK$ without losing the optimality.
arXiv Detail & Related papers (2021-11-05T06:50:22Z)
Neural Thompson Sampling [94.82847209157494]
We propose a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation. At the core of our algorithm is a novel posterior distribution of the reward, where its mean is the neural network approximator, and its variance is built upon the neural tangent features of the corresponding neural network.
arXiv Detail & Related papers (2020-10-02T07:44:09Z)
Corralling Stochastic Bandit Algorithms [54.10645564702416]
We show that the regret of the corralling algorithms is no worse than that of the best algorithm containing the arm with the highest reward. We show that the gap between the highest reward and other rewards depends on the gap between the highest reward and other rewards.
arXiv Detail & Related papers (2020-06-16T15:33:12Z)
MOTS: Minimax Optimal Thompson Sampling [89.2370817955411]
It has remained an open problem whether Thompson sampling can match the minimax lower bound $Omega(sqrtKT)$ for $K$-armed bandit problems. We propose a variant of Thompson sampling called MOTS that adaptively clips the sampling instance of the chosen arm at each time step. We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound $O(sqrtKT)$ for finite time horizon $T$, as well as the optimal regret bound for Gaussian rewards when $T$ approaches infinity.
arXiv Detail & Related papers (2020-03-03T21:24:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.