Finite-Time Regret of Thompson Sampling Algorithms for Exponential
Family Multi-Armed Bandits
- URL: http://arxiv.org/abs/2206.03520v1
- Date: Tue, 7 Jun 2022 18:08:21 GMT
- Title: Finite-Time Regret of Thompson Sampling Algorithms for Exponential
Family Multi-Armed Bandits
- Authors: Tianyuan Jin, Pan Xu, Xiaokui Xiao, Anima Anandkumar
- Abstract summary: We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution is from a one-dimensional exponential family.
We propose a Thompson sampling algorithm, termed ExpTS, which uses a novel sampling distribution to avoid the under-estimation of the optimal arm.
- Score: 88.21288104408556
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the regret of Thompson sampling (TS) algorithms for exponential
family bandits, where the reward distribution is from a one-dimensional
exponential family, which covers many common reward distributions including
Bernoulli, Gaussian, Gamma, Exponential, etc. We propose a Thompson sampling
algorithm, termed ExpTS, which uses a novel sampling distribution to avoid the
under-estimation of the optimal arm. We provide a tight regret analysis for
ExpTS, which simultaneously yields both the finite-time regret bound as well as
the asymptotic regret bound. In particular, for a $K$-armed bandit with
exponential family rewards, ExpTS over a horizon $T$ is sub-UCB (a strong
criterion for the finite-time regret that is problem-dependent), minimax
optimal up to a factor $\sqrt{\log K}$, and asymptotically optimal, for
exponential family rewards. Moreover, we propose ExpTS$^+$, by adding a greedy
exploitation step in addition to the sampling distribution used in ExpTS, to
avoid the over-estimation of sub-optimal arms. ExpTS$^+$ is an anytime bandit
algorithm and achieves the minimax optimality and asymptotic optimality
simultaneously for exponential family reward distributions. Our proof
techniques are general and conceptually simple and can be easily applied to
analyze standard Thompson sampling with specific reward distributions.
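
To make the structure of the proposed algorithms concrete, the following is a minimal Python sketch of a Thompson-sampling bandit loop with an added greedy exploitation step, in the spirit of ExpTS$^+$. It is a sketch under stated assumptions, not the paper's method: the paper's sampling distribution for ExpTS is replaced by a placeholder Gaussian perturbation of the empirical mean, rewards are assumed Gaussian, and the $1/K$ exploration probability and the name `expts_plus_sketch` are choices made here purely for illustration.

```python
import numpy as np

def expts_plus_sketch(means, T, sigma=1.0, seed=0):
    """Illustrative Thompson-sampling loop with a greedy exploitation step
    (ExpTS^+-style).  The sampling distribution proposed in the paper is NOT
    reproduced here; a Gaussian perturbation of the empirical mean is used as
    a placeholder, and rewards are assumed Gaussian."""
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.zeros(K, dtype=int)   # number of pulls per arm
    sums = np.zeros(K)                # sum of observed rewards per arm
    best = max(means)
    regret = 0.0

    # Pull every arm once so the empirical means are well defined.
    for a in range(K):
        reward = rng.normal(means[a], sigma)
        counts[a] += 1
        sums[a] += reward
        regret += best - means[a]

    for t in range(K, T):
        mu_hat = sums / counts
        if rng.random() < 1.0 / K:
            # Randomized (Thompson-style) index: empirical mean plus noise
            # whose scale shrinks with the number of pulls.
            index = rng.normal(mu_hat, sigma / np.sqrt(counts))
        else:
            # Greedy exploitation step: act on the empirical means directly,
            # which limits over-estimation of sub-optimal arms.
            index = mu_hat
        a = int(np.argmax(index))
        reward = rng.normal(means[a], sigma)
        counts[a] += 1
        sums[a] += reward
        regret += best - means[a]

    return regret
```

Calling `expts_plus_sketch([0.1, 0.2, 0.5], T=10000)` returns the cumulative pseudo-regret of one run; the greedy branch is the extra step ExpTS$^+$ adds on top of ExpTS to curb over-estimation of sub-optimal arms.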
Related papers
- Finite-Time Frequentist Regret Bounds of Multi-Agent Thompson Sampling
on Sparse Hypergraphs [11.024467775280193]
We study the multi-agent multi-armed bandit (MAMAB) problem, where $m$ agents are factored into $\rho$ overlapping groups.
We propose an efficient variant of MATS, the $\epsilon$-exploring Multi-Agent Thompson Sampling ($\epsilon$-MATS) algorithm.
We prove that $\epsilon$-MATS achieves a worst-case frequentist regret bound that is sublinear in both the time horizon and the local arm size.
arXiv Detail & Related papers (2023-12-24T21:41:01Z) - Thompson Exploration with Best Challenger Rule in Best Arm
Identification [66.33448474838342]
We study the fixed-confidence best arm identification problem in the bandit framework.
We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
arXiv Detail & Related papers (2023-10-01T01:37:02Z) - A General Recipe for the Analysis of Randomized Multi-Armed Bandit
Algorithms [16.114012813668932]
We revisit two famous bandit algorithms, Minimum Empirical Divergence (MED) and Thompson Sampling (TS).
We prove that MED is optimal for all these models, but also provide a simple regret analysis of some TS algorithms for which the optimality is already known.
arXiv Detail & Related papers (2023-03-10T16:43:48Z) - Langevin Monte Carlo for Contextual Bandits [72.00524614312002]
Langevin Monte Carlo Thompson Sampling (LMC-TS) is proposed to directly sample from the posterior distribution in contextual bandits.
We prove that the proposed algorithm achieves the same sublinear regret bound as the best Thompson sampling algorithms for a special case of contextual bandits.
arXiv Detail & Related papers (2022-06-22T17:58:23Z) - From Optimality to Robustness: Dirichlet Sampling Strategies in
Stochastic Bandits [0.0]
We study a generic Dirichlet Sampling (DS) algorithm, based on pairwise comparisons of empirical indices computed with re-sampling of the arms' observations.
We show that different variants of this strategy achieve provably optimal regret guarantees when the distributions are bounded and logarithmic regret for semi-bounded distributions with a mild quantile condition.
arXiv Detail & Related papers (2021-11-18T14:34:21Z) - A Unifying Theory of Thompson Sampling for Continuous Risk-Averse
Bandits [91.3755431537592]
This paper unifies the analysis of risk-averse Thompson sampling algorithms for the multi-armed bandit problem.
Using the contraction principle in the theory of large deviations, we prove novel concentration bounds for continuous risk functionals.
We show that a wide class of risk functionals as well as "nice" functions of them satisfy the continuity condition.
arXiv Detail & Related papers (2021-08-25T17:09:01Z) - Thompson Sampling for Unimodal Bandits [21.514495320038712]
We propose a Thompson Sampling algorithm for unimodal bandits, where the expected reward is unimodal over the partially ordered arms.
For Gaussian rewards, the regret of our algorithm is $\mathcal{O}(\log T)$, which is far better than standard Thompson Sampling algorithms.
arXiv Detail & Related papers (2021-06-15T14:40:34Z) - MOTS: Minimax Optimal Thompson Sampling [89.2370817955411]
It has remained an open problem whether Thompson sampling can match the minimax lower bound $\Omega(\sqrt{KT})$ for $K$-armed bandit problems.
We propose a variant of Thompson sampling called MOTS that adaptively clips the sampling instance of the chosen arm at each time step.
We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound $O(\sqrt{KT})$ for finite time horizon $T$, as well as the optimal regret bound for Gaussian rewards when $T$ approaches infinity (a minimal sketch of this clipping idea appears after this list).
arXiv Detail & Related papers (2020-03-03T21:24:39Z) - A General Theory of the Stochastic Linear Bandit and Its Applications [8.071506311915398]
We introduce a general analysis framework and a family of algorithms for the linear bandit problem.
Our new notion of optimism in expectation gives rise to a new algorithm, called sieved greedy (SG) that reduces the overexploration problem in OFUL.
In addition to proving that SG is theoretically rate optimal, our empirical simulations show that SG outperforms existing benchmarks such as greedy, OFUL, and TS.
arXiv Detail & Related papers (2020-02-12T18:54:41Z) - Thompson Sampling Algorithms for Mean-Variance Bandits [97.43678751629189]
We develop Thompson Sampling-style algorithms for mean-variance MAB.
We also provide comprehensive regret analyses for Gaussian and Bernoulli bandits.
Our algorithms significantly outperform existing LCB-based algorithms for all risk tolerances.
arXiv Detail & Related papers (2020-02-01T15:33:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.