Asymptotic Performance of Thompson Sampling in the Batched Multi-Armed
Bandits
- URL: http://arxiv.org/abs/2110.00158v1
- Date: Fri, 1 Oct 2021 01:28:40 GMT
- Title: Asymptotic Performance of Thompson Sampling in the Batched Multi-Armed
Bandits
- Authors: Cem Kalkanli and Ayfer Ozgur
- Abstract summary: We show that Thompson sampling can maintain its performance even if it receives delayed feedback in batches.
We propose an adaptive scheme that reduces the number of batches to $\Theta(\log T)$ while maintaining the same performance.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the asymptotic performance of the Thompson sampling algorithm in the
batched multi-armed bandit setting where the time horizon $T$ is divided into
batches, and the agent is not able to observe the rewards of her actions until
the end of each batch. We show that in this batched setting, Thompson sampling
achieves the same asymptotic performance as in the case where instantaneous
feedback is available after each action, provided that the batch sizes increase
subexponentially. This result implies that Thompson sampling can maintain its
performance even if it receives delayed feedback in $\omega(\log T)$ batches.
We further propose an adaptive batching scheme that reduces the number of
batches to $\Theta(\log T)$ while maintaining the same performance. Although
the batched multi-armed bandit setting has been considered in several recent
works, previous results rely on tailored algorithms for the batched setting,
which optimize the batch structure and prioritize exploration in the beginning
of the experiment to eliminate suboptimal actions. We show that Thompson
sampling, on the other hand, is able to achieve a similar asymptotic
performance in the batched setting without any modifications.
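To make the setting concrete, below is a minimal sketch of Thompson sampling with batched feedback, assuming Bernoulli rewards with Beta priors (the paper's analysis is more general) and a quadratically growing, hence subexponential, batch schedule; all names and constants are illustrative, not the paper's own code.

```python
import numpy as np

rng = np.random.default_rng(0)

def batched_thompson(means, T, batch_size=lambda k: (k + 1) ** 2):
    """Thompson sampling with delayed, batched feedback.

    Posteriors are Beta(a, b) and are updated only when a batch ends,
    so every action inside a batch is drawn from the same (stale)
    posterior, as in the batched setting above.
    """
    K = len(means)
    a = np.ones(K)                             # Beta "success" parameters
    b = np.ones(K)                             # Beta "failure" parameters
    t, k, regret, best = 0, 0, 0.0, max(means)
    while t < T:
        n = min(batch_size(k), T - t)          # length of batch k
        theta = rng.beta(a[:, None], b[:, None], size=(K, n))
        arms = theta.argmax(axis=0)            # one fresh sample per action
        rewards = rng.random(n) < np.take(means, arms)
        np.add.at(a, arms, rewards)            # feedback arrives only now,
        np.add.at(b, arms, ~rewards)           # at the end of the batch
        regret += (best - np.take(means, arms)).sum()
        t += n
        k += 1
    return regret

print(batched_thompson([0.3, 0.5, 0.7], T=20_000))
```

With batch sizes $(k+1)^2$ the horizon is covered in $\Theta(T^{1/3})$ batches; the paper's adaptive scheme (not sketched here) reduces the count to $\Theta(\log T)$.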
Related papers
- Finite-Time Frequentist Regret Bounds of Multi-Agent Thompson Sampling
on Sparse Hypergraphs [11.024467775280193]
We study the multi-agent multi-armed bandit (MAMAB) problem, where $m$ agents are factored into $\rho$ overlapping groups.
We propose an efficient variant of MATS, the $\epsilon$-exploring Multi-Agent Thompson Sampling ($\epsilon$-MATS) algorithm.
We prove that $\epsilon$-MATS achieves a worst-case frequentist regret bound that is sublinear in both the time horizon and the local arm size.
arXiv Detail & Related papers (2023-12-24T21:41:01Z) - Provably Efficient High-Dimensional Bandit Learning with Batched
Feedbacks [93.00280593719513]
We study high-dimensional multi-armed contextual bandits with batched feedback where the $T$ steps of online interactions are divided into $L$ batches.
Specifically, each batch collects data according to a policy that depends on previous batches, and the rewards are revealed only at the end of the batch.
Our algorithm achieves regret bounds comparable to those in the fully sequential setting with only $\mathcal{O}(\log T)$ batches.
arXiv Detail & Related papers (2023-11-22T06:06:54Z) - Finite-Time Regret of Thompson Sampling Algorithms for Exponential
Family Multi-Armed Bandits [88.21288104408556]
We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution is from a one-dimensional exponential family.
We propose a Thompson sampling algorithm, termed Expulli, which uses a novel sampling distribution to avoid under-estimating the optimal arm.
arXiv Detail & Related papers (2022-06-07T18:08:21Z) - Batched Thompson Sampling [0.0]
We introduce a novel anytime Batched Thompson sampling policy for multi-armed bandits.
We show that this policy simultaneously achieves a problem-dependent regret of order $O(\log T)$ and a minimax regret of order $O(\sqrt{T\log T})$.
arXiv Detail & Related papers (2021-10-01T04:08:45Z) - Parallelizing Thompson Sampling [38.13649699234982]
We introduce a batch Thompson Sampling framework for two canonical online decision making problems.
Our policy achieves the same (asymptotic) regret bound as a fully sequential one while carrying out only $O(\log T)$ batch queries.
We also demonstrate experimentally that dynamic batch allocation dramatically outperforms natural baselines such as static batch allocations.
arXiv Detail & Related papers (2021-06-02T18:51:57Z) - Asymptotic Convergence of Thompson Sampling [0.0]
Thompson sampling has been shown to be an effective policy across a variety of online learning tasks.
We prove a convergence result for Thompson sampling under the assumption of sub-linear Bayesian regret.
Our results rely on the martingale structure inherent in Thompson sampling.
arXiv Detail & Related papers (2020-11-08T07:36:49Z) - Analysis and Design of Thompson Sampling for Stochastic Partial
Monitoring [91.22679787578438]
We present a novel Thompson-sampling-based algorithm for partial monitoring.
We prove that the new algorithm achieves the logarithmic problem-dependent expected pseudo-regret $\mathrm{O}(\log T)$ for a linearized variant of the problem with local observability.
arXiv Detail & Related papers (2020-06-17T05:48:33Z) - An Efficient Algorithm For Generalized Linear Bandit: Online Stochastic
Gradient Descent and Thompson Sampling [83.48992319018147]
We consider the contextual bandit problem, where a player sequentially makes decisions based on past observations to maximize the cumulative reward.
A natural way to reduce the computational cost is to apply online stochastic gradient descent (SGD) so that the per-step time and memory complexity can be reduced to a constant.
In this work, we show that online SGD can be applied to the generalized linear bandit problem.
The proposed SGD-TS algorithm, which uses a single-step SGD update to exploit past information, achieves $\tilde{O}(\sqrt{T})$ regret with a total time complexity that scales linearly in the horizon $T$.
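As a rough illustration of the single-step-SGD idea (a sketch of the concept, not the authors' exact SGD-TS algorithm), the snippet below plays a logistic bandit: it perturbs the point estimate for Thompson-style exploration, then takes one $O(d)$ gradient step per round; the names, noise scale, and learning rate are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sgd_ts_sketch(X, theta_star, T, lr=0.05, noise=0.1):
    """One-step-SGD, Thompson-style player for a logistic bandit.

    Per round: perturb the current estimate (exploration), play the
    best arm under the perturbed estimate, then take a single SGD
    step on the logistic loss -- constant time and memory per step.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    theta = np.zeros(X.shape[1])
    for _ in range(T):
        theta_tilde = theta + noise * rng.standard_normal(theta.shape)
        i = int(np.argmax(X @ theta_tilde))          # greedy w.r.t. perturbed estimate
        x = X[i]
        r = float(rng.random() < sigmoid(x @ theta_star))  # Bernoulli reward
        theta -= lr * (sigmoid(x @ theta) - r) * x   # single logistic-loss SGD step
    return theta

X = rng.standard_normal((20, 5))                     # 20 arms, 5 features
theta_star = rng.standard_normal(5)
print(sgd_ts_sketch(X, theta_star, T=5_000))
```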
arXiv Detail & Related papers (2020-06-07T01:12:39Z) - MOTS: Minimax Optimal Thompson Sampling [89.2370817955411]
It has remained an open problem whether Thompson sampling can match the minimax lower bound $\Omega(\sqrt{KT})$ for $K$-armed bandit problems.
We propose a variant of Thompson sampling called MOTS that adaptively clips the sampling instance of the chosen arm at each time step.
We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound $O(\sqrt{KT})$ for finite time horizon $T$, as well as the optimal regret bound for Gaussian rewards when $T$ approaches infinity.
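A minimal sketch of the clipping idea, assuming unit-variance Gaussian rewards; the threshold below has the right shape (it shrinks as an arm's pull count grows) but its constants are illustrative, not the ones proved in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def mots_style(means, T, alpha=4.0):
    """Gaussian Thompson sampling with clipped samples (MOTS-style).

    Each posterior sample is clipped from above at an index-style
    threshold, which curbs over-exploration of well-sampled arms.
    """
    K = len(means)
    n = np.zeros(K)                # pull counts
    s = np.zeros(K)                # reward sums
    regret, best = 0.0, max(means)
    for t in range(T):
        if t < K:
            i = t                  # pull each arm once to initialize
        else:
            mu = s / n
            cap = mu + np.sqrt(alpha / n * np.log(np.maximum(T / (K * n), 1.0)))
            theta = np.minimum(rng.normal(mu, 1.0 / np.sqrt(n)), cap)
            i = int(theta.argmax())
        n[i] += 1
        s[i] += rng.normal(means[i], 1.0)
        regret += best - means[i]
    return regret

print(mots_style([0.0, 0.2, 0.5], T=20_000))
```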
arXiv Detail & Related papers (2020-03-03T21:24:39Z) - On Thompson Sampling with Langevin Algorithms [106.78254564840844]
Thompson sampling for multi-armed bandit problems enjoys favorable performance in both theory and practice.
Computationally, however, it suffers from a significant limitation: the need to draw samples from posterior distributions at every iteration.
We propose two Markov Chain Monte Carlo (MCMC) methods tailored to Thompson sampling to address this issue.
arXiv Detail & Related papers (2020-02-23T22:35:29Z)
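To illustrate the last entry's approach, here is a minimal sketch that replaces the exact posterior draw in Thompson sampling with a few steps of the unadjusted Langevin algorithm; a Gaussian model is assumed so that the log-posterior gradient is available in closed form, and the step size and step count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def langevin_draw(grad_log_post, x0, step=0.01, n_steps=50):
    """Unadjusted Langevin algorithm: noisy gradient ascent on the
    log-posterior approximates a single posterior sample."""
    x = x0
    for _ in range(n_steps):
        x = x + step * grad_log_post(x) + np.sqrt(2 * step) * rng.standard_normal()
    return x

def langevin_ts(means, T, sigma=1.0):
    """Thompson sampling whose per-arm posterior draws come from
    Langevin dynamics instead of an exact sampler (N(0, 1) prior,
    N(theta, sigma^2) rewards)."""
    K = len(means)
    n = np.zeros(K)                # pull counts
    s = np.zeros(K)                # reward sums
    regret, best = 0.0, max(means)
    for _ in range(T):
        theta = np.empty(K)
        for i in range(K):
            # d/dx [log prior + log likelihood] = -x + (s_i - n_i * x) / sigma^2
            g = lambda x, i=i: -x + (s[i] - n[i] * x) / sigma**2
            theta[i] = langevin_draw(g, x0=s[i] / max(n[i], 1.0))
        i = int(theta.argmax())
        n[i] += 1
        s[i] += rng.normal(means[i], sigma)
        regret += best - means[i]
    return regret

print(langevin_ts([0.1, 0.4, 0.6], T=2_000))
```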