Parallelizing Thompson Sampling
- URL: http://arxiv.org/abs/2106.01420v1
- Date: Wed, 2 Jun 2021 18:51:57 GMT
- Title: Parallelizing Thompson Sampling
- Authors: Amin Karbasi, Vahab Mirrokni, Mohammad Shadravan
- Abstract summary: We introduce a batch Thompson Sampling framework for two canonical online decision making problems.
Our policy achieves the same (asymptotic) regret bound as a fully sequential one while carrying out only $O(\log T)$ batch queries.
We also demonstrate experimentally that dynamic batch allocation dramatically outperforms natural baselines such as static batch allocations.
- Score: 38.13649699234982
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How can we make use of information parallelism in online decision making
problems while efficiently balancing the exploration-exploitation trade-off? In
this paper, we introduce a batch Thompson Sampling framework for two canonical
online decision making problems, namely, stochastic multi-arm bandit and linear
contextual bandit with finitely many arms. Over a time horizon $T$, our
\textit{batch} Thompson Sampling policy achieves the same (asymptotic) regret
bound as a fully sequential one while carrying out only $O(\log T)$ batch
queries. To achieve this exponential reduction, i.e., reducing the number of
interactions from $T$ to $O(\log T)$, our batch policy dynamically determines
the duration of each batch in order to balance the exploration-exploitation
trade-off. We also demonstrate experimentally that dynamic batch allocation
dramatically outperforms natural baselines such as static batch allocations.
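
To make the batching mechanism concrete, here is a minimal Bernoulli-bandit sketch in which the per-arm Beta posteriors are refreshed only at batch boundaries. The doubling batch schedule used below already yields $O(\log T)$ posterior updates, but it is only an illustrative stand-in: the paper's policy sizes each batch dynamically from the state of the posterior, which is what the abstract credits for matching the fully sequential regret bound.

```python
import numpy as np

def batched_thompson_sampling(true_means, T, rng=np.random.default_rng(0)):
    """Bernoulli Thompson Sampling with batched posterior updates.

    Per-arm Beta(alpha, beta) posteriors are refreshed only at batch
    boundaries; within a batch, arms are drawn from the frozen posterior.
    Batch lengths double, giving O(log T) batches.  The paper's policy sizes
    batches dynamically from the posterior state; the doubling rule here is
    just an illustrative stand-in.
    """
    K = len(true_means)
    alpha = np.ones(K)           # Beta prior successes + 1
    beta = np.ones(K)            # Beta prior failures + 1
    t, batch_len, total_reward = 0, 1, 0.0

    while t < T:
        batch_len = min(batch_len, T - t)
        # Statistics collected during the batch, not yet folded into the posterior.
        pend_succ, pend_fail = np.zeros(K), np.zeros(K)
        for _ in range(batch_len):
            theta = rng.beta(alpha, beta)          # draw from the frozen posterior
            arm = int(np.argmax(theta))
            reward = float(rng.random() < true_means[arm])
            pend_succ[arm] += reward
            pend_fail[arm] += 1.0 - reward
            total_reward += reward
            t += 1
        # Single posterior update per batch (one "batch query").
        alpha += pend_succ
        beta += pend_fail
        batch_len *= 2                              # geometric batch schedule

    return total_reward

# Example: 3 arms, horizon 10_000 -> roughly log2(T) ~ 14 posterior updates.
print(batched_thompson_sampling([0.3, 0.5, 0.7], T=10_000))
```
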
Related papers
- Finite-Time Frequentist Regret Bounds of Multi-Agent Thompson Sampling on Sparse Hypergraphs [11.024467775280193]
We study the multi-agent multi-armed bandit (MAMAB) problem, where $m$ agents are factored into $\rho$ overlapping groups.
We propose an efficient variant of MATS, the $\epsilon$-exploring Multi-Agent Thompson Sampling ($\epsilon$-MATS) algorithm.
We prove that $\epsilon$-MATS achieves a worst-case frequentist regret bound that is sublinear in both the time horizon and the local arm size.
arXiv Detail & Related papers (2023-12-24T21:41:01Z)
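
As a stripped-down illustration of the $\epsilon$-exploring idea in the entry above, the sketch below mixes Thompson draws with posterior-mean greedy choices in a single-agent Bernoulli bandit. The multi-agent, hypergraph-factored structure that defines $\epsilon$-MATS is deliberately omitted, so this shows only the exploration mechanism, not the algorithm itself.

```python
import numpy as np

def epsilon_ts(true_means, T, eps=0.1, rng=np.random.default_rng(1)):
    """Single-agent sketch of the epsilon-exploring Thompson Sampling idea.

    With probability eps the arm is chosen from a posterior sample
    (exploration); otherwise the posterior-mean-greedy arm is pulled.
    The multi-agent, hypergraph-factored machinery of epsilon-MATS is
    intentionally left out.
    """
    K = len(true_means)
    alpha, beta = np.ones(K), np.ones(K)     # Beta(1, 1) priors per arm
    regret = 0.0
    best = max(true_means)
    for _ in range(T):
        if rng.random() < eps:
            scores = rng.beta(alpha, beta)   # Thompson draw
        else:
            scores = alpha / (alpha + beta)  # posterior mean (greedy)
        arm = int(np.argmax(scores))
        reward = float(rng.random() < true_means[arm])
        alpha[arm] += reward
        beta[arm] += 1.0 - reward
        regret += best - true_means[arm]
    return regret

print(epsilon_ts([0.2, 0.25, 0.3], T=20_000))
```
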
- Provably Efficient High-Dimensional Bandit Learning with Batched Feedbacks [93.00280593719513]
We study high-dimensional multi-armed contextual bandits with batched feedback, where the $T$ steps of online interaction are divided into $L$ batches.
Specifically, each batch collects data according to a policy that depends on previous batches, and the rewards are revealed only at the end of the batch.
Our algorithm achieves regret bounds comparable to those in the fully sequential setting with only $\mathcal{O}(\log T)$ batches.
arXiv Detail & Related papers (2023-11-22T06:06:54Z)
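
The batched-feedback protocol described in the entry above can be sketched for a sparse linear reward model: act under a frozen estimate within each batch, reveal the batch's rewards only at its end, then refit on all data collected so far. The Lasso estimator, the evenly spaced batch grid, and the greedy in-batch policy are illustrative assumptions, not necessarily the choices made in the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

def batched_sparse_linear_bandit(theta_star, T, n_batches=10, n_arms=20,
                                 rng=np.random.default_rng(2)):
    """Sketch of a high-dimensional linear contextual bandit with batched feedback.

    Within a batch the policy is frozen: arms are scored with the current Lasso
    estimate of theta (random pulls in the first batch).  Rewards are revealed
    only when the batch ends, after which the estimate is refit on all data.
    The Lasso estimator and the evenly spaced batch grid are illustrative.
    """
    d = len(theta_star)
    theta_hat = np.zeros(d)
    X_hist, y_hist = [], []
    grid = np.linspace(0, T, n_batches + 1, dtype=int)    # batch boundaries

    for start, end in zip(grid[:-1], grid[1:]):
        batch_X, batch_y = [], []
        for _ in range(start, end):
            contexts = rng.normal(size=(n_arms, d))        # one feature vector per arm
            scores = contexts @ theta_hat
            arm = int(np.argmax(scores)) if X_hist else int(rng.integers(n_arms))
            x = contexts[arm]
            reward = x @ theta_star + 0.1 * rng.normal()   # noisy linear reward
            batch_X.append(x)
            batch_y.append(reward)                          # withheld until batch end
        X_hist.extend(batch_X)
        y_hist.extend(batch_y)
        # Feedback for the whole batch arrives now; refit the sparse estimate.
        theta_hat = Lasso(alpha=0.05).fit(np.array(X_hist), np.array(y_hist)).coef_

    return theta_hat

theta_star = np.zeros(50)
theta_star[:3] = [1.0, -0.5, 0.8]                           # sparse ground truth
print(np.round(batched_sparse_linear_bandit(theta_star, T=2_000), 2)[:5])
```
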
- Langevin Thompson Sampling with Logarithmic Communication: Bandits and Reinforcement Learning [34.4255062106615]
Thompson sampling (TS) is widely used in sequential decision making due to its ease of use and appealing empirical performance.
We propose batched \textit{Langevin Thompson Sampling} algorithms that leverage MCMC methods to sample from approximate posteriors with only logarithmic communication costs in terms of batches.
Our algorithms are computationally efficient and maintain the same order-optimal regret guarantees of $\mathcal{O}(\log T)$ for MABs, and $\mathcal{O}(\sqrt{T})$ for RL.
arXiv Detail & Related papers (2023-06-15T01:16:29Z)
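
The entry above replaces exact posterior sampling with MCMC. A minimal version of that idea for Gaussian-reward arms: run a handful of unadjusted Langevin steps on each arm's log-posterior to obtain an approximate posterior draw. The Gaussian reward model, step sizes, and iteration counts below are illustrative assumptions, and the batched, logarithmic-communication aspect of the paper is omitted.

```python
import numpy as np

def langevin_posterior_sample(n, s, steps=50, rng=np.random.default_rng(3)):
    """Approximate posterior draw for one Gaussian-mean arm via unadjusted Langevin.

    Illustrative model: rewards ~ N(mu, 1), prior mu ~ N(0, 1).  With n pulls and
    reward sum s, the log-posterior gradient is  s - (n + 1) * mu, and the iteration
    mu <- mu + (eta / 2) * grad + sqrt(eta) * noise  approaches a posterior sample.
    """
    eta = 1.0 / (n + 1.0)              # step size scaled to the posterior precision
    mu = 0.0
    for _ in range(steps):
        grad = s - (n + 1.0) * mu      # gradient of the log posterior at mu
        mu += 0.5 * eta * grad + np.sqrt(eta) * rng.normal()
    return mu

def langevin_ts(true_means, T, rng=np.random.default_rng(3)):
    """Gaussian-bandit Thompson Sampling whose posterior draws come from Langevin steps."""
    K = len(true_means)
    n = np.zeros(K)                    # pull counts
    s = np.zeros(K)                    # reward sums
    total = 0.0
    for _ in range(T):
        draws = [langevin_posterior_sample(n[k], s[k], rng=rng) for k in range(K)]
        arm = int(np.argmax(draws))
        reward = true_means[arm] + rng.normal()
        n[arm] += 1.0
        s[arm] += reward
        total += reward
    return total

print(langevin_ts([0.0, 0.5, 1.0], T=1_000))
```
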
- Batched Thompson Sampling [0.0]
We introduce a novel anytime Batched Thompson sampling policy for multi-armed bandits.
We show that this policy simultaneously achieves a problem-dependent regret of order $O(\log(T))$ and a minimax regret of order $O(\sqrt{T\log(T)})$.
arXiv Detail & Related papers (2021-10-01T04:08:45Z)
- Asymptotic Performance of Thompson Sampling in the Batched Multi-Armed Bandits [0.0]
We show that Thompson sampling can maintain its performance even if it receives delayed feedback in batches.
We propose an adaptive scheme that reduces the number of batches to $\Theta(\log T)$ while maintaining the same performance.
arXiv Detail & Related papers (2021-10-01T01:28:40Z)
- Markov Decision Process modeled with Bandits for Sequential Decision Making in Linear-flow [73.1896399783641]
In membership/subscriber acquisition and retention, we sometimes need to recommend marketing content for multiple pages in sequence.
We propose to formulate the problem as an MDP with Bandits where Bandits are employed to model the transition probability matrix.
We observe the proposed MDP with Bandits algorithm outperforms Q-learning with $\epsilon$-greedy and decreasing $\epsilon$, independent Bandits, and interaction Bandits.
arXiv Detail & Related papers (2021-07-01T03:54:36Z)
- Batched Neural Bandits [107.5072688105936]
BatchNeuralUCB combines neural networks with optimism to address the exploration-exploitation tradeoff.
We prove that BatchNeuralUCB achieves the same regret as the fully sequential version while reducing the number of policy updates considerably.
arXiv Detail & Related papers (2021-02-25T17:36:44Z)
- An Efficient Algorithm For Generalized Linear Bandit: Online Stochastic Gradient Descent and Thompson Sampling [83.48992319018147]
We consider the contextual bandit problem, where a player sequentially makes decisions based on past observations to maximize the cumulative reward.
A natural way to resolve this problem is to apply online stochastic gradient descent (SGD) so that the per-step time and memory complexity can be reduced to a constant.
In this work, we show that online SGD can be applied to the generalized linear bandit problem.
The proposed SGD-TS algorithm, which uses a single-step SGD update to exploit past information, achieves $\tilde{O}(\sqrt{T})$ regret with a total time complexity that ...
arXiv Detail & Related papers (2020-06-07T01:12:39Z)
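
A minimal sketch of the single-step-SGD recipe described in the entry above, for a logistic (generalized linear) bandit: one stochastic-gradient step on the logistic loss per round keeps the per-step cost constant, and exploration comes from a Gaussian perturbation of the running estimate. The perturbation scale and step-size schedule are assumptions for illustration rather than the exact choices of SGD-TS.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_ts_logistic(theta_star, T, n_arms=10, lr=0.5, rng=np.random.default_rng(4)):
    """Sketch of single-step SGD estimation plus Thompson-style exploration
    for a logistic (generalized linear) contextual bandit.

    theta is refined with one stochastic-gradient step per round (constant
    per-step time and memory); actions are scored with a Gaussian-perturbed
    copy of theta whose noise scale decays as 1/sqrt(t).  These schedules are
    illustrative, not the exact choices of SGD-TS.
    """
    d = len(theta_star)
    theta = np.zeros(d)
    total_reward = 0.0
    for t in range(1, T + 1):
        contexts = rng.normal(size=(n_arms, d)) / np.sqrt(d)    # per-arm features
        theta_tilde = theta + rng.normal(size=d) / np.sqrt(t)   # exploration noise
        arm = int(np.argmax(contexts @ theta_tilde))
        x = contexts[arm]
        reward = float(rng.random() < sigmoid(x @ theta_star))  # Bernoulli reward
        grad = (sigmoid(x @ theta) - reward) * x                 # logistic-loss gradient
        theta -= (lr / np.sqrt(t)) * grad                        # single SGD step
        total_reward += reward
    return total_reward, theta

print(sgd_ts_logistic(np.array([1.0, -1.0, 0.5, 0.0, 0.0]), T=5_000)[0])
```
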
- MOTS: Minimax Optimal Thompson Sampling [89.2370817955411]
It has remained an open problem whether Thompson sampling can match the minimax lower bound $\Omega(\sqrt{KT})$ for $K$-armed bandit problems.
We propose a variant of Thompson sampling called MOTS that adaptively clips the sampling instance of the chosen arm at each time step.
We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound $O(\sqrt{KT})$ for finite time horizon $T$, as well as the optimal regret bound for Gaussian rewards when $T$ approaches infinity.
arXiv Detail & Related papers (2020-03-03T21:24:39Z)
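
The clipping mechanism described in the MOTS entry above can be sketched for Gaussian rewards: each arm's Thompson sample is drawn around its empirical mean and then capped at a confidence-style threshold, so frequently pulled arms cannot produce overly optimistic draws. The threshold shape and constants below follow the spirit of that description and are illustrative, not the tuned values from the paper.

```python
import numpy as np

def mots_style_bandit(true_means, T, rho=0.9, alpha=4.0,
                      rng=np.random.default_rng(5)):
    """Sketch of clipped (MOTS-style) Thompson Sampling for Gaussian rewards.

    Each arm's sample is drawn from N(mean_i, 1 / (rho * n_i)) and then capped
    at mean_i + sqrt((alpha / n_i) * max(0, log(T / (K * n_i)))), so heavily
    pulled arms cannot generate wildly optimistic samples.  The threshold and
    constants here are illustrative, not the tuned values from the paper.
    """
    K = len(true_means)
    n = np.zeros(K)
    mean = np.zeros(K)

    # Pull each arm once to initialise the empirical means.
    for arm in range(K):
        r = true_means[arm] + rng.normal()
        n[arm], mean[arm] = 1.0, r

    regret = 0.0
    best = max(true_means)
    for _ in range(K, T):
        theta = rng.normal(mean, np.sqrt(1.0 / (rho * n)))
        cap = mean + np.sqrt((alpha / n) * np.maximum(0.0, np.log(T / (K * n))))
        arm = int(np.argmax(np.minimum(theta, cap)))       # clipped Thompson draw
        r = true_means[arm] + rng.normal()
        n[arm] += 1.0
        mean[arm] += (r - mean[arm]) / n[arm]
        regret += best - true_means[arm]
    return regret

print(mots_style_bandit([0.0, 0.2, 0.5], T=10_000))
```
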
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.