Parallelizing Thompson Sampling
- URL: http://arxiv.org/abs/2106.01420v1
- Date: Wed, 2 Jun 2021 18:51:57 GMT
- Title: Parallelizing Thompson Sampling
- Authors: Amin Karbasi, Vahab Mirrokni, Mohammad Shadravan
- Abstract summary: We introduce a batch Thompson Sampling framework for two canonical online decision making problems.
Our policy achieves the same (asymptotic) regret bound as a fully sequential one while carrying out only $O(\log T)$ batch queries.
We also demonstrate experimentally that dynamic batch allocation dramatically outperforms natural baselines such as static batch allocations.
- Score: 38.13649699234982
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How can we make use of information parallelism in online decision making
problems while efficiently balancing the exploration-exploitation trade-off? In
this paper, we introduce a batch Thompson Sampling framework for two canonical
online decision making problems, namely, stochastic multi-arm bandit and linear
contextual bandit with finitely many arms. Over a time horizon $T$, our
\textit{batch} Thompson Sampling policy achieves the same (asymptotic) regret
bound as a fully sequential one while carrying out only $O(\log T)$ batch
queries. To achieve this exponential reduction, i.e., reducing the number of
interactions from $T$ to $O(\log T)$, our batch policy dynamically determines
the duration of each batch in order to balance the exploration-exploitation
trade-off. We also demonstrate experimentally that dynamic batch allocation
dramatically outperforms natural baselines such as static batch allocations.
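
To make the batching mechanism concrete, here is a minimal Bernoulli-bandit sketch in which the per-arm Beta posteriors are refreshed only at batch boundaries. The doubling batch schedule used below already yields $O(\log T)$ posterior updates, but it is only an illustrative stand-in: the paper's policy sizes each batch dynamically from the state of the posterior, which is what the abstract credits for matching the fully sequential regret bound.

```python
import numpy as np

def batched_thompson_sampling(true_means, T, rng=np.random.default_rng(0)):
    """Bernoulli Thompson Sampling with batched posterior updates.

    Per-arm Beta(alpha, beta) posteriors are refreshed only at batch
    boundaries; within a batch, arms are drawn from the frozen posterior.
    Batch lengths double, giving O(log T) batches.  The paper's policy sizes
    batches dynamically from the posterior state; the doubling rule here is
    just an illustrative stand-in.
    """
    K = len(true_means)
    alpha = np.ones(K)           # Beta prior successes + 1
    beta = np.ones(K)            # Beta prior failures + 1
    t, batch_len, total_reward = 0, 1, 0.0

    while t < T:
        batch_len = min(batch_len, T - t)
        # Statistics collected during the batch, not yet folded into the posterior.
        pend_succ, pend_fail = np.zeros(K), np.zeros(K)
        for _ in range(batch_len):
            theta = rng.beta(alpha, beta)          # draw from the frozen posterior
            arm = int(np.argmax(theta))
            reward = float(rng.random() < true_means[arm])
            pend_succ[arm] += reward
            pend_fail[arm] += 1.0 - reward
            total_reward += reward
            t += 1
        # Single posterior update per batch (one "batch query").
        alpha += pend_succ
        beta += pend_fail
        batch_len *= 2                              # geometric batch schedule

    return total_reward

# Example: 3 arms, horizon 10_000 -> roughly log2(T) ~ 14 posterior updates.
print(batched_thompson_sampling([0.3, 0.5, 0.7], T=10_000))
```
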
Related papers
- Finite-Time Frequentist Regret Bounds of Multi-Agent Thompson Sampling on Sparse Hypergraphs [11.024467775280193]
We study the multi-agent multi-armed bandit (MAMAB) problem, where $m$ agents are factored into $\rho$ overlapping groups.
We propose an efficient variant of MATS, the $\epsilon$-exploring Multi-Agent Thompson Sampling ($\epsilon$-MATS) algorithm.
We prove that $\epsilon$-MATS achieves a worst-case frequentist regret bound that is sublinear in both the time horizon and the local arm size.
arXiv Detail & Related papers (2023-12-24T21:41:01Z)
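
As a stripped-down illustration of the $\epsilon$-exploring idea in the entry above, the sketch below mixes Thompson draws with posterior-mean greedy choices in a single-agent Bernoulli bandit. The multi-agent, hypergraph-factored structure that defines $\epsilon$-MATS is deliberately omitted, so this shows only the exploration mechanism, not the algorithm itself.

```python
import numpy as np

def epsilon_ts(true_means, T, eps=0.1, rng=np.random.default_rng(1)):
    """Single-agent sketch of the epsilon-exploring Thompson Sampling idea.

    With probability eps the arm is chosen from a posterior sample
    (exploration); otherwise the posterior-mean-greedy arm is pulled.
    The multi-agent, hypergraph-factored machinery of epsilon-MATS is
    intentionally left out.
    """
    K = len(true_means)
    alpha, beta = np.ones(K), np.ones(K)     # Beta(1, 1) priors per arm
    regret = 0.0
    best = max(true_means)
    for _ in range(T):
        if rng.random() < eps:
            scores = rng.beta(alpha, beta)   # Thompson draw
        else:
            scores = alpha / (alpha + beta)  # posterior mean (greedy)
        arm = int(np.argmax(scores))
        reward = float(rng.random() < true_means[arm])
        alpha[arm] += reward
        beta[arm] += 1.0 - reward
        regret += best - true_means[arm]
    return regret

print(epsilon_ts([0.2, 0.25, 0.3], T=20_000))
```
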
- Provably Efficient High-Dimensional Bandit Learning with Batched Feedbacks [93.00280593719513]
We study high-dimensional multi-armed contextual bandits with batched feedback, where the $T$ steps of online interaction are divided into $L$ batches.
Specifically, each batch collects data according to a policy that depends on previous batches, and the rewards are revealed only at the end of the batch.
Our algorithm achieves regret bounds comparable to those in the fully sequential setting with only $\mathcal{O}(\log T)$ batches.
arXiv Detail & Related papers (2023-11-22T06:06:54Z)
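
The batched-feedback protocol described in the entry above can be sketched for a sparse linear reward model: act under a frozen estimate within each batch, reveal the batch's rewards only at its end, then refit on all data collected so far. The Lasso estimator, the evenly spaced batch grid, and the greedy in-batch policy are illustrative assumptions, not necessarily the choices made in the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

def batched_sparse_linear_bandit(theta_star, T, n_batches=10, n_arms=20,
                                 rng=np.random.default_rng(2)):
    """Sketch of a high-dimensional linear contextual bandit with batched feedback.

    Within a batch the policy is frozen: arms are scored with the current Lasso
    estimate of theta (random pulls in the first batch).  Rewards are revealed
    only when the batch ends, after which the estimate is refit on all data.
    The Lasso estimator and the evenly spaced batch grid are illustrative.
    """
    d = len(theta_star)
    theta_hat = np.zeros(d)
    X_hist, y_hist = [], []
    grid = np.linspace(0, T, n_batches + 1, dtype=int)    # batch boundaries

    for start, end in zip(grid[:-1], grid[1:]):
        batch_X, batch_y = [], []
        for _ in range(start, end):
            contexts = rng.normal(size=(n_arms, d))        # one feature vector per arm
            scores = contexts @ theta_hat
            arm = int(np.argmax(scores)) if X_hist else int(rng.integers(n_arms))
            x = contexts[arm]
            reward = x @ theta_star + 0.1 * rng.normal()   # noisy linear reward
            batch_X.append(x)
            batch_y.append(reward)                          # withheld until batch end
        X_hist.extend(batch_X)
        y_hist.extend(batch_y)
        # Feedback for the whole batch arrives now; refit the sparse estimate.
        theta_hat = Lasso(alpha=0.05).fit(np.array(X_hist), np.array(y_hist)).coef_

    return theta_hat

theta_star = np.zeros(50)
theta_star[:3] = [1.0, -0.5, 0.8]                           # sparse ground truth
print(np.round(batched_sparse_linear_bandit(theta_star, T=2_000), 2)[:5])
```
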
- Langevin Thompson Sampling with Logarithmic Communication: Bandits and Reinforcement Learning [34.4255062106615]
Thompson sampling (TS) is widely used in sequential decision making due to its ease of use and appealing empirical performance.
We propose batched \textit{Langevin Thompson Sampling} algorithms that leverage MCMC methods to sample from approximate posteriors with only logarithmic communication costs in terms of batches.
Our algorithms are computationally efficient and maintain the same order-optimal regret guarantees of $\mathcal{O}(\log T)$ for MABs, and $\mathcal{O}(\sqrt{T})$ for RL.
arXiv Detail & Related papers (2023-06-15T01:16:29Z)
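
The entry above replaces exact posterior sampling with MCMC. A minimal version of that idea for Gaussian-reward arms: run a handful of unadjusted Langevin steps on each arm's log-posterior to obtain an approximate posterior draw. The Gaussian reward model, step sizes, and iteration counts below are illustrative assumptions, and the batched, logarithmic-communication aspect of the paper is omitted.

```python
import numpy as np

def langevin_posterior_sample(n, s, steps=50, rng=np.random.default_rng(3)):
    """Approximate posterior draw for one Gaussian-mean arm via unadjusted Langevin.

    Illustrative model: rewards ~ N(mu, 1), prior mu ~ N(0, 1).  With n pulls and
    reward sum s, the log-posterior gradient is  s - (n + 1) * mu, and the iteration
    mu <- mu + (eta / 2) * grad + sqrt(eta) * noise  approaches a posterior sample.
    """
    eta = 1.0 / (n + 1.0)              # step size scaled to the posterior precision
    mu = 0.0
    for _ in range(steps):
        grad = s - (n + 1.0) * mu      # gradient of the log posterior at mu
        mu += 0.5 * eta * grad + np.sqrt(eta) * rng.normal()
    return mu

def langevin_ts(true_means, T, rng=np.random.default_rng(3)):
    """Gaussian-bandit Thompson Sampling whose posterior draws come from Langevin steps."""
    K = len(true_means)
    n = np.zeros(K)                    # pull counts
    s = np.zeros(K)                    # reward sums
    total = 0.0
    for _ in range(T):
        draws = [langevin_posterior_sample(n[k], s[k], rng=rng) for k in range(K)]
        arm = int(np.argmax(draws))
        reward = true_means[arm] + rng.normal()
        n[arm] += 1.0
        s[arm] += reward
        total += reward
    return total

print(langevin_ts([0.0, 0.5, 1.0], T=1_000))
```
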
- Batched Thompson Sampling [0.0]
We introduce a novel anytime Batched Thompson sampling policy for multi-armed bandits.
We show that this policy simultaneously achieves a problem-dependent regret of order $O(\log(T))$ and a minimax regret of order $O(\sqrt{T\log(T)})$.
arXiv Detail & Related papers (2021-10-01T04:08:45Z)
- Asymptotic Performance of Thompson Sampling in the Batched Multi-Armed Bandits [0.0]
We show that Thompson sampling can maintain its performance even if it receives delayed feedback in batches.
We propose an adaptive scheme that reduces the number of batches to $\Theta(\log T)$ while maintaining the same performance.
arXiv Detail & Related papers (2021-10-01T01:28:40Z)
- Markov Decision Process modeled with Bandits for Sequential Decision Making in Linear-flow [73.1896399783641]
In membership/subscriber acquisition and retention, we sometimes need to recommend marketing content for multiple pages in sequence.
We propose to formulate the problem as an MDP with Bandits where Bandits are employed to model the transition probability matrix.
We observe the proposed MDP with Bandits algorithm outperforms Q-learning with $\epsilon$-greedy and decreasing $\epsilon$, independent Bandits, and interaction Bandits.
arXiv Detail & Related papers (2021-07-01T03:54:36Z)
- Batched Neural Bandits [107.5072688105936]
BatchNeuralUCB combines neural networks with optimism to address the exploration-exploitation tradeoff.
We prove that BatchNeuralUCB achieves the same regret as the fully sequential version while reducing the number of policy updates considerably.
arXiv Detail & Related papers (2021-02-25T17:36:44Z)
- An Efficient Algorithm For Generalized Linear Bandit: Online Stochastic Gradient Descent and Thompson Sampling [83.48992319018147]
We consider the contextual bandit problem, where a player sequentially makes decisions based on past observations to maximize the cumulative reward.
A natural way to resolve this problem is to apply online stochastic gradient descent (SGD) so that the per-step time and memory complexity can be reduced to a constant.
In this work, we show that online SGD can be applied to the generalized linear bandit problem.
The proposed SGD-TS algorithm, which uses a single-step SGD update to exploit past information, achieves $\tilde{O}(\sqrt{T})$ regret with a total time complexity that ...
arXiv Detail & Related papers (2020-06-07T01:12:39Z)
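
A minimal sketch of the single-step-SGD recipe described in the entry above, for a logistic (generalized linear) bandit: one stochastic-gradient step on the logistic loss per round keeps the per-step cost constant, and exploration comes from a Gaussian perturbation of the running estimate. The perturbation scale and step-size schedule are assumptions for illustration rather than the exact choices of SGD-TS.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_ts_logistic(theta_star, T, n_arms=10, lr=0.5, rng=np.random.default_rng(4)):
    """Sketch of single-step SGD estimation plus Thompson-style exploration
    for a logistic (generalized linear) contextual bandit.

    theta is refined with one stochastic-gradient step per round (constant
    per-step time and memory); actions are scored with a Gaussian-perturbed
    copy of theta whose noise scale decays as 1/sqrt(t).  These schedules are
    illustrative, not the exact choices of SGD-TS.
    """
    d = len(theta_star)
    theta = np.zeros(d)
    total_reward = 0.0
    for t in range(1, T + 1):
        contexts = rng.normal(size=(n_arms, d)) / np.sqrt(d)    # per-arm features
        theta_tilde = theta + rng.normal(size=d) / np.sqrt(t)   # exploration noise
        arm = int(np.argmax(contexts @ theta_tilde))
        x = contexts[arm]
        reward = float(rng.random() < sigmoid(x @ theta_star))  # Bernoulli reward
        grad = (sigmoid(x @ theta) - reward) * x                 # logistic-loss gradient
        theta -= (lr / np.sqrt(t)) * grad                        # single SGD step
        total_reward += reward
    return total_reward, theta

print(sgd_ts_logistic(np.array([1.0, -1.0, 0.5, 0.0, 0.0]), T=5_000)[0])
```
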
- MOTS: Minimax Optimal Thompson Sampling [89.2370817955411]
It has remained an open problem whether Thompson sampling can match the minimax lower bound $\Omega(\sqrt{KT})$ for $K$-armed bandit problems.
We propose a variant of Thompson sampling called MOTS that adaptively clips the sampling instance of the chosen arm at each time step.
We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound $O(\sqrt{KT})$ for finite time horizon $T$, as well as the optimal regret bound for Gaussian rewards when $T$ approaches infinity.
arXiv Detail & Related papers (2020-03-03T21:24:39Z)
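
The clipping mechanism described in the MOTS entry above can be sketched for Gaussian rewards: each arm's Thompson sample is drawn around its empirical mean and then capped at a confidence-style threshold, so frequently pulled arms cannot produce overly optimistic draws. The threshold shape and constants below follow the spirit of that description and are illustrative, not the tuned values from the paper.

```python
import numpy as np

def mots_style_bandit(true_means, T, rho=0.9, alpha=4.0,
                      rng=np.random.default_rng(5)):
    """Sketch of clipped (MOTS-style) Thompson Sampling for Gaussian rewards.

    Each arm's sample is drawn from N(mean_i, 1 / (rho * n_i)) and then capped
    at mean_i + sqrt((alpha / n_i) * max(0, log(T / (K * n_i)))), so heavily
    pulled arms cannot generate wildly optimistic samples.  The threshold and
    constants here are illustrative, not the tuned values from the paper.
    """
    K = len(true_means)
    n = np.zeros(K)
    mean = np.zeros(K)

    # Pull each arm once to initialise the empirical means.
    for arm in range(K):
        r = true_means[arm] + rng.normal()
        n[arm], mean[arm] = 1.0, r

    regret = 0.0
    best = max(true_means)
    for _ in range(K, T):
        theta = rng.normal(mean, np.sqrt(1.0 / (rho * n)))
        cap = mean + np.sqrt((alpha / n) * np.maximum(0.0, np.log(T / (K * n))))
        arm = int(np.argmax(np.minimum(theta, cap)))       # clipped Thompson draw
        r = true_means[arm] + rng.normal()
        n[arm] += 1.0
        mean[arm] += (r - mean[arm]) / n[arm]
        regret += best - true_means[arm]
    return regret

print(mots_style_bandit([0.0, 0.2, 0.5], T=10_000))
```
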
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.