Batched Thompson Sampling
- URL: http://arxiv.org/abs/2110.00202v1
- Date: Fri, 1 Oct 2021 04:08:45 GMT
- Title: Batched Thompson Sampling
- Authors: Cem Kalkanli and Ayfer Ozgur
- Abstract summary: We introduce a novel anytime Batched Thompson sampling policy for multi-armed bandits.
We show that this policy simultaneously achieves a problem dependent regret of order $O(\log(T))$ and a minimax regret of order $O(\sqrt{T\log(T)})$.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a novel anytime Batched Thompson sampling policy for multi-armed
bandits where the agent observes the rewards of her actions and adjusts her
policy only at the end of a small number of batches. We show that this policy
simultaneously achieves a problem dependent regret of order $O(\log(T))$ and a
minimax regret of order $O(\sqrt{T\log(T)})$ while the number of batches can be
bounded by $O(\log(T))$ independent of the problem instance over a time horizon
$T$. We also show that in expectation the number of batches used by our policy
can be bounded by an instance dependent bound of order $O(\log\log(T))$. These
results indicate that Thompson sampling maintains the same performance in this
batched setting as in the case when instantaneous feedback is available after
each action, while requiring minimal feedback. These results also indicate that
Thompson sampling performs competitively with recently proposed algorithms
tailored for the batched setting. These algorithms optimize the batch structure
for a given time horizon $T$ and prioritize exploration in the beginning of the
experiment to eliminate suboptimal actions. We show that Thompson sampling
combined with an adaptive batching strategy can achieve a similar performance
without knowing the time horizon $T$ of the problem and without having to
carefully optimize the batch structure to achieve a target regret bound (i.e.
problem dependent vs minimax regret) for a given $T$.
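To make the batched-update mechanics described above concrete, here is a minimal Python sketch of an anytime batched Thompson sampling loop for Bernoulli rewards. The `batched_thompson_sampling` function, the Beta(1, 1) priors, and the doubling rule that ends a batch when some arm's pull count doubles since the last posterior update are illustrative assumptions for this sketch, not the paper's exact policy.

```python
import numpy as np

def batched_thompson_sampling(means, horizon, rng=None):
    """Illustrative batched Thompson sampling for Bernoulli bandits.

    Posterior updates happen only at batch boundaries; within a batch the
    policy keeps sampling from the stale posterior. A batch ends when some
    arm's pull count has doubled since the last update (an assumed adaptive
    rule for this sketch, not necessarily the paper's exact rule).
    """
    rng = np.random.default_rng() if rng is None else rng
    k = len(means)
    alpha, beta = np.ones(k), np.ones(k)      # Beta posterior parameters
    pulls = np.zeros(k, dtype=int)            # total pulls per arm
    pulls_at_update = np.ones(k, dtype=int)   # pull counts at the last posterior update
    pending = []                              # (arm, reward) pairs awaiting the next update
    regret, n_batches = 0.0, 0

    for _ in range(horizon):
        theta = rng.beta(alpha, beta)         # posterior samples (possibly stale)
        arm = int(np.argmax(theta))
        reward = float(rng.random() < means[arm])
        pending.append((arm, reward))
        pulls[arm] += 1
        regret += max(means) - means[arm]

        # Adaptive batch boundary: some arm's pull count has doubled
        # since the last time the posterior was refreshed.
        if np.any(pulls >= 2 * pulls_at_update):
            for a, r in pending:
                alpha[a] += r
                beta[a] += 1.0 - r
            pending.clear()
            pulls_at_update = np.maximum(pulls, 1)
            n_batches += 1

    return regret, n_batches

# Example: three Bernoulli arms over a horizon of 10,000 rounds.
if __name__ == "__main__":
    total_regret, batches = batched_thompson_sampling([0.4, 0.5, 0.6], 10_000)
    print(f"regret={total_regret:.1f}, batches used={batches}")
```

With a doubling-style rule of this kind the number of posterior updates grows only logarithmically in the horizon, which is the qualitative behavior the abstract describes, while each per-round decision is still an ordinary Thompson sampling draw.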
Related papers
- Efficient and Adaptive Posterior Sampling Algorithms for Bandits [5.050520326139362]
We study Thompson Sampling-based algorithms for bandits with bounded rewards.
We propose two parameterized Thompson Sampling-based algorithms.
Both algorithms achieve an $O\left(K\ln^{\alpha+1}(T)/\Delta\right)$ regret bound, where $K$ is the number of arms, $T$ is the finite learning horizon, and $\Delta$ denotes the single-round performance loss when pulling a sub-optimal arm.
arXiv Detail & Related papers (2024-05-02T05:24:28Z) - Provably Efficient High-Dimensional Bandit Learning with Batched Feedbacks [93.00280593719513]
We study high-dimensional multi-armed contextual bandits with batched feedback where the $T$ steps of online interactions are divided into $L$ batches.
Specifically, each batch collects data according to a policy that depends on previous batches, and the rewards are revealed only at the end of the batch.
Our algorithm achieves regret bounds comparable to those in the fully sequential setting with only $\mathcal{O}(\log T)$ batches.
arXiv Detail & Related papers (2023-11-22T06:06:54Z) - Instance-optimal PAC Algorithms for Contextual Bandits [20.176752818200438]
In this work, we focus on the bandit problem in the $(\epsilon,\delta)$-PAC setting.
We show that no algorithm can simultaneously be minimax-optimal for regret minimization and instance-dependent PAC for best-arm identification.
arXiv Detail & Related papers (2022-07-05T23:19:43Z) - Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits [88.21288104408556]
We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution is from a one-dimensional exponential family.
We propose a Thompson sampling algorithm, termed Expulli, which uses a novel sampling distribution to avoid under-estimation of the optimal arm.
arXiv Detail & Related papers (2022-06-07T18:08:21Z) - Asymptotic Performance of Thompson Sampling in the Batched Multi-Armed Bandits [0.0]
We show that Thompson sampling can maintain its performance even if it receives delayed feedback in batches.
We propose an adaptive scheme that reduces the number of batches to $\Theta(\log T)$ while maintaining the same performance.
arXiv Detail & Related papers (2021-10-01T01:28:40Z) - Parallelizing Thompson Sampling [38.13649699234982]
We introduce a batch Thompson Sampling framework for two canonical online decision making problems.
Our policy achieves the same (asymptotic) regret bound as a fully sequential one while carrying out only $O(\log T)$ batch queries.
We also demonstrate experimentally that dynamic batch allocation dramatically outperforms natural baselines such as static batch allocations.
arXiv Detail & Related papers (2021-06-02T18:51:57Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Analysis and Design of Thompson Sampling for Stochastic Partial Monitoring [91.22679787578438]
We present a novel Thompson-sampling-based algorithm for partial monitoring.
We prove that the new algorithm achieves the logarithmic problem-dependent expected pseudo-regret $\mathrm{O}(\log T)$ for a linearized variant of the problem with local observability.
arXiv Detail & Related papers (2020-06-17T05:48:33Z) - An Efficient Algorithm For Generalized Linear Bandit: Online Stochastic Gradient Descent and Thompson Sampling [83.48992319018147]
We consider the contextual bandit problem, where a player sequentially makes decisions based on past observations to maximize the cumulative reward.
A natural way to resolve this problem is to apply online stochastic gradient descent (SGD) so that the per-step time and memory complexity can be reduced to constant.
In this work, we show that online SGD can be applied to the generalized linear bandit problem.
The proposed SGD-TS algorithm, which uses a single-step SGD update to exploit past information, achieves $\tilde{O}(\sqrt{T})$ regret with a total time complexity that…
arXiv Detail & Related papers (2020-06-07T01:12:39Z) - MOTS: Minimax Optimal Thompson Sampling [89.2370817955411]
It has remained an open problem whether Thompson sampling can match the minimax lower bound $\Omega(\sqrt{KT})$ for $K$-armed bandit problems.
We propose a variant of Thompson sampling called MOTS that adaptively clips the sampling instance of the chosen arm at each time step (an illustrative sketch of this clipping idea appears after this list).
We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound $O(\sqrt{KT})$ for finite time horizon $T$, as well as the optimal regret bound for Gaussian rewards when $T$ approaches infinity.
arXiv Detail & Related papers (2020-03-03T21:24:39Z)
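As a rough illustration of the clipping idea described in the MOTS entry above, the snippet below draws a Gaussian Thompson sample per arm and caps it at an upper-confidence level. The function name `clipped_thompson_draw`, the Gaussian posterior, and the specific cap formula are assumptions made for this sketch and are not taken from the MOTS paper.

```python
import numpy as np

def clipped_thompson_draw(emp_mean, pulls, horizon, n_arms, alpha=4.0, rng=None):
    """Draw one clipped Thompson sample per arm (illustrative only).

    Each arm gets a Gaussian posterior sample around its empirical mean,
    then the sample is clipped at an assumed upper-confidence cap so that
    overly optimistic draws are suppressed.
    """
    rng = np.random.default_rng() if rng is None else rng
    pulls = np.maximum(np.asarray(pulls, dtype=float), 1.0)
    emp_mean = np.asarray(emp_mean, dtype=float)

    theta = rng.normal(emp_mean, 1.0 / np.sqrt(pulls))    # plain Thompson draw
    log_term = np.maximum(np.log(horizon / (n_arms * pulls)), 0.0)
    cap = emp_mean + np.sqrt(alpha * log_term / pulls)    # assumed confidence cap
    return np.minimum(theta, cap)                         # clipped sample

# Example: pick the arm with the largest clipped sample.
samples = clipped_thompson_draw([0.4, 0.55, 0.6], [120, 80, 30], horizon=10_000, n_arms=3)
arm = int(np.argmax(samples))
```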
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.