Doubly-Adaptive Thompson Sampling for Multi-Armed and Contextual Bandits
- URL: http://arxiv.org/abs/2102.13202v1
- Date: Thu, 25 Feb 2021 22:29:25 GMT
- Title: Doubly-Adaptive Thompson Sampling for Multi-Armed and Contextual Bandits
- Authors: Maria Dimakopoulou, Zhimei Ren, Zhengyuan Zhou
- Abstract summary: We propose a variant of a Thompson sampling based algorithm that adaptively re-weights the terms of a doubly robust estimator of the true mean reward of each arm.
The proposed algorithm matches the optimal (minimax) regret rate, and we evaluate it empirically in a semi-synthetic experiment.
We extend this approach to contextual bandits, where there are more sources of bias present apart from the adaptive data collection.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To balance exploration and exploitation, multi-armed bandit algorithms need
to conduct inference on the true mean reward of each arm in every time step
using the data collected so far. However, the history of arms and rewards
observed up to that time step is adaptively collected and there are known
challenges in conducting inference with non-iid data. In particular, sample
averages, which play a prominent role in traditional upper confidence bound
algorithms and traditional Thompson sampling algorithms, are neither unbiased
nor asymptotically normal. We propose a variant of a Thompson sampling based
algorithm that leverages recent advances in the causal inference literature and
adaptively re-weights the terms of a doubly robust estimator of the true mean
reward of each arm -- hence its name doubly-adaptive Thompson sampling. The
regret of the proposed algorithm matches the optimal (minimax) regret rate, and
we evaluate it empirically in a semi-synthetic experiment based on data from a
randomized controlled trial of a web service: the proposed doubly-adaptive
Thompson sampling shows superior empirical performance to existing baselines in
terms of cumulative regret and statistical power in identifying the best arm.
Further, we extend this approach to contextual
bandits, where there are more sources of bias present apart from the adaptive
data collection -- such as the mismatch between the true data generating
process and the reward model assumptions or the unequal representations of
certain regions of the context space in initial stages of learning -- and
propose the linear contextual doubly-adaptive Thompson sampling and the
non-parametric contextual doubly-adaptive Thompson sampling extensions of our
approach.
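To make the mechanism concrete, here is a minimal sketch of an adaptively weighted doubly robust estimator inside a Thompson sampling loop for a K-armed bandit. The square-root-propensity weights, the propensity clipping, and the Gaussian posterior approximation are illustrative assumptions, not the paper's exact specification.
```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 5, 3000
true_means = rng.uniform(0.2, 0.8, size=K)

weights = [[] for _ in range(K)]   # per-arm adaptive weights
scores = [[] for _ in range(K)]    # per-arm doubly robust scores
pull_counts, reward_sums = np.zeros(K), np.zeros(K)

def posterior_params():
    """Adaptively weighted DR mean and a plug-in variance per arm."""
    means, variances = np.full(K, 0.5), np.ones(K)
    for a in range(K):
        if len(scores[a]) < 2:
            continue
        w, g = np.array(weights[a]), np.array(scores[a])
        means[a] = np.sum(w * g) / np.sum(w)
        variances[a] = max(np.sum(w**2 * (g - means[a]) ** 2) / np.sum(w) ** 2, 1e-6)
    return means, variances

def sampling_probs(means, variances, n_mc=500):
    """Monte Carlo estimate of the Thompson arm-selection propensities."""
    draws = rng.normal(means, np.sqrt(variances), size=(n_mc, K))
    counts = np.bincount(draws.argmax(axis=1), minlength=K)
    return np.clip(counts / n_mc, 1e-3, 1.0)  # clip to keep IPW terms bounded

for t in range(T):
    means, variances = posterior_params()
    pi = sampling_probs(means, variances)
    arm = rng.choice(K, p=pi / pi.sum())       # Thompson draw via the propensities
    reward = true_means[arm] + 0.1 * rng.standard_normal()
    # Plug-in reward model built only from past data (running sample means).
    mu_hat = np.where(pull_counts > 0, reward_sums / np.maximum(pull_counts, 1), 0.5)
    for a in range(K):
        # Doubly robust score: plug-in estimate plus an inverse-propensity correction.
        correction = (reward - mu_hat[a]) / pi[a] if a == arm else 0.0
        scores[a].append(mu_hat[a] + correction)
        weights[a].append(np.sqrt(pi[a]))      # variance-stabilizing adaptive weight
    pull_counts[arm] += 1
    reward_sums[arm] += reward

print("estimated:", np.round(posterior_params()[0], 3))
print("true:     ", np.round(true_means, 3))
```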
Related papers
- Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
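As a rough, generic illustration of the estimation step described above (not the paper's algorithm, which also layers exploration on top of the network), a small network can be fit to pairwise preferences through the Bradley-Terry likelihood; the architecture and hyperparameters below are arbitrary:
```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 8
true_w = torch.randn(d)   # hidden linear utility generating the preferences

net = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for step in range(2000):
    # Two candidate arms per round; a noisy comparison prefers the better one.
    x1, x2 = torch.randn(d), torch.randn(d)
    p_pref = torch.sigmoid(x1 @ true_w - x2 @ true_w)   # Bradley-Terry preference prob
    y = torch.bernoulli(p_pref)                          # 1 if arm 1 preferred
    logit = net(x1) - net(x2)                            # predicted preference logit
    loss = nn.functional.binary_cross_entropy_with_logits(logit.squeeze(), y)
    opt.zero_grad(); loss.backward(); opt.step()

# Sanity check: learned scores should correlate with the true utility.
xs = torch.randn(100, d)
corr = torch.corrcoef(torch.stack([net(xs).squeeze(), xs @ true_w]))[0, 1]
print("score/utility correlation:", corr.item())
```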
arXiv Detail & Related papers (2024-07-24T09:23:22Z)
- VITS: Variational Inference Thompson Sampling for contextual bandits [10.028119153832346]
We introduce and analyze a variant of the Thompson sampling (TS) algorithm for contextual bandits.
We propose a new algorithm, Variational Inference Thompson Sampling (VITS), based on Gaussian variational inference.
We show that VITS achieves a sub-linear regret bound of the same order in the dimension and number of rounds as traditional TS for linear contextual bandits.
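A sketch of the core loop under simple assumptions (mean-field Gaussian variational family, linear rewards, reparameterized ELBO gradients); the actual VITS algorithm may differ in its variational family and step-size schedule:
```python
import numpy as np

rng = np.random.default_rng(1)
d, K, T = 8, 10, 500
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)

m, log_s = np.zeros(d), np.zeros(d)   # q(theta) = N(m, diag(exp(2 * log_s)))
X_hist, y_hist = [], []
NOISE_VAR = 0.25

def elbo_step(m, log_s, X, y, lr):
    """One reparameterized gradient-ascent step on the ELBO
    (Gaussian likelihood with variance NOISE_VAR, standard normal prior)."""
    s = np.exp(log_s)
    eps = rng.standard_normal(d)
    theta = m + s * eps                              # reparameterized sample
    g = X.T @ (y - X @ theta) / NOISE_VAR            # d log-likelihood / d theta
    grad_m = g - m                                   # likelihood term minus KL term
    grad_log_s = g * s * eps - (s * s - 1.0)         # chain rule through s = exp(log_s)
    return m + lr * grad_m, log_s + lr * grad_log_s

for t in range(T):
    contexts = rng.standard_normal((K, d))
    contexts /= np.linalg.norm(contexts, axis=1, keepdims=True)
    theta = m + np.exp(log_s) * rng.standard_normal(d)   # Thompson draw from q
    arm = int(np.argmax(contexts @ theta))
    reward = contexts[arm] @ theta_star + 0.5 * rng.standard_normal()
    X_hist.append(contexts[arm]); y_hist.append(reward)
    X, y = np.asarray(X_hist), np.asarray(y_hist)
    lr = 0.5 / (1.0 + len(y) / NOISE_VAR)                # keep steps stable as data grows
    for _ in range(5):                                   # a few ELBO steps per round
        m, log_s = elbo_step(m, log_s, X, y, lr)

print("estimation error:", np.linalg.norm(m - theta_star))
```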
arXiv Detail & Related papers (2023-07-19T17:53:22Z)
- Incentivizing Exploration with Linear Contexts and Combinatorial Actions [9.15749739027059]
In incentivized bandit exploration, arm choices are viewed as recommendations and are required to be Bayesian incentive compatible.
Recent work has shown under certain independence assumptions that after collecting enough initial samples, the popular Thompson sampling algorithm becomes incentive compatible.
We give an analog of this result for linear bandits, where the independence of the prior is replaced by a natural convexity condition.
arXiv Detail & Related papers (2023-06-03T03:30:42Z)
- Thompson Sampling with Virtual Helping Agents [0.0]
We address the problem of online sequential decision making, i.e., balancing the trade-off between exploiting current knowledge to maximize immediate performance and exploring new information to gain long-term benefits.
We propose two algorithms for the multi-armed bandit problem and provide theoretical bounds on the cumulative regret.
arXiv Detail & Related papers (2022-09-16T23:34:44Z)
- Langevin Monte Carlo for Contextual Bandits [72.00524614312002]
Langevin Monte Carlo Thompson Sampling (LMC-TS) is proposed to directly sample from the posterior distribution in contextual bandits.
We prove that the proposed algorithm achieves the same sublinear regret bound as the best Thompson sampling algorithms for a special case of contextual bandits.
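The mechanism can be sketched as follows: each round runs a few Langevin steps on the log posterior, and the resulting iterate stands in for an exact posterior sample. The step counts and step sizes below are illustrative assumptions, not the paper's settings:
```python
import numpy as np

rng = np.random.default_rng(2)
d, K, T = 8, 10, 500
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)

theta = np.zeros(d)                 # Langevin chain is warm-started across rounds
X_hist, y_hist = [], []
NOISE_VAR, PRIOR_VAR = 0.25, 1.0

def log_post_grad(theta, X, y):
    """Gradient of the Gaussian log posterior (standard normal prior)."""
    g = -theta / PRIOR_VAR
    if len(y):
        g = g + X.T @ (y - X @ theta) / NOISE_VAR
    return g

for t in range(T):
    contexts = rng.standard_normal((K, d))
    contexts /= np.linalg.norm(contexts, axis=1, keepdims=True)
    X, y = np.asarray(X_hist), np.asarray(y_hist)
    eta = 0.5 / (1.0 / PRIOR_VAR + len(y) / NOISE_VAR)   # step size shrinks with curvature
    for _ in range(20):                                  # a few Langevin steps per round
        noise = np.sqrt(2.0 * eta) * rng.standard_normal(d)
        theta = theta + eta * log_post_grad(theta, X, y) + noise
    arm = int(np.argmax(contexts @ theta))               # act greedily on the LMC sample
    reward = contexts[arm] @ theta_star + 0.5 * rng.standard_normal()
    X_hist.append(contexts[arm]); y_hist.append(reward)

print("final sample error:", np.linalg.norm(theta - theta_star))
```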
arXiv Detail & Related papers (2022-06-22T17:58:23Z)
- Algorithms for Adaptive Experiments that Trade-off Statistical Analysis with Reward: Combining Uniform Random Assignment and Reward Maximization [50.725191156128645]
Multi-armed bandit algorithms like Thompson Sampling can be used to conduct adaptive experiments.
We present simulations for 2-arm experiments that explore two algorithms combining the benefits of uniform randomization, which supports statistical analysis, with those of reward maximization.
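One simple instance of such a combination (a hypothetical mixture design, not necessarily the algorithms simulated in the paper) assigns a fixed fraction of participants uniformly at random and the rest by Beta-Bernoulli Thompson sampling:
```python
import numpy as np

rng = np.random.default_rng(4)
true_p = np.array([0.5, 0.6])           # Bernoulli reward probabilities per arm
a_post, b_post = np.ones(2), np.ones(2) # Beta(1, 1) priors
eps, T = 0.2, 2000                      # eps is an illustrative design parameter
uniform_rounds = []                     # rounds usable for unbiased analysis

for t in range(T):
    if rng.random() < eps:
        arm = int(rng.integers(2))                      # uniform randomization branch
        uniform_rounds.append(t)
    else:
        arm = int(np.argmax(rng.beta(a_post, b_post)))  # Thompson sampling branch
    reward = float(rng.random() < true_p[arm])
    a_post[arm] += reward
    b_post[arm] += 1.0 - reward

print("posterior means:", a_post / (a_post + b_post))
print("fraction of uniformly assigned rounds:", len(uniform_rounds) / T)
```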
arXiv Detail & Related papers (2021-12-15T22:11:58Z)
- Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z)
- Analysis and Design of Thompson Sampling for Stochastic Partial Monitoring [91.22679787578438]
We present a novel Thompson-sampling-based algorithm for partial monitoring.
We prove that the new algorithm achieves the logarithmic problem-dependent expected pseudo-regret $\mathrm{O}(\log T)$ for a linearized variant of the problem with local observability.
arXiv Detail & Related papers (2020-06-17T05:48:33Z)
- Ensemble Sampling [18.85309520133554]
This paper develops ensemble sampling, which aims to approximate Thompson sampling while maintaining tractability even in the face of complex models such as neural networks.
We establish a theoretical basis that supports the approach and present computational results that offer further insight.
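In a linear bandit, the idea can be sketched with perturbation-based ensemble members standing in for posterior samples: each round one member is drawn uniformly and its greedy action is played, and every member is updated with its own perturbed copy of the observed reward. The perturbation scales below are illustrative:
```python
import numpy as np

rng = np.random.default_rng(3)
d, K, T, M = 8, 10, 500, 16
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)
NOISE_VAR = 0.25

# Each member keeps regularized least-squares statistics; b starts from a
# random draw so that the members initially disagree (perturbed prior).
A = np.stack([np.eye(d) for _ in range(M)])   # (M, d, d) precision matrices
b = rng.standard_normal((M, d))               # perturbed regression targets

for t in range(T):
    contexts = rng.standard_normal((K, d))
    contexts /= np.linalg.norm(contexts, axis=1, keepdims=True)
    i = rng.integers(M)                        # pick one ensemble member...
    theta_i = np.linalg.solve(A[i], b[i])
    arm = int(np.argmax(contexts @ theta_i))   # ...and act greedily with it
    x = contexts[arm]
    reward = x @ theta_star + np.sqrt(NOISE_VAR) * rng.standard_normal()
    # Update every member with its own perturbed version of the reward.
    A += np.outer(x, x) / NOISE_VAR
    perturbed = reward + np.sqrt(NOISE_VAR) * rng.standard_normal(M)
    b += perturbed[:, None] * x[None, :] / NOISE_VAR

theta_hat = np.linalg.solve(A.mean(axis=0), b.mean(axis=0))
print("ensemble-mean error:", np.linalg.norm(theta_hat - theta_star))
```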
arXiv Detail & Related papers (2017-05-20T19:36:36Z)
- The Simulator: Understanding Adaptive Sampling in the Moderate-Confidence Regime [52.38455827779212]
We propose a novel technique for analyzing adaptive sampling called the Simulator.
We prove the first instance-based lower bounds for the top-k problem which incorporate the appropriate log-factors.
Our new analysis inspires a simple and near-optimal algorithm for best-arm and top-k identification, the first practical algorithm of its kind for the latter problem.
arXiv Detail & Related papers (2017-02-16T23:42:02Z)