On Thompson Sampling with Langevin Algorithms
- URL: http://arxiv.org/abs/2002.10002v2
- Date: Thu, 18 Jun 2020 02:02:30 GMT
- Title: On Thompson Sampling with Langevin Algorithms
- Authors: Eric Mazumdar, Aldo Pacchiano, Yi-an Ma, Peter L. Bartlett, Michael I. Jordan
- Abstract summary: Thompson sampling for multi-armed bandit problems enjoys favorable performance in both theory and practice.
It suffers from a significant limitation computationally, arising from the need for samples from posterior distributions at every iteration.
We propose two Markov Chain Monte Carlo (MCMC) methods tailored to Thompson sampling to address this issue.
- Score: 106.78254564840844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Thompson sampling for multi-armed bandit problems is known to enjoy favorable
performance in both theory and practice. However, it suffers from a significant
limitation computationally, arising from the need for samples from posterior
distributions at every iteration. We propose two Markov Chain Monte Carlo
(MCMC) methods tailored to Thompson sampling to address this issue. We
construct quickly converging Langevin algorithms to generate approximate
samples that have accuracy guarantees, and we leverage novel posterior
concentration rates to analyze the regret of the resulting approximate Thompson
sampling algorithm. Further, we specify the necessary hyperparameters for the
MCMC procedure to guarantee optimal instance-dependent frequentist regret while
having low computational complexity. In particular, our algorithms take
advantage of both posterior concentration and a sample reuse mechanism to
ensure that only a constant number of iterations and a constant amount of data
is needed in each round. The resulting approximate Thompson sampling algorithm
has logarithmic regret and its computational complexity does not scale with the
time horizon of the algorithm.
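A minimal sketch of the idea for a K-armed Gaussian bandit (not the authors' exact algorithm: it omits the data-subsampling and sample-reuse mechanism and uses illustrative step sizes): each round runs a constant number of unadjusted Langevin (ULA) steps on each arm's log-posterior, warm-started from the previous round's iterate, and the resulting approximate posterior sample drives the arm choice.
```python
import numpy as np

def grad_log_posterior(theta, rewards, prior_var, noise_var):
    """Gradient of the log-posterior: N(0, prior_var) prior, Gaussian likelihood."""
    return -theta / prior_var + np.sum(rewards - theta) / noise_var

def langevin_thompson(true_means, horizon=2000, n_steps=20, step_scale=0.5, seed=0):
    rng = np.random.default_rng(seed)
    K = len(true_means)
    prior_var, noise_var = 1.0, 1.0
    data = [[] for _ in range(K)]       # observed rewards per arm
    theta = np.zeros(K)                 # Langevin iterates, warm-started across rounds
    regret = 0.0
    for t in range(horizon):
        for k in range(K):
            obs = np.asarray(data[k])
            # Step size shrinks with the posterior precision so a constant
            # number of ULA steps per round stays stable as data accumulates.
            h = step_scale / (1.0 / prior_var + len(obs) / noise_var)
            for _ in range(n_steps):
                g = grad_log_posterior(theta[k], obs, prior_var, noise_var)
                theta[k] += h * g + np.sqrt(2.0 * h) * rng.normal()
        arm = int(np.argmax(theta))     # act on the approximate posterior sample
        data[arm].append(true_means[arm] + rng.normal())
        regret += np.max(true_means) - true_means[arm]
    return regret

print(langevin_thompson(np.array([0.1, 0.4, 0.9])))
```
The warm start is what keeps the per-round cost constant: since the posterior concentrates slowly between consecutive rounds, a handful of Langevin steps suffices to track it.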
Related papers
- Accelerating Approximate Thompson Sampling with Underdamped Langevin Monte Carlo [7.968641076961955]
We propose an approximate Thompson sampling strategy utilizing Langevin Monte Carlo.
Based on the standard smoothness and log-concavity conditions, we study the accelerated posterior concentration and sampling.
Our algorithm is empirically validated through synthetic experiments in high-dimensional bandit problems.
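For contrast with the overdamped updates above, a minimal sketch of one Euler-discretized underdamped Langevin step; the paper uses a sharper discretization, and gamma, u, and the step size here are illustrative assumptions.
```python
import numpy as np

def underdamped_langevin_step(x, v, grad_log_post, step=1e-2, gamma=2.0, u=1.0, rng=None):
    """One Euler step of underdamped Langevin dynamics on (position x, velocity v)."""
    rng = rng if rng is not None else np.random.default_rng()
    noise = np.sqrt(2.0 * gamma * u * step) * rng.standard_normal(np.shape(x))
    v = v - step * (gamma * v - u * grad_log_post(x)) + noise
    x = x + step * v
    return x, v
```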
arXiv Detail & Related papers (2024-01-22T02:54:58Z) - Making RL with Preference-based Feedback Efficient via Randomization [11.019088464664696]
Reinforcement Learning algorithms that learn from human feedback need to be efficient in terms of statistical complexity, computational complexity, and query complexity.
We present an algorithm that is sample efficient (i.e., has near-optimal worst-case regret bounds) and has polynomial running time (i.e., its computational complexity is polynomial in the relevant parameters).
To extend the results to more general nonlinear function approximation, we design a model-based randomized algorithm inspired by the idea of Thompson sampling.
arXiv Detail & Related papers (2023-10-23T04:19:35Z) - An Asymptotically Optimal Algorithm for the Convex Hull Membership Problem [21.312152185262]
We study the convex hull membership problem in the pure exploration setting.
We present the first asymptotically optimal algorithm, called Thompson-CHM, whose modular design consists of a stopping rule and a sampling rule.
arXiv Detail & Related papers (2023-02-03T23:41:53Z) - Langevin Monte Carlo for Contextual Bandits [72.00524614312002]
Langevin Monte Carlo Thompson Sampling (LMC-TS) is proposed to directly sample from the posterior distribution in contextual bandits.
We prove that the proposed algorithm achieves the same sublinear regret bound as the best Thompson sampling algorithms for a special case of contextual bandits.
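A minimal sketch of an LMC-TS-style arm selection for a linear contextual bandit (assumptions: linear rewards with Gaussian noise, illustrative step size and step count): instead of sampling exactly from the Gaussian ridge posterior, a few Langevin steps on its log-density produce the parameter used to score the candidate contexts.
```python
import numpy as np

def lmc_ts_select(contexts, X, y, theta, n_steps=30, step=1e-3, lam=1.0, rng=None):
    """contexts: (K, d) arm features; (X, y): past observations; theta: warm start."""
    rng = rng if rng is not None else np.random.default_rng()
    for _ in range(n_steps):
        grad = X.T @ (y - X @ theta) - lam * theta           # grad of the ridge log-posterior
        theta = theta + step * grad + np.sqrt(2.0 * step) * rng.standard_normal(theta.shape)
    return int(np.argmax(contexts @ theta)), theta           # Thompson choice on the sample
```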
arXiv Detail & Related papers (2022-06-22T17:58:23Z) - Machine Learning for Online Algorithm Selection under Censored Feedback [71.6879432974126]
In online algorithm selection (OAS), instances of an algorithmic problem class are presented to an agent one after another, and the agent has to quickly select a presumably best algorithm from a fixed set of candidate algorithms.
For decision problems such as satisfiability (SAT), quality typically refers to the algorithm's runtime.
In this work, we revisit multi-armed bandit algorithms for OAS and discuss their capability of dealing with the problem.
We adapt them towards runtime-oriented losses, allowing for partially censored data while keeping a space- and time-complexity independent of the time horizon.
arXiv Detail & Related papers (2021-09-13T18:10:52Z) - Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Analysis and Design of Thompson Sampling for Stochastic Partial Monitoring [91.22679787578438]
We present a novel Thompson-sampling-based algorithm for partial monitoring.
We prove that the new algorithm achieves the logarithmic problem-dependent expected pseudo-regret $\mathrm{O}(\log T)$ for a linearized variant of the problem with local observability.
arXiv Detail & Related papers (2020-06-17T05:48:33Z) - MOTS: Minimax Optimal Thompson Sampling [89.2370817955411]
It has remained an open problem whether Thompson sampling can match the minimax lower bound $\Omega(\sqrt{KT})$ for $K$-armed bandit problems.
We propose a variant of Thompson sampling called MOTS that adaptively clips the sampling instance of the chosen arm at each time step.
We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound $O(\sqrt{KT})$ for finite time horizon $T$, as well as the optimal regret bound for Gaussian rewards when $T$ approaches infinity.
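A minimal sketch of the clipping idea for one arm (the threshold is an illustrative confidence-bound form; the paper's exact constants may differ): the Gaussian Thompson sample is truncated from above, which removes the over-exploration responsible for the extra logarithmic factor in vanilla Thompson sampling.
```python
import numpy as np

def mots_clipped_sample(mean_hat, n_pulls, horizon, K, rho=0.5, alpha=4.0, rng=None):
    """Gaussian Thompson sample for one arm, truncated at a confidence-style threshold."""
    rng = rng if rng is not None else np.random.default_rng()
    sample = rng.normal(mean_hat, np.sqrt(1.0 / (rho * n_pulls)))
    tau = mean_hat + np.sqrt((alpha / n_pulls) * max(0.0, np.log(horizon / (K * n_pulls))))
    return min(sample, tau)
```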
arXiv Detail & Related papers (2020-03-03T21:24:39Z)