Langevin Thompson Sampling with Logarithmic Communication: Bandits and
Reinforcement Learning
- URL: http://arxiv.org/abs/2306.08803v1
- Date: Thu, 15 Jun 2023 01:16:29 GMT
- Title: Langevin Thompson Sampling with Logarithmic Communication: Bandits and
Reinforcement Learning
- Authors: Amin Karbasi, Nikki Lijing Kuang, Yi-An Ma, Siddharth Mitra
- Abstract summary: Thompson sampling (TS) is widely used in sequential decision making due to its ease of use and appealing empirical performance.
We propose batched $\textit{Langevin Thompson Sampling}$ algorithms that leverage MCMC methods to sample from approximate posteriors with only logarithmic communication costs in terms of batches.
Our algorithms are computationally efficient and maintain the same order-optimal regret guarantees of $\mathcal{O}(\log T)$ for MABs, and $\mathcal{O}(\sqrt{T})$ for RL.
- Score: 34.4255062106615
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Thompson sampling (TS) is widely used in sequential decision making due to
its ease of use and appealing empirical performance. However, many existing
analytical and empirical results for TS rely on restrictive assumptions on
reward distributions, such as belonging to conjugate families, which limits
their applicability in realistic scenarios. Moreover, sequential decision
making problems are often carried out in a batched manner, either due to the
inherent nature of the problem or to serve the purpose of reducing
communication and computation costs. In this work, we jointly study these
problems in two popular settings, namely, stochastic multi-armed bandits (MABs)
and infinite-horizon reinforcement learning (RL), where TS is used to learn the
unknown reward distributions and transition dynamics, respectively. We propose
batched $\textit{Langevin Thompson Sampling}$ algorithms that leverage MCMC
methods to sample from approximate posteriors with only logarithmic
communication costs in terms of batches. Our algorithms are computationally
efficient and maintain the same order-optimal regret guarantees of
$\mathcal{O}(\log T)$ for stochastic MABs, and $\mathcal{O}(\sqrt{T})$ for RL.
We complement our theoretical findings with experimental results.
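To make the mechanics concrete, the following is a minimal sketch, not the paper's algorithm: a Gaussian multi-armed bandit where each arm's posterior sample is drawn with an unadjusted Langevin chain instead of a conjugate closed form, and where data are synchronized only at doubling batch boundaries, so the number of communication rounds grows as $\mathcal{O}(\log T)$. All problem parameters (`K`, `T`, `noise_std`, `prior_var`, step sizes) are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance (illustrative, not from the paper): K-armed Gaussian bandit.
K, T = 5, 10_000
true_means = rng.normal(0.0, 1.0, size=K)   # unknown arm means
noise_std = 1.0                              # known reward noise
prior_var = 1.0                              # N(0, prior_var) prior on each mean


def langevin_posterior_sample(rewards, n_steps=50):
    """One approximate posterior sample of an arm's mean via the unadjusted
    Langevin algorithm (ULA) on the Gaussian log-posterior.
    Generic ULA sketch, not the paper's exact update rule."""
    n, s = len(rewards), float(sum(rewards))
    precision = n / noise_std**2 + 1.0 / prior_var   # posterior precision (conjugate Gaussian case)
    step = 0.5 / precision                           # step below 2/precision keeps ULA stable here
    theta = s / max(n, 1)                            # warm start at the empirical mean
    for _ in range(n_steps):
        grad = (s - n * theta) / noise_std**2 - theta / prior_var  # d/dtheta of log posterior
        theta += step * grad + np.sqrt(2.0 * step) * rng.normal()
    return theta


# Batched interaction: batch lengths double, so only O(log T) communication rounds.
data = [[] for _ in range(K)]
t, batch_len, n_batches = 0, 1, 0
while t < T:
    # Posterior samples are refreshed only at batch boundaries (the "communication" step).
    samples = np.array([langevin_posterior_sample(data[k]) for k in range(K)])
    n_batches += 1
    for _ in range(min(batch_len, T - t)):
        arm = int(np.argmax(samples))                         # Thompson-style choice w.r.t. samples
        reward = true_means[arm] + noise_std * rng.normal()   # environment feedback
        data[arm].append(reward)
        t += 1
    batch_len *= 2

print(f"rounds: {t}, communication rounds: {n_batches}, log2(T) = {np.log2(T):.1f}")
```

With this doubling schedule a horizon of T = 10,000 is covered in 14 batches, which illustrates the logarithmic-communication behavior described in the abstract; the actual algorithms also come with the regret guarantees above, which this toy loop does not attempt to reproduce.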
Related papers
- VITS : Variational Inference Thompson Sampling for contextual bandits [10.028119153832346]
We introduce and analyze a variant of the Thompson sampling (TS) algorithm for contextual bandits.
We propose a new algorithm, Variational Inference Thompson Sampling (VITS), based on Gaussian variational inference.
We show that VITS achieves a sub-linear regret bound of the same order in the dimension and number of rounds as traditional TS for linear contextual bandits.
arXiv Detail & Related papers (2023-07-19T17:53:22Z) - Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL).
We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo.
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
arXiv Detail & Related papers (2023-05-29T17:11:28Z) - Two-sided Competing Matching Recommendation Markets With Quota and Complementary Preferences Constraints [13.069703665055084]
We propose a new recommendation algorithm for addressing the problem of two-sided online matching markets with complementary preferences and quota constraints.
The presence of mixed quota and complementary preferences constraints can lead to instability in the matching process.
We formulate the problem as a bandit learning framework and propose the Multi-agent Multi-type Thompson Sampling algorithm.
arXiv Detail & Related papers (2023-01-24T18:54:29Z) - Langevin Monte Carlo for Contextual Bandits [72.00524614312002]
Langevin Monte Carlo Thompson Sampling (LMC-TS) is proposed to directly sample from the posterior distribution in contextual bandits.
We prove that the proposed algorithm achieves the same sublinear regret bound as the best Thompson sampling algorithms for a special case of contextual bandits.
arXiv Detail & Related papers (2022-06-22T17:58:23Z) - Sample-Efficient Reinforcement Learning of Undercomplete POMDPs [91.40308354344505]
This work shows that these hardness barriers do not preclude efficient reinforcement learning for rich and interesting subclasses of Partially Observable Markov Decision Processes (POMDPs).
We present a sample-efficient algorithm, OOM-UCB, for episodic finite undercomplete POMDPs, where the number of observations is larger than the number of latent states and where exploration is essential for learning, thus distinguishing our results from prior works.
arXiv Detail & Related papers (2020-06-22T17:58:54Z) - Analysis and Design of Thompson Sampling for Stochastic Partial
Monitoring [91.22679787578438]
We present a novel Thompson-sampling-based algorithm for partial monitoring.
We prove that the new algorithm achieves the logarithmic problem-dependent expected pseudo-regret $\mathrm{O}(\log T)$ for a linearized variant of the problem with local observability.
arXiv Detail & Related papers (2020-06-17T05:48:33Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z) - On Thompson Sampling with Langevin Algorithms [106.78254564840844]
Thompson sampling for multi-armed bandit problems enjoys favorable performance in both theory and practice.
However, it suffers from a significant computational limitation, arising from the need to draw samples from posterior distributions at every iteration.
We propose two Markov Chain Monte Carlo (MCMC) methods tailored to Thompson sampling to address this issue.
arXiv Detail & Related papers (2020-02-23T22:35:29Z) - A General Theory of the Stochastic Linear Bandit and Its Applications [8.071506311915398]
We introduce a general analysis framework and a family of algorithms for the linear bandit problem.
Our new notion of optimism in expectation gives rise to a new algorithm, called sieved greedy (SG), that reduces the over-exploration problem in OFUL.
In addition to proving that SG is theoretically rate optimal, our empirical simulations show that SG outperforms existing benchmarks such as greedy, OFUL, and TS.
arXiv Detail & Related papers (2020-02-12T18:54:41Z)