Finite-Time Frequentist Regret Bounds of Multi-Agent Thompson Sampling
on Sparse Hypergraphs
- URL: http://arxiv.org/abs/2312.15549v1
- Date: Sun, 24 Dec 2023 21:41:01 GMT
- Title: Finite-Time Frequentist Regret Bounds of Multi-Agent Thompson Sampling
on Sparse Hypergraphs
- Authors: Tianyuan Jin, Hao-Lun Hsu, William Chang, Pan Xu
- Abstract summary: We study the multi-agent multi-armed bandit (MAMAB) problem, where $m$ agents are factored into $\rho$ overlapping groups.
We propose an efficient variant of MATS, the $\epsilon$-exploring Multi-Agent Thompson Sampling ($\epsilon$-MATS) algorithm.
We prove that $epsilon$-MATS achieves a worst-case frequentist regret bound that is sublinear in both the time horizon and the local arm size.
- Score: 11.024467775280193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the multi-agent multi-armed bandit (MAMAB) problem, where $m$ agents
are factored into $\rho$ overlapping groups. Each group represents a hyperedge,
forming a hypergraph over the agents. At each round of interaction, the learner
pulls a joint arm (composed of individual arms for each agent) and receives a
reward according to the hypergraph structure. Specifically, we assume there is
a local reward for each hyperedge, and the reward of the joint arm is the sum
of these local rewards. Previous work introduced the multi-agent Thompson
sampling (MATS) algorithm \citep{verstraeten2020multiagent} and derived a
Bayesian regret bound. However, it remains an open problem how to derive a
frequentist regret bound for Thompson sampling in this multi-agent setting. To
address this issue, we propose an efficient variant of MATS, the
$\epsilon$-exploring Multi-Agent Thompson Sampling ($\epsilon$-MATS) algorithm,
which performs MATS exploration with probability $\epsilon$ while adopting a
greedy policy otherwise. We prove that $\epsilon$-MATS achieves a worst-case
frequentist regret bound that is sublinear in both the time horizon and the
local arm size. We also derive a lower bound for this setting, which implies
our frequentist regret upper bound is optimal up to constant and logarithmic
terms, when the hypergraph is sufficiently sparse. Thorough experiments on
standard MAMAB problems demonstrate the superior performance and the improved
computational efficiency of $\epsilon$-MATS compared with existing algorithms
in the same setting.
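To make the $\epsilon$-MATS recipe concrete, the sketch below instantiates the abstract's description on a toy factored bandit: Gaussian posteriors are kept per (hyperedge, local joint arm), posterior samples are drawn with probability $\epsilon$ and posterior means are used otherwise, and the joint arm maximizing the summed local estimates is pulled. This is an illustrative assumption, not the authors' implementation; MATS coordinates the argmax via the hypergraph structure (e.g., variable elimination), which is replaced here by brute-force enumeration only because the toy problem is tiny.

```python
# Illustrative epsilon-MATS sketch on a toy factored bandit (hypothetical setup).
import itertools
import numpy as np

rng = np.random.default_rng(0)

arms_per_agent = [2, 3, 2]     # 3 agents with local arm sizes 2, 3, 2
hyperedges = [(0, 1), (1, 2)]  # rho = 2 overlapping groups
epsilon = 0.1                  # probability of Thompson-sampling exploration
T = 5000

# Unknown local mean rewards, one table per hyperedge (toy ground truth).
true_means = [rng.uniform(0, 1, size=tuple(arms_per_agent[i] for i in e))
              for e in hyperedges]

# Gaussian posterior statistics per (hyperedge, local joint arm).
counts = [np.zeros_like(mu) for mu in true_means]
sums = [np.zeros_like(mu) for mu in true_means]

def local_estimates(sample):
    """Posterior samples (exploration) or posterior means (greedy)."""
    ests = []
    for n, s in zip(counts, sums):
        mean = s / np.maximum(n, 1)
        if sample:
            ests.append(rng.normal(mean, 1.0 / np.sqrt(np.maximum(n, 1))))
        else:
            ests.append(mean)
    return ests

for t in range(T):
    ests = local_estimates(sample=rng.random() < epsilon)
    # Pull the joint arm maximizing the sum of estimated local rewards.
    best_joint, best_val = None, -np.inf
    for joint in itertools.product(*(range(k) for k in arms_per_agent)):
        val = sum(est[tuple(joint[i] for i in e)]
                  for est, e in zip(ests, hyperedges))
        if val > best_val:
            best_joint, best_val = joint, val
    # Observe one noisy local reward per hyperedge and update its posterior.
    for j, e in enumerate(hyperedges):
        idx = tuple(best_joint[i] for i in e)
        r = true_means[j][idx] + 0.1 * rng.standard_normal()
        counts[j][idx] += 1
        sums[j][idx] += r
```

On this toy problem, setting `epsilon = 1.0` recovers plain MATS-style posterior sampling every round, while smaller values interpolate toward the greedy policy.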
Related papers
- Langevin Thompson Sampling with Logarithmic Communication: Bandits and
Reinforcement Learning [34.4255062106615]
Thompson sampling (TS) is widely used in sequential decision making due to its ease of use and appealing empirical performance.
We propose batched Langevin Thompson Sampling algorithms that leverage MCMC methods to sample from approximate posteriors with only logarithmic communication costs in terms of batches.
Our algorithms are computationally efficient and maintain the same order-optimal regret guarantees of $\mathcal{O}(\log T)$ for MABs, and $\mathcal{O}(\sqrt{T})$ for RL.
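A minimal sketch of the batching idea, under illustrative assumptions (Gaussian rewards, a flat prior, a doubling batch schedule, and an unadjusted Langevin sampler): reward statistics are communicated to the sampler only at geometrically spaced batch boundaries, so the number of communication rounds grows logarithmically in the horizon. This is not the paper's exact algorithm.

```python
# Hedged sketch: batched Thompson sampling with a Langevin posterior sampler.
import numpy as np

rng = np.random.default_rng(1)
K, T = 5, 10_000
true_means = rng.uniform(0, 1, K)

counts, sums = np.zeros(K), np.zeros(K)          # data gathered so far
synced_counts, synced_sums = np.zeros(K), np.zeros(K)  # data seen by the sampler
samples = rng.standard_normal(K)
batch_end, n_syncs = 1, 0
eta = 0.5                                        # base Langevin step size

for t in range(1, T + 1):
    if t > batch_end:              # batch boundary: sync data to the sampler
        batch_end *= 2             # doubling schedule -> O(log T) syncs
        n_syncs += 1
        synced_counts, synced_sums = counts.copy(), sums.copy()
    # A few unadjusted Langevin steps targeting N(s/n, 1/n) per arm,
    # using only the data available at the last sync; the step size is
    # scaled by 1/n to keep the update stable for every arm.
    for _ in range(10):
        step = eta / np.maximum(synced_counts, 1.0)
        grad = -(synced_counts * samples - synced_sums)
        samples = samples + 0.5 * step * grad + np.sqrt(step) * rng.standard_normal(K)
    a = int(np.argmax(samples))    # Thompson-sampling arm choice
    r = true_means[a] + rng.standard_normal()
    counts[a] += 1
    sums[a] += r

print(f"data syncs (communication rounds): {n_syncs}, horizon: {T}")
```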
arXiv Detail & Related papers (2023-06-15T01:16:29Z) - Fair Multi-Agent Bandits [14.614647884175657]
We provide an algorithm with regret $O\left(N^3 \log \frac{B}{\Delta} f(\log T) \log T\right)$, where $f(t)$ is any function diverging to infinity with $t$.
This significantly improves previous results which had the same upper bound on the regret of order $O(f(\log T) \log T)$ but an exponential dependence on the number of agents.
arXiv Detail & Related papers (2023-06-07T15:05:53Z) - Finite-Time Regret of Thompson Sampling Algorithms for Exponential
Family Multi-Armed Bandits [88.21288104408556]
We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution is from a one-dimensional exponential family.
We propose a Thompson sampling algorithm, termed Expulli, which uses a novel sampling distribution to avoid the under-estimation of the optimal arm.
arXiv Detail & Related papers (2022-06-07T18:08:21Z) - A Unifying Theory of Thompson Sampling for Continuous Risk-Averse
Bandits [91.3755431537592]
This paper unifies the analysis of risk-averse Thompson sampling algorithms for the multi-armed bandit problem.
Using the contraction principle in the theory of large deviations, we prove novel concentration bounds for continuous risk functionals.
We show that a wide class of risk functionals as well as "nice" functions of them satisfy the continuity condition.
arXiv Detail & Related papers (2021-08-25T17:09:01Z) - Lenient Regret for Multi-Armed Bandits [72.56064196252498]
We consider the Multi-Armed Bandit (MAB) problem, where an agent sequentially chooses actions and observes rewards for the actions it took.
While the majority of algorithms try to minimize the regret, i.e., the cumulative difference between the reward of the best action and that of the agent's action, this criterion might lead to undesirable results.
We suggest a new, more lenient, regret criterion that ignores suboptimality gaps smaller than some $\epsilon$.
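One natural way to formalize such a criterion (an illustrative form; the paper's definition may differ in details) is to charge the learner only for gaps exceeding $\epsilon$:

```latex
% Illustrative epsilon-lenient regret: only gaps larger than epsilon are charged.
\[
  \mathrm{Regret}_{\epsilon}(T)
  \;=\; \sum_{t=1}^{T} \Delta_{a_t}\, \mathbb{1}\{\Delta_{a_t} > \epsilon\},
  \qquad
  \Delta_{a} \;=\; \mu^{\star} - \mu_{a},
\]
% where a_t is the arm pulled at round t, \mu_a is the mean reward of arm a,
% and \mu^{\star} is the largest mean reward.
```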
arXiv Detail & Related papers (2020-08-10T08:30:52Z) - An Efficient Algorithm For Generalized Linear Bandit: Online Stochastic
Gradient Descent and Thompson Sampling [83.48992319018147]
We consider the contextual bandit problem, where a player sequentially makes decisions based on past observations to maximize the cumulative reward.
A natural way to control the computational cost is to apply online stochastic gradient descent (SGD) so that the per-step time and memory complexity can be reduced to constant.
In this work, we show that online SGD can be applied to the generalized linear bandit problem.
The proposed SGD-TS algorithm, which uses a single-step SGD update to exploit past information, achieves $\tilde{O}(\sqrt{T})$ regret.
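A hedged sketch of this idea for a logistic (generalized linear) bandit: the parameter estimate is refreshed with one stochastic-gradient step per round, and exploration comes from a Gaussian perturbation of that estimate, in the spirit of Thompson sampling. The step sizes, perturbation scale, and toy problem below are assumptions for illustration, not the paper's exact SGD-TS.

```python
# Illustrative single-step-SGD + perturbation sketch for a logistic bandit.
import numpy as np

rng = np.random.default_rng(2)
d, K, T = 5, 10, 2000
theta_star = rng.standard_normal(d) / np.sqrt(d)   # unknown parameter

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.zeros(d)   # online SGD iterate
for t in range(1, T + 1):
    X = rng.standard_normal((K, d)) / np.sqrt(d)    # this round's contexts
    # Exploration in the spirit of Thompson sampling: perturb the iterate.
    theta_tilde = theta + rng.standard_normal(d) / np.sqrt(t)
    a = int(np.argmax(X @ theta_tilde))
    r = rng.binomial(1, sigmoid(X[a] @ theta_star))  # Bernoulli (logistic) reward
    # Single-step SGD update on the logistic negative log-likelihood,
    # so per-round time and memory stay O(d) regardless of t.
    grad = (sigmoid(X[a] @ theta) - r) * X[a]
    theta = theta - (0.5 / np.sqrt(t)) * grad
```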
arXiv Detail & Related papers (2020-06-07T01:12:39Z) - MOTS: Minimax Optimal Thompson Sampling [89.2370817955411]
It has remained an open problem whether Thompson sampling can match the minimax lower bound $\Omega(\sqrt{KT})$ for $K$-armed bandit problems.
We propose a variant of Thompson sampling called MOTS that adaptively clips the sampling instance of the chosen arm at each time step.
We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound $O(\sqrt{KT})$ for finite time horizon $T$, as well as the optimal regret bound for Gaussian rewards when $T$ approaches infinity.
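A rough sketch of the clipping idea (with assumed constants and a Gaussian reward model, not the paper's exact MOTS): Gaussian posterior samples are truncated at a confidence-radius cap before the argmax, which limits over-optimistic samples for heavily pulled arms.

```python
# Illustrative clipped Gaussian Thompson sampling sketch.
import numpy as np

rng = np.random.default_rng(3)
K, T, alpha = 10, 20_000, 4.0
true_means = rng.uniform(0, 1, K)

counts, sums = np.zeros(K), np.zeros(K)

for t in range(1, T + 1):
    if t <= K:                      # pull each arm once to initialize
        a = t - 1
    else:
        mean = sums / counts
        sample = rng.normal(mean, 1.0 / np.sqrt(counts))   # Gaussian TS sample
        # Clip each sample at a confidence-radius cap so that a heavily
        # pulled arm cannot draw an excessively optimistic sample.
        radius = np.sqrt(alpha / counts * np.log(np.maximum(T / (K * counts), 1.0)))
        a = int(np.argmax(np.minimum(sample, mean + radius)))
    r = true_means[a] + rng.standard_normal()
    counts[a] += 1
    sums[a] += r
```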
arXiv Detail & Related papers (2020-03-03T21:24:39Z) - The Price of Incentivizing Exploration: A Characterization via Thompson
Sampling and Sample Complexity [83.81297078039836]
We consider incentivized exploration: a version of multi-armed bandits where the choice of arms is controlled by self-interested agents.
We focus on the price of incentives: the loss in performance, broadly construed, incurred for the sake of incentive-compatibility.
arXiv Detail & Related papers (2020-02-03T04:58:51Z)