On Regret-optimal Cooperative Nonstochastic Multi-armed Bandits
- URL: http://arxiv.org/abs/2211.17154v3
- Date: Sat, 21 Oct 2023 14:49:36 GMT
- Title: On Regret-optimal Cooperative Nonstochastic Multi-armed Bandits
- Authors: Jialin Yi and Milan Vojnović
- Abstract summary: We show that a collaborative multi-agent \emph{follow-the-regularized-leader} (FTRL) algorithm has an individual regret upper bound that matches the lower bound up to a constant factor.
We also show that an FTRL algorithm with a suitable regularizer is regret optimal with respect to the scaling with the edge-delay parameter.
- Score: 7.23389716633927
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We consider the nonstochastic multi-agent multi-armed bandit problem with
agents collaborating via a communication network with delays. We show a lower
bound for individual regret of all agents. We show that with suitable
regularizers and communication protocols, a collaborative multi-agent
\emph{follow-the-regularized-leader} (FTRL) algorithm has an individual regret
upper bound that matches the lower bound up to a constant factor when the
number of arms is large enough relative to degrees of agents in the
communication graph. We also show that an FTRL algorithm with a suitable
regularizer is regret optimal with respect to the scaling with the edge-delay
parameter. We present numerical experiments validating our theoretical results
and demonstrate cases when our algorithms outperform previously proposed
algorithms.
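For intuition, below is a minimal sketch of the kind of cooperative FTRL agent the abstract describes. It assumes the negative-entropy regularizer (i.e., exponential weights), importance-weighted loss estimates, and a simple averaging rule for delayed neighbour estimates; the class and method names are our own, and this is not the paper's exact algorithm or its regret-optimal regularizer.

```python
import numpy as np

class CoopFTRLAgent:
    """One agent in a cooperative nonstochastic bandit (illustrative sketch)."""

    def __init__(self, n_arms, learning_rate):
        self.K = n_arms
        self.eta = learning_rate
        self.cum_loss_est = np.zeros(n_arms)      # cumulative loss estimates seen so far
        self.probs = np.full(n_arms, 1.0 / n_arms)

    def play(self, rng):
        # FTRL with the negative-entropy regularizer reduces to exponential
        # weights over the cumulative loss estimates.
        logits = -self.eta * self.cum_loss_est
        logits -= logits.max()                    # numerical stability
        weights = np.exp(logits)
        self.probs = weights / weights.sum()
        return rng.choice(self.K, p=self.probs)

    def own_estimate(self, arm, loss):
        # Importance-weighted estimate of the full loss vector from the
        # single observed bandit loss (assumed to lie in [0, 1]).
        est = np.zeros(self.K)
        est[arm] = loss / self.probs[arm]
        return est

    def absorb(self, estimates):
        # Fold in the agent's own estimate together with the (delayed)
        # estimates received from neighbours; plain averaging is an
        # assumption made here for illustration.
        self.cum_loss_est += np.mean(estimates, axis=0)
```

In a network, each agent would call play and own_estimate every round, broadcast its estimate to its graph neighbours, and call absorb once the neighbours' estimates arrive after the edge delay. The paper's regret-optimal variants additionally tune the regularizer and communication protocol to match the lower bound, which this sketch does not attempt.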
Related papers
- DASA: Delay-Adaptive Multi-Agent Stochastic Approximation [64.32538247395627]
We consider a setting in which $N$ agents aim to speed up a common Stochastic Approximation problem by acting in parallel and communicating with a central server.
To mitigate the effect of delays and stragglers, we propose DASA, a Delay-Adaptive algorithm for multi-agent Stochastic Approximation.
arXiv Detail & Related papers (2024-03-25T22:49:56Z) - Multi-Armed Bandits with Abstention [62.749500564313834]
We introduce a novel extension of the canonical multi-armed bandit problem that incorporates an additional strategic element: abstention.
In this enhanced framework, the agent is not only tasked with selecting an arm at each time step, but also has the option to abstain from accepting the instantaneous reward before observing it.
arXiv Detail & Related papers (2024-02-23T06:27:12Z) - Pure Exploration under Mediators' Feedback [63.56002444692792]
Multi-armed bandits are a sequential decision-making framework where, at each interaction step, the learner selects an arm and observes a reward.
We consider the scenario in which the learner has access to a set of mediators, each of which selects the arms on the agent's behalf according to a specific and possibly unknown policy.
We propose a sequential decision-making strategy for discovering the best arm under the assumption that the mediators' policies are known to the learner.
arXiv Detail & Related papers (2023-08-29T18:18:21Z) - On the Complexity of Multi-Agent Decision Making: From Learning in Games
to Partial Monitoring [105.13668993076801]
A central problem in the theory of multi-agent reinforcement learning (MARL) is to understand what structural conditions and algorithmic principles lead to sample-efficient learning guarantees.
We study this question in a general framework for interactive decision making with multiple agents.
We show that characterizing the statistical complexity for multi-agent decision making is equivalent to characterizing the statistical complexity of single-agent decision making.
arXiv Detail & Related papers (2023-05-01T06:46:22Z) - On-Demand Communication for Asynchronous Multi-Agent Bandits [43.3383282058292]
ODC is an on-demand communication protocol that tailors the communication of each pair of agents based on their empirical pull times.
ODC is efficient when the pull times of agents are highly heterogeneous, and its communication complexity depends on the empirical pull times of agents.
arXiv Detail & Related papers (2023-02-15T03:32:33Z) - Optimal Clustering with Bandit Feedback [57.672609011609886]
This paper considers the problem of online clustering with bandit feedback.
It includes a novel stopping rule for sequential testing that circumvents the need to solve any NP-hard weighted clustering problem as a subroutine.
We show through extensive simulations on synthetic and real-world datasets that BOC's performance matches the lower bound asymptotically, and significantly outperforms a non-adaptive baseline algorithm.
arXiv Detail & Related papers (2022-02-09T06:05:05Z) - Asymptotic Optimality for Decentralised Bandits [7.057937612386993]
We consider a large number of agents collaborating on a bandit problem with a large number of arms.
The goal is to minimise the regret of each agent in a communication-constrained setting.
arXiv Detail & Related papers (2021-09-20T11:10:10Z) - Distributed Online Learning for Joint Regret with Communication
Constraints [17.080853582489073]
We consider a distributed online learning setting for joint regret with communication constraints.
A subset of all the agents may then communicate a $b$-bit message to their neighbors in a graph.
We exploit the comparator-adaptive property of our algorithm to learn the best partition from a set of candidate partitions.
arXiv Detail & Related papers (2021-02-15T12:48:33Z) - Kernel Methods for Cooperative Multi-Agent Contextual Bandits [15.609414012418043]
Cooperative multi-agent decision making involves a group of agents cooperatively solving learning problems while communicating over a network with delays.
We consider the kernelised contextual bandit problem, where the reward obtained by an agent is an arbitrary linear function of the contexts' images in the related reproducing kernel Hilbert space (RKHS).
We propose Coop-KernelUCB, an algorithm that provides near-optimal bounds on the per-agent regret.
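For concreteness, the linear-in-RKHS reward model referenced above can be written (in our own assumed notation) as $f(x) = \langle \theta^{*}, \varphi(x) \rangle_{\mathcal{H}}$ for some unknown $\theta^{*} \in \mathcal{H}$, where $\varphi$ is the feature map of the kernel and $\mathcal{H}$ its RKHS.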
arXiv Detail & Related papers (2020-08-14T07:37:44Z) - Lenient Regret for Multi-Armed Bandits [72.56064196252498]
We consider the Multi-Armed Bandit (MAB) problem, where an agent sequentially chooses actions and observes rewards for the actions it took.
While the majority of algorithms try to minimize the regret, i.e., the cumulative difference between the reward of the best action and the reward of the agent's action, this criterion might lead to undesirable results.
We suggest a new, more lenient, regret criterion that ignores suboptimality gaps smaller than some $\epsilon$.
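As a rough illustration (our assumed formalization, not necessarily the paper's exact definition), such a criterion can be written as $\sum_{t=1}^{T} \Phi_\epsilon(\Delta_{a_t})$, where $\Delta_{a_t}$ is the suboptimality gap of the arm played at round $t$ and the gap function satisfies $\Phi_\epsilon(\Delta) = 0$ whenever $\Delta \le \epsilon$; the standard pseudo-regret corresponds to $\Phi_\epsilon(\Delta) = \Delta$.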
arXiv Detail & Related papers (2020-08-10T08:30:52Z) - The Gossiping Insert-Eliminate Algorithm for Multi-Agent Bandits [20.259428328004738]
We consider a decentralized multi-agent Multi-Armed Bandit (MAB) setup consisting of $N$ agents.
In our model, agents collaborate by exchanging messages through pairwise gossip style communications on an arbitrary connected graph.
arXiv Detail & Related papers (2020-01-15T17:49:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.