Robust Multi-Agent Bandits Over Undirected Graphs
- URL: http://arxiv.org/abs/2203.00076v1
- Date: Mon, 28 Feb 2022 20:21:55 GMT
- Title: Robust Multi-Agent Bandits Over Undirected Graphs
- Authors: Daniel Vial, Sanjay Shakkottai, R. Srikant
- Abstract summary: We consider a multi-agent multi-armed bandit setting in which honest agents collaborate over a network to minimize regret.
We show the situation is murkier beyond the case of a complete graph.
We propose a new algorithm for which the $i$-th agent has regret $O((d_{\text{mal}}(i) + K/n)\log(T)/\Delta)$ on any connected and undirected graph.
- Score: 26.26185074977412
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider a multi-agent multi-armed bandit setting in which $n$ honest
agents collaborate over a network to minimize regret but $m$ malicious agents
can disrupt learning arbitrarily. Assuming the network is the complete graph,
existing algorithms incur $O( (m + K/n) \log (T) / \Delta )$ regret in this
setting, where $K$ is the number of arms and $\Delta$ is the arm gap. For $m
\ll K$, this improves over the single-agent baseline regret of
$O(K\log(T)/\Delta)$.
In this work, we show the situation is murkier beyond the case of a complete
graph. In particular, we prove that if the state-of-the-art algorithm is used
on the undirected line graph, honest agents can suffer (nearly) linear regret
until time is doubly exponential in $K$ and $n$. In light of this negative
result, we propose a new algorithm for which the $i$-th agent has regret $O( (
d_{\text{mal}}(i) + K/n) \log(T)/\Delta)$ on any connected and undirected
graph, where $d_{\text{mal}}(i)$ is the number of $i$'s neighbors who are
malicious. Thus, we generalize existing regret bounds beyond the complete graph
(where $d_{\text{mal}}(i) = m$), and show the effect of malicious agents is
entirely local (in the sense that only the $d_{\text{mal}}(i)$ malicious agents
directly connected to $i$ affect its long-term regret).
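To make the setting concrete, below is a minimal toy simulation of the collaborative structure described above: $n$ honest agents on an undirected line graph, each running UCB1 over an active set of roughly $K/n$ arms and periodically adopting a neighbor's most-played arm, with one agent also receiving arbitrary bad recommendations from a malicious neighbor. This is an illustrative sketch, not the authors' algorithm; the gossip period, graph size, and malicious behavior are all assumptions made for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance: K Bernoulli arms, n honest agents on an undirected line graph.
K, T, n = 50, 20000, 5
means = rng.uniform(0.0, 1.0, K)
best = int(means.argmax())
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

# Each honest agent starts with roughly a K/n share of the arms.
active = [set(s) for s in np.array_split(rng.permutation(K), n)]
counts = [np.zeros(K) for _ in range(n)]
sums = [np.zeros(K) for _ in range(n)]
regret = [0.0] * n

for t in range(1, T + 1):
    for i in range(n):
        arms = list(active[i])
        untried = [a for a in arms if counts[i][a] == 0]
        if untried:                      # play each new arm once
            a = untried[0]
        else:                            # UCB1 index over the active set
            a = max(arms, key=lambda b: sums[i][b] / counts[i][b]
                    + np.sqrt(2 * np.log(t) / counts[i][b]))
        r = rng.binomial(1, means[a])
        counts[i][a] += 1
        sums[i][a] += r
        regret[i] += means[best] - means[a]
    if t % 500 == 0:                     # gossip phase (period is arbitrary)
        for i in range(n):
            j = int(rng.choice(neighbors[i]))
            active[i].add(int(counts[j].argmax()))  # honest recommendation
            if i == 0:                   # agent 0 has a malicious neighbor
                active[i].add(int(means.argmin()))  # who pushes a bad arm

print("per-agent regret:", [round(x, 1) for x in regret])
```

In runs like this, the best arm tends to spread along the line through gossip, and only the agent adjacent to the malicious neighbor pays a noticeable extra exploration cost, mirroring the local $d_{\text{mal}}(i)$ dependence in the bound above. The paper's actual algorithm additionally has to block or down-weight untrustworthy recommendations to obtain the stated guarantee.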
Related papers
- Online Inverse Linear Optimization: Improved Regret Bound, Robustness to Suboptimality, and Toward Tight Regret Analysis [25.50155563108198]
We study an online learning problem where, over $T$ rounds, a learner observes both time-varying sets of feasible actions and an agent's optimal actions.
We obtain an $O(n\ln T)$ regret bound, improving upon the previous bound of $O(n^4\ln T)$ by a factor of $n^3$.
arXiv Detail & Related papers (2025-01-24T09:19:15Z) - p-Mean Regret for Stochastic Bandits [52.828710025519996]
We introduce a simple, unified UCB-based algorithm that achieves novel $p$-mean regret bounds.
Our framework encompasses both average cumulative regret and Nash regret as special cases.
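For context, the power-mean definition presumably behind the name (a standard definition assumed here, not quoted from the paper) is
$$ M_p(x_1,\dots,x_T) = \Bigl(\frac{1}{T}\sum_{t=1}^{T} x_t^{\,p}\Bigr)^{1/p}, \qquad \lim_{p\to 0} M_p = \Bigl(\prod_{t=1}^{T} x_t\Bigr)^{1/T}, $$
so $p = 1$ recovers the arithmetic mean (average cumulative regret) and the $p \to 0$ limit recovers the geometric mean underlying Nash regret.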
arXiv Detail & Related papers (2024-12-14T08:38:26Z) - Fair Multi-Agent Bandits [14.614647884175657]
We provide an algorithm with regret $O\left(N^3 \log \frac{B}{\Delta} f(\log T) \log T\right)$, where $f(t)$ is any function diverging to infinity with $t$.
This significantly improves previous results, which had the same $O(f(\log T)\log T)$ upper bound on the regret but an exponential dependence on the number of agents.
arXiv Detail & Related papers (2023-06-07T15:05:53Z) - Online Learning with Feedback Graphs: The True Shape of Regret [82.00098840619847]
We prove that the minimax regret is proportional to $R^*$ for any graph and time horizon $T$.
Introducing an intricate exploration strategy, we define an algorithm that achieves the minimax optimal regret bound.
arXiv Detail & Related papers (2023-06-05T15:35:00Z) - UniRank: Unimodal Bandit Algorithm for Online Ranking [0.0]
We create an algorithm with an expected regret of $O(\frac{L\log(L)}{\Delta}\log(T))$ with $2L$ players, horizon $T$, and a minimum reward gap $\Delta$.
Experimental results corroborate these theoretical findings in practice.
arXiv Detail & Related papers (2022-08-02T15:01:33Z) - Unimodal Mono-Partite Matching in a Bandit Setting [0.0]
We create an algorithm with an expected regret of $O(\frac{L\log(L)}{\Delta}\log(T))$ with $2L$ players, horizon $T$, and a minimum reward gap $\Delta$.
Experimental results corroborate these theoretical findings in practice.
arXiv Detail & Related papers (2022-08-02T14:55:50Z) - Corralling a Larger Band of Bandits: A Case Study on Switching Regret for Linear Bandits [99.86860277006318]
We consider the problem of combining and learning over a set of adversarial algorithms with the goal of adaptively tracking the best one on the fly.
The CORRAL algorithm of Agarwal et al. achieves this goal with a regret overhead of order $\widetilde{O}(\sqrt{dST})$, where $M$ is the number of base algorithms and $T$ is the time horizon.
Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only logarithmic dependence on $M$ as long...
arXiv Detail & Related papers (2022-02-12T21:55:44Z) - Logarithmic Regret from Sublinear Hints [76.87432703516942]
We show that an algorithm can obtain $O(\log T)$ regret with just $O(\sqrt{T})$ hints under a natural query model.
We also show that $o(\sqrt{T})$ hints cannot guarantee better than $\Omega(\sqrt{T})$ regret.
arXiv Detail & Related papers (2021-11-09T16:50:18Z) - Differentially Private Multi-Armed Bandits in the Shuffle Model [58.22098764071924]
We give an $(\varepsilon,\delta)$-differentially private algorithm for the multi-armed bandit (MAB) problem in the shuffle model.
Our upper bound almost matches the regret of the best known algorithms for the centralized model, and significantly outperforms the best known algorithm in the local model.
arXiv Detail & Related papers (2021-06-05T14:11:01Z) - Communication Efficient Parallel Reinforcement Learning [34.77250498401055]
We consider the problem where $M$ agents interact with $M$ identical and independent environments with $S$ states and $A$ actions.
We aim to find an algorithm that allows the agents to minimize the regret with infrequent communication rounds.
arXiv Detail & Related papers (2021-02-22T02:46:36Z) - Impact of Representation Learning in Linear Bandits [83.17684841392754]
We study how representation learning can improve the efficiency of bandit problems.
We present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit.
arXiv Detail & Related papers (2020-10-13T16:35:30Z)