Simple Opinion Dynamics for No-Regret Learning
- URL: http://arxiv.org/abs/2306.08670v5
- Date: Fri, 18 Oct 2024 08:00:31 GMT
- Title: Simple Opinion Dynamics for No-Regret Learning
- Authors: John Lazarsfeld, Dan Alistarh
- Abstract summary: We study a cooperative multi-agent bandit setting in the distributed GOSSIP model.
We introduce and analyze families of memoryless and time-independent protocols for this setting.
For stationary reward settings, we prove for the first time that these simple protocols exhibit best-of-both-worlds behavior.
- Score: 38.61048016579232
- License:
- Abstract: We study a cooperative multi-agent bandit setting in the distributed GOSSIP model: in every round, each of $n$ agents chooses an action from a common set, observes the action's corresponding reward, and subsequently exchanges information with a single randomly chosen neighbor, which may inform its choice in the next round. We introduce and analyze families of memoryless and time-independent protocols for this setting, inspired by opinion dynamics that are well-studied for other algorithmic tasks in the GOSSIP model. For stationary reward settings, we prove for the first time that these simple protocols exhibit best-of-both-worlds behavior, simultaneously obtaining constant cumulative regret scaling like $R(T)/T = \widetilde O(1/T)$, and also reaching consensus on the highest-mean action within $\widetilde O(\sqrt{n})$ rounds. We obtain these results by showing a new connection between the global evolution of these decentralized protocols and a class of zero-sum multiplicative weights update processes. Using this connection, we establish a general framework for analyzing the population-level regret and other properties of our protocols. Finally, we show our protocols are also surprisingly robust to adversarial rewards, and in this regime we obtain sublinear regret scaling like $R(T)/T = \widetilde O(1/\sqrt{T})$ as long as the number of rounds does not grow too fast as a function of $n$.
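To make the setting concrete, here is a minimal simulation sketch of one memoryless, time-independent rule in the spirit the abstract describes: each agent's entire state is a single action (its "opinion"); every round it plays that action, observes a Bernoulli reward, contacts one uniformly random agent (the GOSSIP exchange), and adopts that agent's action if its own reward was 0. The adopt-on-failure rule, the reward means, and all function and parameter names are illustrative assumptions, not the paper's exact protocols.

```python
import random

def simulate_gossip_bandit(n_agents=1000, n_actions=3, T=200,
                           means=(0.3, 0.5, 0.8), seed=0):
    """Toy simulation of a memoryless, time-independent gossip protocol for
    cooperative bandits (an illustration of the setting, not the paper's rule)."""
    rng = random.Random(seed)
    # Each agent's entire state is a single action index (its "opinion").
    opinions = [rng.randrange(n_actions) for _ in range(n_agents)]
    best_mean = max(means)
    avg_regret = 0.0

    for _ in range(T):
        # 1) Every agent plays its current action and observes a Bernoulli reward.
        rewards = [1 if rng.random() < means[a] else 0 for a in opinions]
        avg_regret += sum(best_mean - means[a] for a in opinions) / n_agents
        # 2) GOSSIP step: each agent contacts one uniformly random agent and,
        #    if its own reward was 0, adopts that agent's action (adopt-on-failure).
        contacts = [rng.randrange(n_agents) for _ in range(n_agents)]
        opinions = [opinions[contacts[i]] if rewards[i] == 0 else opinions[i]
                    for i in range(n_agents)]

    frac_best = opinions.count(means.index(best_mean)) / n_agents
    return avg_regret, frac_best

if __name__ == "__main__":
    regret, frac = simulate_gossip_bandit()
    print(f"population-average cumulative regret: {regret:.3f}")
    print(f"fraction of agents on the highest-mean action: {frac:.2f}")
```

The rule uses no memory of past rounds and no round counter, matching the memoryless and time-independent constraints; the paper's guarantees are established analytically via the connection to zero-sum multiplicative weights update processes rather than by simulation.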
Related papers
- Cooperative Multi-Agent Constrained Stochastic Linear Bandits [2.099922236065961]
A network of $N$ agents communicate locally to minimize their collective regret while keeping their expected cost under a specified threshold $\tau$.
We propose a safe distributed upper confidence bound algorithm, called MA-OPLB, and establish a high-probability bound on its $T$-round regret.
We show that our regret bound is of order $\mathcal{O}\left(\frac{d}{\tau - c_0}\,\frac{\log(NT)^2}{\sqrt{N}}\,\sqrt{\frac{T}{\log(1/|\lambda|)}}\right)$.
arXiv Detail & Related papers (2024-10-22T19:34:53Z)
- Near-Optimal Regret in Linear MDPs with Aggregate Bandit Feedback [38.61232011566285]
We study the recently proposed model of RL with Aggregate Bandit Feedback (RL-ABF), where the agent only observes the sum of rewards at the end of an episode instead of each reward individually.
In this paper, we extend ABF to linear function approximation and develop two efficient algorithms with near-optimal regret guarantees.
arXiv Detail & Related papers (2024-05-13T10:51:01Z)
- Improved Sample Complexity for Reward-free Reinforcement Learning under Low-rank MDPs [43.53286390357673]
This paper focuses on reward-free reinforcement learning under low-rank MDP models.
We provide the first known sample complexity lower bound for any algorithm under low-rank MDPs.
We then propose a novel model-based algorithm, coined RAFFLE, and show it can both find an $\epsilon$-optimal policy and achieve an $\epsilon$-accurate system identification.
arXiv Detail & Related papers (2023-03-20T04:39:39Z)
- On the Sample Complexity of Representation Learning in Multi-task Bandits with Global and Local structure [77.60508571062958]
We investigate the sample complexity of learning the optimal arm for multi-task bandit problems.
Arms consist of two components: one that is shared across tasks (which we call the representation) and one that is task-specific (which we call the predictor).
We devise an algorithm OSRL-SC whose sample complexity approaches the lower bound and scales at most as $H(G\log(\delta_G) + X\log(\delta_H))$, with $X$, $G$, $H$ being, respectively, the number of tasks, representations, and predictors.
arXiv Detail & Related papers (2022-11-28T08:40:12Z)
- Collaborative Multi-agent Stochastic Linear Bandits [28.268809091816287]
We study a collaborative multi-agent linear bandit setting, where $N$ agents that form a network communicate locally to minimize their overall regret.
All the agents observe the corresponding rewards of the played actions and use an accelerated consensus procedure to compute an estimate of the average of the rewards obtained by all the agents.
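As a rough illustration of such a consensus step, the sketch below runs plain (unaccelerated) gossip averaging of per-agent reward estimates with a doubly stochastic mixing matrix; the matrix, the number of mixing iterations, and all names are assumptions for illustration, not that paper's accelerated procedure.

```python
import numpy as np

def consensus_average(local_values, W, num_iters=10):
    """Plain (unaccelerated) consensus averaging: each agent repeatedly replaces
    its value with a weighted average of its neighbors' values. If W is doubly
    stochastic and respects the communication graph, all entries converge to the
    network-wide mean of the initial values."""
    x = np.asarray(local_values, dtype=float)
    for _ in range(num_iters):
        x = W @ x
    return x  # each entry approximates np.mean(local_values)

# Toy example: 4 agents on a ring with uniform mixing weights and self-loops.
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
print(consensus_average([1.0, 0.0, 0.5, 0.3], W))  # entries close to 0.45
```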
arXiv Detail & Related papers (2022-05-12T19:46:35Z)
- The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret and Policy Switches [84.54669549718075]
We study the problem of regret minimization for episodic Reinforcement Learning (RL).
We focus on learning with general function classes and general model classes.
We show that a logarithmic regret bound is realizable by algorithms with $O(\log T)$ switching cost.
arXiv Detail & Related papers (2022-03-03T02:55:55Z)
- Cooperative Online Learning in Stochastic and Adversarial MDPs [50.62439652257712]
We study cooperative online learning in stochastic and adversarial Markov decision processes (MDPs).
In each episode, $m$ agents interact with an MDP simultaneously and share information in order to minimize their individual regret.
We are the first to consider cooperative reinforcement learning (RL) either with non-fresh randomness or in adversarial MDPs.
arXiv Detail & Related papers (2022-01-31T12:32:11Z)
- Top $K$ Ranking for Multi-Armed Bandit with Noisy Evaluations [102.32996053572144]
We consider a multi-armed bandit setting where, at the beginning of each round, the learner receives noisy independent evaluations of the true reward of each arm.
We derive different algorithmic approaches and theoretical guarantees depending on how the evaluations are generated.
arXiv Detail & Related papers (2021-12-13T09:48:54Z)
- Decentralized Multi-Agent Linear Bandits with Safety Constraints [31.67685495996986]
We study decentralized linear bandits, where a network of $N$ agents acts cooperatively to solve a linear bandit-optimization problem.
We propose DLUCB: a fully decentralized algorithm that minimizes the cumulative regret over the entire network.
We show that our ideas extend naturally to the emerging, albeit more challenging, setting of safe bandits.
arXiv Detail & Related papers (2020-12-01T07:33:00Z)
- Weighted QMIX: Expanding Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning [66.94149388181343]
We present a new version of a popular $Q$-learning algorithm for MARL.
We show that it can recover the optimal policy even with access to $Q^*$.
We also demonstrate improved performance on predator-prey and challenging multi-agent StarCraft benchmark tasks.
arXiv Detail & Related papers (2020-06-18T18:34:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.