Optimal Cooperative Multiplayer Learning Bandits with Noisy Rewards and No Communication
- URL: http://arxiv.org/abs/2311.06210v1
- Date: Fri, 10 Nov 2023 17:55:44 GMT
- Title: Optimal Cooperative Multiplayer Learning Bandits with Noisy Rewards and No Communication
- Authors: William Chang, Yuanhao Lu
- Abstract summary: We consider a cooperative multiplayer bandit learning problem where the players are only allowed to agree on a strategy beforehand.
In this problem, each player simultaneously selects an action.
We show that this algorithm can achieve logarithmic $O(\frac{\log T}{\Delta_{\bm{a}}})$ (gap-dependent) regret as well as $O(\sqrt{T\log T})$ (gap-independent) regret.
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We consider a cooperative multiplayer bandit learning problem where the
players are only allowed to agree on a strategy beforehand, but cannot
communicate during the learning process. In this problem, each player
simultaneously selects an action. Based on the actions selected by all players,
the team of players receives a reward. The actions of all the players are
commonly observed. However, each player receives a noisy version of the reward
which cannot be shared with other players. Since players receive potentially
different rewards, there is an asymmetry in the information used to select
their actions. In this paper, we provide an algorithm based on upper and lower
confidence bounds that the players can use to select their optimal actions
despite the asymmetry in the reward information. We show that this algorithm
can achieve logarithmic $O(\frac{\log T}{\Delta_{\bm{a}}})$ (gap-dependent)
regret as well as $O(\sqrt{T\log T})$ (gap-independent) regret. This is
asymptotically optimal in $T$. We also show that it empirically outperforms
the current state-of-the-art algorithm for this environment.
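As a rough, non-authoritative illustration of this setting (not the paper's algorithm), the Python sketch below assumes two players with three actions each, Gaussian observation noise, and a plain per-player joint-action UCB rule; the class, parameter names, and noise level are all invented for the example.

```python
import numpy as np

# Hedged illustration of the setting in the abstract: both players observe the
# realized joint action every round, but each only sees its own noisy copy of
# the common team reward and cannot communicate. This is a generic
# joint-action UCB sketch under those assumptions, NOT the paper's exact
# upper/lower confidence-bound algorithm.

rng = np.random.default_rng(0)
N_ACTIONS = 3                # actions per player (2 players -> 3x3 joint actions)
HORIZON = 5000
SIGMA = 0.5                  # per-player observation noise (assumed)
TRUE_MEANS = rng.uniform(0.0, 1.0, size=(N_ACTIONS, N_ACTIONS))


class Player:
    """One player's private estimates over the commonly observed joint actions."""

    def __init__(self, role):
        self.role = role                                  # 0 = row player, 1 = column player
        self.counts = np.zeros((N_ACTIONS, N_ACTIONS))
        self.means = np.zeros((N_ACTIONS, N_ACTIONS))

    def pick_action(self):
        # Standard UCB index over joint actions; unseen pairs get infinite index.
        bonus = np.sqrt(2.0 * np.log(HORIZON) / np.maximum(self.counts, 1))
        index = np.where(self.counts == 0, np.inf, self.means + bonus)
        best = np.unravel_index(np.argmax(index), index.shape)
        return best[self.role]                            # play own component of best joint action

    def update(self, joint_action, private_reward):
        # Update with this player's OWN noisy observation of the team reward.
        self.counts[joint_action] += 1
        n = self.counts[joint_action]
        self.means[joint_action] += (private_reward - self.means[joint_action]) / n


players = [Player(0), Player(1)]
regret = 0.0
for t in range(HORIZON):
    # Each player picks its own component of its preferred joint action.
    joint_action = tuple(p.pick_action() for p in players)
    regret += TRUE_MEANS.max() - TRUE_MEANS[joint_action]
    for p in players:
        # Independent noise per player: the "reward information asymmetry".
        p.update(joint_action, TRUE_MEANS[joint_action] + SIGMA * rng.normal())

print(f"cumulative regret over {HORIZON} rounds: {regret:.1f}")
```

The point of the sketch is the failure mode: because each player's estimates are built from different noisy samples, their preferred joint actions can disagree, and the realized joint action may then be suboptimal for both; handling this mis-coordination without communication is what the paper's upper- and lower-confidence-bound construction is designed for.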
Related papers
- Multi-agent Multi-armed Bandits with Stochastic Sharable Arm Capacities [69.34646544774161]
We formulate a new variant of the multi-player multi-armed bandit (MAB) model, which captures the arrival of requests to each arm and the policy for allocating requests to players.
The challenge is how to design a distributed learning algorithm such that players select arms according to the optimal arm pulling profile.
We design an iterative distributed algorithm, which guarantees that players can arrive at a consensus on the optimal arm pulling profile in only M rounds.
arXiv Detail & Related papers (2024-08-20T13:57:00Z) - Competing for Shareable Arms in Multi-Player Multi-Armed Bandits [29.08799537067425]
We study a novel multi-player multi-armed bandit (MPMAB) setting where players are selfish and aim to maximize their own rewards.
We propose a novel Selfish MPMAB with Averaging Allocation (SMAA) approach based on the equilibrium.
We establish that no single selfish player can significantly increase their rewards through deviation, nor can they detrimentally affect other players' rewards without incurring substantial losses for themselves.
arXiv Detail & Related papers (2023-05-30T15:59:56Z) - The Pareto Frontier of Instance-Dependent Guarantees in Multi-Player
Multi-Armed Bandits with no Communication [10.446001329147112]
We study the multi-player multi-armed bandit problem.
In this problem, $m$ players cooperate to maximize their total reward from $K > m$ arms.
We ask whether it is possible to obtain optimal instance-dependent regret $\tilde{O}(1/\Delta)$ where $\Delta$ is the gap between the $m$-th and $(m+1)$-st best arms.
arXiv Detail & Related papers (2022-02-19T18:19:36Z) - Near-Optimal Learning of Extensive-Form Games with Imperfect Information [54.55092907312749]
We present the first line of algorithms that require only $\widetilde{\mathcal{O}}((XA+YB)/\varepsilon^2)$ episodes of play to find an $\varepsilon$-approximate Nash equilibrium in two-player zero-sum games.
This improves upon the best known sample complexity of $\widetilde{\mathcal{O}}((X^2A+Y^2B)/\varepsilon^2)$ by a factor of $\widetilde{\mathcal{O}}(\max\{X, Y\})$.
arXiv Detail & Related papers (2022-02-03T18:18:28Z) - Can Reinforcement Learning Find Stackelberg-Nash Equilibria in
General-Sum Markov Games with Myopic Followers? [156.5760265539888]
We study multi-player general-sum Markov games with one of the players designated as the leader and the other players regarded as followers.
For such a game, our goal is to find a Stackelberg-Nash equilibrium (SNE), which is a policy pair $(\pi^*, \nu^*)$.
We develop sample-efficient reinforcement learning (RL) algorithms for solving for an SNE in both online and offline settings.
arXiv Detail & Related papers (2021-12-27T05:41:14Z) - Doubly Optimal No-Regret Online Learning in Strongly Monotone Games with Bandit Feedback [29.553652241608997]
We study the class of smooth and strongly monotone games and optimal no-regret learning therein.
We first construct a new bandit learning algorithm and show that it achieves the single-agent optimal regret of $\tilde{\Theta}(n\sqrt{T})$.
Our results thus settle this open problem and contribute to the broad landscape of bandit game-theoretical learning.
arXiv Detail & Related papers (2021-12-06T08:27:54Z) - Online Learning for Cooperative Multi-Player Multi-Armed Bandits [7.527034429851578]
We introduce a framework for decentralized online learning for multi-armed bandits (MAB) with multiple cooperative players.
The reward obtained by the players in each round depends on the actions taken by all the players.
We consider three types of information asymmetry: action information asymmetry, when the actions of the players cannot be observed but the rewards received are common; reward information asymmetry, when the actions of the other players are observable but the rewards received are IID from the same distribution; and the setting with both action and reward information asymmetry.
arXiv Detail & Related papers (2021-09-07T18:18:58Z) - Bandit Learning in Decentralized Matching Markets [82.39061186055775]
We study two-sided matching markets in which one side of the market (the players) does not have a priori knowledge about its preferences for the other side (the arms) and is required to learn its preferences from experience.
This model extends the standard multi-armed bandit framework to a decentralized multiple player setting with competition.
We show that the algorithm is incentive compatible whenever the arms' preferences are shared, but not necessarily so when preferences are fully general.
arXiv Detail & Related papers (2020-12-14T08:58:07Z) - Faster Algorithms for Optimal Ex-Ante Coordinated Collusive Strategies
in Extensive-Form Zero-Sum Games [123.76716667704625]
We focus on the problem of finding an optimal strategy for a team of two players that faces an opponent in an imperfect-information zero-sum extensive-form game.
In that setting, it is known that the best the team can do is sample a profile of potentially randomized strategies (one per player) from a joint (a.k.a. correlated) probability distribution at the beginning of the game.
We provide an algorithm that computes such an optimal distribution by only using profiles where only one of the team members gets to randomize in each profile.
arXiv Detail & Related papers (2020-09-21T17:51:57Z) - Learning to Play Sequential Games versus Unknown Opponents [93.8672371143881]
We consider a repeated sequential game between a learner, who plays first, and an opponent who responds to the chosen action.
We propose a novel algorithm for the learner when playing against an adversarial sequence of opponents.
Our results include regret guarantees for the algorithm that depend on the regularity of the opponent's responses.
arXiv Detail & Related papers (2020-07-10T09:33:05Z) - Multiplayer Bandit Learning, from Competition to Cooperation [3.7801191959442053]
We study the effects of competition and cooperation on the tradeoff between exploration and exploitation.
The model is related to the economics literature on strategic experimentation, where players usually observe each other's rewards.
arXiv Detail & Related papers (2019-08-03T08:20:54Z)