Constant or logarithmic regret in asynchronous multiplayer bandits
- URL: http://arxiv.org/abs/2305.19691v1
- Date: Wed, 31 May 2023 09:35:03 GMT
- Authors: Hugo Richard, Etienne Boursier, Vianney Perchet
- Abstract summary: The asynchronous multiplayer bandits problem was first tackled with an explore-then-commit (ETC) algorithm.
We introduce Cautious Greedy, an algorithm that yields constant instance-dependent regret whenever the optimal policy assigns at least one player to each arm.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multiplayer bandits have recently been extensively studied because of their
application to cognitive radio networks.
While the literature mostly considers synchronous players, radio networks
(e.g. for IoT) tend to have asynchronous devices. This motivates the harder,
asynchronous multiplayer bandits problem, which was first tackled with an
explore-then-commit (ETC) algorithm (see Dakdouk, 2022), with a regret
upper-bound in $\mathcal{O}(T^{\frac{2}{3}})$. Before even considering
decentralization, understanding the centralized case was still a challenge as
it was unknown whether getting a regret smaller than $\Omega(T^{\frac{2}{3}})$
was possible.
We answer this question positively, as a natural extension of UCB exhibits a
$\mathcal{O}(\sqrt{T\log(T)})$ minimax regret.
More importantly, we introduce Cautious Greedy, a centralized algorithm that
yields constant instance-dependent regret if the optimal policy assigns at
least one player to each arm (a situation that is proved to occur when arm
means are close enough). Otherwise, its regret increases as the sum of
$\log(T)$ over some sub-optimality gaps. We provide lower bounds showing that
Cautious Greedy is optimal in the data-dependent terms.
Therefore, we set up a strong baseline for asynchronous multiplayer bandits
and suggest that learning the optimal policy in this problem might be easier
than thought, at least with centralization.
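The centralized UCB extension mentioned in the abstract can be pictured with a minimal sketch. The code below is a hypothetical illustration, not the paper's Cautious Greedy algorithm: a central controller keeps UCB indices over Bernoulli arms and, each round, assigns whichever players happen to be active (asynchronicity is modeled here, as an assumption, by each player being active independently with some probability) to the highest-index arms. All function names and the reward model are illustrative choices, not taken from the paper.

```python
import math
import random

def ucb_assign(active_players, counts, sums, t, n_arms):
    """Centralized assignment sketch: rank arms by UCB index and give
    the top-ranked arms to the currently active players."""
    def index(a):
        if counts[a] == 0:
            return float("inf")  # force each arm to be tried once
        mean = sums[a] / counts[a]
        return mean + math.sqrt(2.0 * math.log(t) / counts[a])

    ranked = sorted(range(n_arms), key=index, reverse=True)
    # if there are more active players than arms, wrap around (assumption)
    return {p: ranked[i % n_arms] for i, p in enumerate(active_players)}

def simulate(means, n_players, horizon, p_active=0.5, seed=0):
    """Run the controller on Bernoulli arms with asynchronously active
    players; return the total collected reward."""
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    total = 0.0
    for t in range(1, horizon + 1):
        # each player is independently active this round (asynchronicity)
        active = [p for p in range(n_players) if rng.random() < p_active]
        for p, a in ucb_assign(active, counts, sums, t, n_arms).items():
            reward = 1.0 if rng.random() < means[a] else 0.0
            counts[a] += 1
            sums[a] += reward
            total += reward
    return total
```

This centralized view matches the setting of the abstract: because one controller sees all arm statistics, no coordination or collision-avoidance protocol between players is needed, which is what makes the sub-$T^{2/3}$ regret question tractable before decentralization is even considered.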
Related papers
- Fast Last-Iterate Convergence of Learning in Games Requires Forgetful Algorithms [71.73971094342349]
Self-play via online learning is one of the premier ways to solve large-scale two-player zero-sum games.
We show that OMWU offers several advantages including logarithmic dependence on the size of the payoff matrix.
We prove that a broad class of algorithms that do not forget the past quickly all suffer the same issue.
arXiv Detail & Related papers (2024-06-15T13:26:17Z)
- Adversarial Combinatorial Bandits with Switching Costs [55.2480439325792]
We study the adversarial bandit problem with a switching cost $\lambda$ incurred each time the selected arm changes between rounds.
We derive lower bounds for the minimax regret and design algorithms to approach them.
arXiv Detail & Related papers (2024-04-02T12:15:37Z)
- Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [56.457634640638254]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM)
We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde{O}\big(d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d\big)$.
Our regret bound naturally aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm only suffers an $\tilde{O}(d)$ regret.
arXiv Detail & Related papers (2023-10-02T08:15:52Z)
- Tight Memory-Regret Lower Bounds for Streaming Bandits [11.537938617281736]
The learner aims to minimize regret while dealing with online arriving arms and sublinear arm memory.
We establish a tight worst-case regret lower bound of $\Omega\left((TB)^{\alpha} K^{1-\alpha}\right)$, where $\alpha = 2^{B}/(2^{B+1}-1)$, for any algorithm.
We also provide a multi-pass algorithm that achieves a regret upper bound of $\tilde{O}\left((TB)^{\alpha} K^{1-\alpha}\right)$ using constant arm memory.
arXiv Detail & Related papers (2023-06-13T16:54:13Z)
- A Simple and Provably Efficient Algorithm for Asynchronous Federated Contextual Linear Bandits [77.09836892653176]
We study federated contextual linear bandits, where $M$ agents cooperate with each other to solve a global contextual linear bandit problem with the help of a central server.
We consider the asynchronous setting, where all agents work independently and the communication between one agent and the server will not trigger other agents' communication.
We prove that the regret of FedLinUCB is bounded by $\tilde{O}(d\sqrt{\sum_{m=1}^{M} T_m})$ and the communication complexity is $\tilde{O}(dM^2)$.
arXiv Detail & Related papers (2022-07-07T06:16:19Z)
- Complete Policy Regret Bounds for Tallying Bandits [51.039677652803675]
Policy regret is a well established notion of measuring the performance of an online learning algorithm against an adaptive adversary.
We study restrictions on the adversary that enable efficient minimization of the complete policy regret.
We provide an algorithm that, with high probability, achieves a complete policy regret guarantee of $\tilde{\mathcal{O}}(mK\sqrt{T})$, where the $\tilde{\mathcal{O}}$ notation hides only logarithmic factors.
arXiv Detail & Related papers (2022-04-24T03:10:27Z)
- The Pareto Frontier of Instance-Dependent Guarantees in Multi-Player Multi-Armed Bandits with no Communication [10.446001329147112]
We study the multi-player multi-armed bandit problem.
In this problem, $m$ players cooperate to maximize their total reward from $K > m$ arms.
We ask whether it is possible to obtain optimal instance-dependent regret $\tilde{O}(1/\Delta)$, where $\Delta$ is the gap between the $m$-th and $(m+1)$-st best arms.
arXiv Detail & Related papers (2022-02-19T18:19:36Z)
- Bandit Learning in Decentralized Matching Markets [82.39061186055775]
We study two-sided matching markets in which one side of the market (the players) does not have a priori knowledge about its preferences for the other side (the arms) and is required to learn its preferences from experience.
This model extends the standard multi-armed bandit framework to a decentralized multiple player setting with competition.
We show that the algorithm is incentive compatible whenever the arms' preferences are shared, but not necessarily so when preferences are fully general.
arXiv Detail & Related papers (2020-12-14T08:58:07Z)
- Contextual Blocking Bandits [35.235375147227124]
We study a novel variant of the multi-armed bandit problem, where at each time step, the player observes an independently sampled context that determines the arms' mean rewards.
Playing an arm blocks it (across all contexts) for a fixed and known number of future time steps.
We propose a UCB-based variant of the full-information algorithm that guarantees $\mathcal{O}(\log T)$ regret with respect to an $\alpha$-regret strategy in $T$ time steps, matching the $\Omega(\log T)$ lower bound.
arXiv Detail & Related papers (2020-03-06T20:34:42Z)
- Coordination without communication: optimal regret in two players multi-armed bandits [1.6752182911522522]
We consider two agents playing simultaneously the same three-armed bandit problem.
We propose a strategy with no collisions at all between the players.
We also argue that the extra logarithmic factor $\sqrt{\log(T)}$ is necessary by proving a lower bound for a full-information variant of the problem.
arXiv Detail & Related papers (2020-02-14T17:35:42Z)
- Selfish Robustness and Equilibria in Multi-Player Bandits [25.67398941667429]
In a game, several players simultaneously pull arms and encounter a collision - with 0 reward - if some of them pull the same arm at the same time.
While the cooperative case, where players maximize the collective reward, has been mostly considered, robustness to malicious players is a crucial and challenging concern.
We shall consider instead the more natural class of selfish players whose incentives are to maximize their individual rewards, potentially at the expense of the social welfare.
arXiv Detail & Related papers (2020-02-04T09:50:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.