An Efficient Algorithm for Fair Multi-Agent Multi-Armed Bandit with Low
Regret
- URL: http://arxiv.org/abs/2209.11817v1
- Date: Fri, 23 Sep 2022 19:15:43 GMT
- Title: An Efficient Algorithm for Fair Multi-Agent Multi-Armed Bandit with Low
Regret
- Authors: Matthew Jones, Huy Lê Nguyen, Thy Nguyen
- Abstract summary: We propose a new efficient algorithm with lower regret than even previous inefficient ones.
For $N$ agents, $K$ arms, and $T$ rounds, our approach has a regret bound of $\tilde{O}(\sqrt{NKT} + NK)$.
We also complement our efficient algorithm with an inefficient approach with $\tilde{O}(\sqrt{KT} + N^2K)$ regret.
- Score: 7.059472280274009
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently a multi-agent variant of the classical multi-armed bandit was
proposed to tackle fairness issues in online learning. Inspired by a long line
of work in social choice and economics, the goal is to optimize the Nash social
welfare instead of the total utility. Unfortunately, previous algorithms either
are not efficient or achieve sub-optimal regret in terms of the number of
rounds $T$. We propose a new efficient algorithm with lower regret than even
previous inefficient ones. For $N$ agents, $K$ arms, and $T$ rounds, our
approach has a regret bound of $\tilde{O}(\sqrt{NKT} + NK)$. This is an
improvement over the previous approach, which has a regret bound of $\tilde{O}(
\min(NK, \sqrt{N} K^{3/2})\sqrt{T})$. We also complement our efficient
algorithm with an inefficient approach with $\tilde{O}(\sqrt{KT} + N^2K)$
regret. The experimental findings confirm the effectiveness of our efficient
algorithm compared to the previous approaches.
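For reference, here is a minimal sketch of the fairness objective this line of work targets, written under the standard assumptions of the fair multi-agent bandit literature; the symbols $\mu_{i,k}$ (agent $i$'s mean reward for arm $k$), $p_t$ (the arm distribution played in round $t$), and $\Delta_K$ (the simplex over the $K$ arms) are illustrative notation, not taken from the paper. Instead of maximizing the total utility, the learner is scored by the Nash social welfare, the geometric mean of the agents' expected utilities:
% Nash social welfare of an arm distribution p for N agents (illustrative notation)
\[
  \mathrm{NSW}(p) \;=\; \Bigl(\prod_{i=1}^{N} \sum_{k=1}^{K} p_k\, \mu_{i,k}\Bigr)^{1/N},
  \qquad p \in \Delta_K.
\]
% Regret over T rounds, measured against the best fixed distribution p^*
\[
  R(T) \;=\; T\cdot \mathrm{NSW}(p^{*}) \;-\; \sum_{t=1}^{T} \mathrm{NSW}(p_t),
  \qquad p^{*} \in \operatorname*{arg\,max}_{p \in \Delta_K} \mathrm{NSW}(p).
\]
Under this (assumed) convention, the $\tilde{O}(\sqrt{NKT} + NK)$ bound above compares the learner's cumulative Nash social welfare to that of the best fixed distribution over arms.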
Related papers
- Achieving Tractable Minimax Optimal Regret in Average Reward MDPs [19.663336027878408]
We present the first tractable algorithm with minimax optimal regret of $\widetilde{\mathrm{O}}(\sqrt{\mathrm{sp}(h^{*}) S A T})$.
Remarkably, our algorithm does not require prior information on $\mathrm{sp}(h^{*})$.
arXiv Detail & Related papers (2024-06-03T11:53:44Z) - Towards Optimal Regret in Adversarial Linear MDPs with Bandit Feedback [30.337826346496385]
We study online reinforcement learning in linear Markov decision processes with adversarial losses and bandit feedback.
We introduce two algorithms that achieve improved regret performance compared to existing approaches.
arXiv Detail & Related papers (2023-10-17T19:43:37Z) - Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
We propose a provably efficient reinforcement learning algorithm (both computationally and statistically) with general value function approximations.
Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting.
arXiv Detail & Related papers (2023-02-22T20:21:25Z) - An Asymptotically Optimal Batched Algorithm for the Dueling Bandit
Problem [13.69077222007053]
We study the $K$-armed dueling bandit problem, a variation of the traditional multi-armed bandit problem in which feedback is obtained in the form of pairwise comparisons.
We obtain regret of $O(K^2\log(K)) + O(K\log(T))$ in $O(\log(T))$ rounds, where $T$ is the time horizon.
In computational experiments over a variety of real-world datasets, we observe that our algorithm using $O(\log(T))$ rounds achieves almost the same performance as fully
arXiv Detail & Related papers (2022-09-25T00:23:55Z) - Corralling a Larger Band of Bandits: A Case Study on Switching Regret
for Linear Bandits [99.86860277006318]
We consider the problem of combining and learning over a set of adversarial algorithms with the goal of adaptively tracking the best one on the fly.
The CORRAL algorithm of Agarwal et al. achieves this goal with a regret overhead of order $\widetilde{O}(\sqrt{MT})$, where $M$ is the number of base algorithms and $T$ is the time horizon.
Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only logarithmic dependence on $M$ as long
arXiv Detail & Related papers (2022-02-12T21:55:44Z) - Gaussian Process Bandit Optimization with Few Batches [49.896920704012395]
We introduce a batch algorithm inspired by finite-arm bandit algorithms.
We show that it achieves the cumulative regret upper bound $O^{\ast}(\sqrt{T\gamma_T})$ using $O(\log\log T)$ batches within time horizon $T$.
In addition, we propose a modified version of our algorithm, and characterize how the regret is impacted by the number of batches.
arXiv Detail & Related papers (2021-10-15T00:54:04Z) - Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits
with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$^2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z) - Impact of Representation Learning in Linear Bandits [83.17684841392754]
We study how representation learning can improve the efficiency of bandit problems.
We present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit.
arXiv Detail & Related papers (2020-10-13T16:35:30Z) - Improved Sleeping Bandits with Stochastic Actions Sets and Adversarial
Rewards [59.559028338399855]
We consider the problem of sleeping bandits with stochastic action sets and adversarial rewards.
In this paper, we provide a new computationally efficient algorithm inspired by EXP3 satisfying a regret of order $O(\sqrt{T})$.
arXiv Detail & Related papers (2020-04-14T00:41:26Z)