An Efficient Algorithm for Fair Multi-Agent Multi-Armed Bandit with Low
Regret
- URL: http://arxiv.org/abs/2209.11817v1
- Date: Fri, 23 Sep 2022 19:15:43 GMT
- Title: An Efficient Algorithm for Fair Multi-Agent Multi-Armed Bandit with Low
Regret
- Authors: Matthew Jones, Huy Lê Nguyen, Thy Nguyen
- Abstract summary: We propose a new efficient algorithm with lower regret than even previous inefficient ones.
For $N$ agents, $K$ arms, and $T$ rounds, our approach has a regret bound of $\tilde{O}(\sqrt{NKT} + NK)$.
We also complement our efficient algorithm with an inefficient approach with $\tilde{O}(\sqrt{KT} + N^2K)$ regret.
- Score: 7.059472280274009
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently a multi-agent variant of the classical multi-armed bandit was
proposed to tackle fairness issues in online learning. Inspired by a long line
of work in social choice and economics, the goal is to optimize the Nash social
welfare instead of the total utility. Unfortunately, previous algorithms either
are not efficient or achieve sub-optimal regret in terms of the number of
rounds $T$. We propose a new efficient algorithm with lower regret than even
previous inefficient ones. For $N$ agents, $K$ arms, and $T$ rounds, our
approach has a regret bound of $\tilde{O}(\sqrt{NKT} + NK)$. This is an
improvement over the previous approach, which has a regret bound of $\tilde{O}(
\min(NK, \sqrt{N} K^{3/2})\sqrt{T})$. We also complement our efficient
algorithm with an inefficient approach with $\tilde{O}(\sqrt{KT} + N^2K)$
regret. The experimental findings confirm the effectiveness of our efficient
algorithm compared to the previous approaches.
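For reference, here is a minimal sketch of the fairness objective this line of work targets, written under the standard assumptions of the fair multi-agent bandit literature; the symbols $\mu_{i,k}$ (agent $i$'s mean reward for arm $k$), $p_t$ (the arm distribution played in round $t$), and $\Delta_K$ (the simplex over the $K$ arms) are illustrative notation, not taken from the paper. Instead of maximizing the total utility, the learner is scored by the Nash social welfare, the geometric mean of the agents' expected utilities:
% Nash social welfare of an arm distribution p for N agents (illustrative notation)
\[
  \mathrm{NSW}(p) \;=\; \Bigl(\prod_{i=1}^{N} \sum_{k=1}^{K} p_k\, \mu_{i,k}\Bigr)^{1/N},
  \qquad p \in \Delta_K.
\]
% Regret over T rounds, measured against the best fixed distribution p^*
\[
  R(T) \;=\; T\cdot \mathrm{NSW}(p^{*}) \;-\; \sum_{t=1}^{T} \mathrm{NSW}(p_t),
  \qquad p^{*} \in \operatorname*{arg\,max}_{p \in \Delta_K} \mathrm{NSW}(p).
\]
Under this (assumed) convention, the $\tilde{O}(\sqrt{NKT} + NK)$ bound above compares the learner's cumulative Nash social welfare to that of the best fixed distribution over arms.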
Related papers
- Achieving Tractable Minimax Optimal Regret in Average Reward MDPs [19.663336027878408]
We present the first tractable algorithm with minimax optimal regret of $\widetilde{\mathrm{O}}(\sqrt{\mathrm{sp}(h^{*}) S A T})$.
Remarkably, our algorithm does not require prior information on $\mathrm{sp}(h^{*})$.
arXiv Detail & Related papers (2024-06-03T11:53:44Z) - Towards Optimal Regret in Adversarial Linear MDPs with Bandit Feedback [30.337826346496385]
We study online reinforcement learning in linear Markov decision processes with adversarial losses and bandit feedback.
We introduce two algorithms that achieve improved regret performance compared to existing approaches.
arXiv Detail & Related papers (2023-10-17T19:43:37Z) - Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
We propose a provably efficient reinforcement learning algorithm (both computationally and statistically) with general value function approximations.
Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting.
arXiv Detail & Related papers (2023-02-22T20:21:25Z) - An Asymptotically Optimal Batched Algorithm for the Dueling Bandit
Problem [13.69077222007053]
We study the $K$-armed dueling bandit problem, a variation of the traditional multi-armed bandit problem in which feedback is obtained in the form of pairwise comparisons.
We obtain regret of $O(K^2\log(K)) + O(K\log(T))$ in $O(\log(T))$ rounds, where $T$ is the time horizon.
In computational experiments over a variety of real-world datasets, we observe that our algorithm using $O(\log(T))$ rounds achieves almost the same performance as fully
arXiv Detail & Related papers (2022-09-25T00:23:55Z) - Corralling a Larger Band of Bandits: A Case Study on Switching Regret
for Linear Bandits [99.86860277006318]
We consider the problem of combining and learning over a set of adversarial algorithms with the goal of adaptively tracking the best one on the fly.
The CORRAL algorithm of Agarwal et al. achieves this goal with a regret overhead of order $\widetilde{O}(\sqrt{MT})$, where $M$ is the number of base algorithms and $T$ is the time horizon.
Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only logarithmic dependence on $M$ as long
arXiv Detail & Related papers (2022-02-12T21:55:44Z) - Gaussian Process Bandit Optimization with Few Batches [49.896920704012395]
We introduce a batch algorithm inspired by finite-arm bandit algorithms.
We show that it achieves the cumulative regret upper bound $O^{\ast}(\sqrt{T\gamma_T})$ using $O(\log\log T)$ batches within time horizon $T$.
In addition, we propose a modified version of our algorithm, and characterize how the regret is impacted by the number of batches.
arXiv Detail & Related papers (2021-10-15T00:54:04Z) - Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits
with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$^2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z) - Impact of Representation Learning in Linear Bandits [83.17684841392754]
We study how representation learning can improve the efficiency of bandit problems.
We present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit.
arXiv Detail & Related papers (2020-10-13T16:35:30Z) - Improved Sleeping Bandits with Stochastic Actions Sets and Adversarial
Rewards [59.559028338399855]
We consider the problem of sleeping bandits with stochastic action sets and adversarial rewards.
In this paper, we provide a new computationally efficient algorithm inspired by EXP3 satisfying a regret of order $O(\sqrt{T})$.
arXiv Detail & Related papers (2020-04-14T00:41:26Z)