Fairness and Welfare Quantification for Regret in Multi-Armed Bandits
- URL: http://arxiv.org/abs/2205.13930v1
- Date: Fri, 27 May 2022 12:12:56 GMT
- Title: Fairness and Welfare Quantification for Regret in Multi-Armed Bandits
- Authors: Siddharth Barman, Arindam Khan, Arnab Maiti and Ayush Sawarni
- Abstract summary: We extend the notion of regret with a welfarist perspective.
We study Nash regret, defined as the difference between the -- a priori unknown -- optimal mean (among the arms) and the algorithm's performance.
- Score: 13.094997642327371
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We extend the notion of regret with a welfarist perspective. Focussing on the
classic multi-armed bandit (MAB) framework, the current work quantifies the
performance of bandit algorithms by applying a fundamental welfare function,
namely the Nash social welfare (NSW) function. This corresponds to equating the
algorithm's performance to the geometric mean of its expected rewards and leads
us to the study of Nash regret, defined as the difference between the -- a
priori unknown -- optimal mean (among the arms) and the algorithm's
performance. Since NSW is known to satisfy fairness axioms, our approach
complements the utilitarian considerations of average (cumulative) regret,
wherein the algorithm is evaluated via the arithmetic mean of its expected
rewards.
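To make the two notions concrete, here is a sketch of the definitions as
described in the abstract (the symbols $I_t$, $\mu_{I_t}$, and $\mu^*$ are
introduced here for illustration; the paper's exact definitions, including the
conditioning inside the expectations, may differ in detail). Writing $I_t$ for
the arm pulled in round $t$, $\mu_{I_t}$ for its mean reward, and
$\mu^* = \max_i \mu_i$ for the optimal mean: average (cumulative) regret is
$\mu^* - \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[\mu_{I_t}]$ (arithmetic mean of
expected rewards), while Nash regret is
$\mu^* - \left( \prod_{t=1}^{T} \mathbb{E}[\mu_{I_t}] \right)^{1/T}$ (geometric
mean of expected rewards).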
This work develops an algorithm that, given the horizon of play $T$, achieves
a Nash regret of $O\left( \sqrt{\frac{k \log T}{T}} \right)$, where $k$
denotes the number of arms in the MAB instance. Since, for any algorithm, the
Nash regret is at least as large as its average regret (by the AM-GM
inequality), the known lower bound on average regret holds for Nash regret as
well. Therefore, our Nash regret guarantee is essentially tight. In addition,
we develop an anytime algorithm with a Nash regret guarantee of
$O\left( \sqrt{\frac{k \log T}{T}} \log T \right)$.
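As a quick numerical check of the AM-GM relation invoked above, the following
sketch compares the two regret notions for an arbitrary sequence of per-round
expected rewards. It is not the paper's algorithm; the helper names and the
reward sequence are invented for illustration, and per-round expected rewards
are assumed to be strictly positive so the geometric mean is well defined.

    # Toy comparison of average regret vs. Nash regret for a fixed sequence
    # of per-round expected rewards (illustrative only, not the paper's algorithm).
    import math

    def average_regret(opt_mean, expected_rewards):
        """Utilitarian notion: optimal mean minus the arithmetic mean of expected rewards."""
        return opt_mean - sum(expected_rewards) / len(expected_rewards)

    def nash_regret(opt_mean, expected_rewards):
        """Welfarist notion: optimal mean minus the geometric mean of expected rewards."""
        log_gm = sum(math.log(r) for r in expected_rewards) / len(expected_rewards)
        return opt_mean - math.exp(log_gm)

    mu_star = 0.9                   # optimal mean among the arms (assumed known here)
    rewards = [0.9, 0.2, 0.9, 0.9]  # expected reward of the arm pulled in each round
    ar = average_regret(mu_star, rewards)
    nr = nash_regret(mu_star, rewards)
    print(f"average regret = {ar:.4f}, Nash regret = {nr:.4f}")
    assert nr >= ar - 1e-12         # AM-GM: geometric mean <= arithmetic mean

Running this prints an average regret of 0.175 and a Nash regret of about
0.28, illustrating that the Nash regret of a reward sequence is never smaller
than its average regret.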
Related papers
- No-Regret Learning for Fair Multi-Agent Social Welfare Optimization [35.44934658941197]
We consider the problem of online multi-agent Nash social welfare (NSW) maximization.
We show that no algorithm can achieve sublinear regret.
We also show that logarithmic regret is possible whenever there exists one agent who is indifferent about different arms.
arXiv Detail & Related papers (2024-05-31T08:21:11Z) - Incentive-compatible Bandits: Importance Weighting No More [14.344759978208957]
We study the problem of incentive-compatible online learning with bandit feedback.
In this work we propose the first incentive-compatible algorithms that enjoy $O(\sqrt{KT})$ regret bounds.
arXiv Detail & Related papers (2024-05-10T13:57:13Z) - Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as the $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted$\text{OFUL}^{+}$ and Restarted $\text{SAVE}^{+}$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z) - Nash Regret Guarantees for Linear Bandits [12.494184403263338]
We develop an algorithm that achieves a Nash regret of $O\left( \sqrt{\frac{d\nu}{T}} \log( T |X|)\right)$.
In addition, addressing linear bandit instances in which the set of arms $X$ is not necessarily finite, we obtain a Nash regret upper bound of $O\left( \frac{d^{\frac{5}{4}}\nu^{\frac{1}{2}}}{\sqrt{T}} \log(T)\right)$.
arXiv Detail & Related papers (2023-10-03T12:58:10Z) - Learning Two-Player Mixture Markov Games: Kernel Function Approximation
and Correlated Equilibrium [157.0902680672422]
We consider learning Nash equilibria in two-player zero-sum Markov Games with nonlinear function approximation.
We propose a novel online learning algorithm to find a Nash equilibrium by minimizing the duality gap.
arXiv Detail & Related papers (2022-08-10T14:21:54Z) - Misspecified Gaussian Process Bandit Optimization [59.30399661155574]
Kernelized bandit algorithms have shown strong empirical and theoretical performance for this problem.
We introduce a misspecified kernelized bandit setting where the unknown function can be $\epsilon$-uniformly approximated by a function with a bounded norm in some Reproducing Kernel Hilbert Space (RKHS).
We show that our algorithm achieves optimal dependence on $\epsilon$ with no prior knowledge of misspecification.
arXiv Detail & Related papers (2021-11-09T09:00:02Z) - Optimal Order Simple Regret for Gaussian Process Bandits [6.84224661918159]
We show that a pure exploration algorithm attains a simple regret bound that is significantly tighter than the existing bounds.
We show that this regret bound is optimal up to logarithmic factors for cases where a lower bound for the kernel is known.
arXiv Detail & Related papers (2021-08-20T16:49:32Z) - Thresholded Lasso Bandit [70.17389393497125]
Thresholded Lasso bandit is an algorithm that estimates the vector defining the reward function as well as its sparse support.
We establish non-asymptotic regret upper bounds scaling as $\mathcal{O}( \log d + \sqrt{T} )$ in general, and as $\mathcal{O}( \log d + \log T )$ under the so-called margin condition.
arXiv Detail & Related papers (2020-10-22T19:14:37Z) - Impact of Representation Learning in Linear Bandits [83.17684841392754]
We study how representation learning can improve the efficiency of bandit problems.
We present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit.
arXiv Detail & Related papers (2020-10-13T16:35:30Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)