Revisiting Social Welfare in Bandits: UCB is (Nearly) All You Need
- URL: http://arxiv.org/abs/2510.21312v1
- Date: Fri, 24 Oct 2025 10:15:22 GMT
- Title: Revisiting Social Welfare in Bandits: UCB is (Nearly) All You Need
- Authors: Dhruv Sarkar, Nishant Pandey, Sayak Ray Chowdhury
- Abstract summary: We introduce Nash regret, which evaluates performance via the geometric mean of accumulated rewards. We demonstrate that an initial uniform exploration phase followed by a standard Upper Confidence Bound (UCB) algorithm achieves near-optimal Nash regret. We generalize the algorithm to a broad class of fairness metrics called the $p$-mean regret, proving (nearly) optimal regret bounds uniformly across all $p$ values.
- Score: 9.443816944974419
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Regret in stochastic multi-armed bandits traditionally measures the difference between the highest reward and either the arithmetic mean of accumulated rewards or the final reward. These conventional metrics often fail to address fairness among agents receiving rewards, particularly in settings where rewards are distributed across a population, such as patients in clinical trials. To address this, a recent body of work has introduced Nash regret, which evaluates performance via the geometric mean of accumulated rewards, aligning with the Nash social welfare function known for satisfying fairness axioms. To minimize Nash regret, existing approaches require specialized algorithm designs and strong assumptions, such as multiplicative concentration inequalities and bounded, non-negative rewards, making them unsuitable for even Gaussian reward distributions. We demonstrate that an initial uniform exploration phase followed by a standard Upper Confidence Bound (UCB) algorithm achieves near-optimal Nash regret, while relying only on additive Hoeffding bounds, and naturally extending to sub-Gaussian rewards. Furthermore, we generalize the algorithm to a broad class of fairness metrics called the $p$-mean regret, proving (nearly) optimal regret bounds uniformly across all $p$ values. This is in contrast to prior work, which made extremely restrictive assumptions on the bandit instances and even then achieved suboptimal regret bounds.
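The abstract's recipe is concrete enough to sketch in code. Below is a minimal, hypothetical Python illustration, assuming Gaussian rewards, a fixed per-arm warm-up length, and a standard Hoeffding-style UCB bonus; none of these constants are the paper's exact choices, and the regret computation uses the pulled arms' true means as a stand-in for the formal definition.

```python
import numpy as np

def ucb_with_uniform_warmup(means, T, warmup_per_arm=10, rng=None):
    """Phase 1: pull every arm uniformly; Phase 2: standard UCB.

    Returns the true mean of the arm pulled in each round, from which
    welfare-style regret metrics are computed below.
    """
    rng = np.random.default_rng(rng)
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    pulled = np.zeros(T)
    for t in range(T):
        if t < warmup_per_arm * K:
            arm = t % K  # uniform exploration phase
        else:
            # Additive Hoeffding-style confidence bonus (no multiplicative
            # concentration needed, matching the abstract's selling point).
            bonus = np.sqrt(2.0 * np.log(T) / counts)
            arm = int(np.argmax(sums / counts + bonus))
        counts[arm] += 1
        sums[arm] += means[arm] + rng.normal()  # sub-Gaussian noise
        pulled[t] = means[arm]
    return pulled

def nash_regret(pulled, best_mean):
    # Gap between the best mean and the geometric mean of accrued rewards.
    return best_mean - np.exp(np.mean(np.log(pulled)))

def p_mean_regret(pulled, best_mean, p):
    # Power-mean generalization; p -> 0 recovers the Nash (geometric) case.
    if p == 0:
        return nash_regret(pulled, best_mean)
    return best_mean - np.mean(pulled ** p) ** (1.0 / p)

means = np.array([0.9, 0.7, 0.5])
pulled = ucb_with_uniform_warmup(means, T=10_000, rng=0)
print(nash_regret(pulled, means.max()), p_mean_regret(pulled, means.max(), -1.0))
```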
Related papers
- Incentivized Lipschitz Bandits [7.465238700168576]
We study incentivized exploration in multi-armed bandit (MAB) settings with infinitely many arms modeled as elements in continuous metric spaces. We propose novel incentivized exploration algorithms that discretize the infinite arm space uniformly and demonstrate that these algorithms simultaneously achieve sublinear cumulative regret and sublinear total compensation.
arXiv Detail & Related papers (2025-08-26T22:54:29Z)
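As a companion to the Incentivized Lipschitz Bandits entry above, here is a hedged Python sketch of the uniform-discretization idea it describes: grid the continuous arm space and hand the grid to a finite-armed learner. The $K \sim T^{1/3}$ grid size is the textbook Lipschitz-bandit balance, and the incentive/compensation mechanism that is the paper's actual focus is not modeled.

```python
import numpy as np

def uniform_discretization_ucb(reward_fn, T, rng=None):
    """Uniformly discretize the arm space [0, 1] and run UCB on the grid.

    K ~ T^(1/3) grid points balance discretization error against
    finite-armed regret for 1-Lipschitz mean-reward functions.
    (The paper's incentive payments to myopic agents are omitted.)
    """
    rng = np.random.default_rng(rng)
    K = max(2, int(np.ceil(T ** (1.0 / 3.0))))
    grid = np.linspace(0.0, 1.0, K)
    counts = np.zeros(K)
    sums = np.zeros(K)
    for t in range(T):
        if t < K:
            arm = t  # try every grid point once
        else:
            bonus = np.sqrt(2.0 * np.log(T) / counts)
            arm = int(np.argmax(sums / counts + bonus))
        counts[arm] += 1
        sums[arm] += reward_fn(grid[arm]) + rng.normal(scale=0.1)
    return grid[int(np.argmax(sums / counts))]

# Example with a 1-Lipschitz mean-reward curve peaking at x = 0.6.
print(uniform_discretization_ucb(lambda x: 1.0 - abs(x - 0.6), T=5_000, rng=0))
```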
- Continuous K-Max Bandits [54.21533414838677]
We study the $K$-Max multi-armed bandits problem with continuous outcome distributions and weak value-index feedback. This setting captures critical applications in recommendation systems, distributed computing, server scheduling, etc. Our key contribution is the computationally efficient algorithm DCK-UCB, which combines adaptive discretization with bias-corrected confidence bounds.
arXiv Detail & Related papers (2025-02-19T06:37:37Z)
- Tracking Most Significant Shifts in Infinite-Armed Bandits [3.591122855617648]
We study an infinite-armed bandit problem where actions' mean rewards are initially sampled from a reservoir distribution. We show the first parameter-free optimal regret bounds for all regimes of regularity of the reservoir. We show that tighter regret bounds in terms of significant shifts can be adaptively attained by employing a randomized variant of elimination.
arXiv Detail & Related papers (2025-01-31T19:00:21Z)
- Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as the variation budget $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted$\text{OFUL}^+$ and Restarted $\text{SAVE}^+$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z)
- Zero-Inflated Bandits [11.60342504007264]
We study zero-inflated bandits, where the reward is modeled using a classic semi-parametric distribution known as the zero-inflated distribution. We develop algorithms based on the Upper Confidence Bound and Thompson Sampling frameworks for this specific structure.
arXiv Detail & Related papers (2023-12-25T03:13:21Z)
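To make the Zero-Inflated Bandits entry above concrete, here is a small Python sketch of the reward model it names: with some probability the reward is exactly zero, otherwise it is drawn from a base distribution. The Gaussian base law and the parameter values are illustrative assumptions; the paper's UCB and Thompson Sampling variants, which exploit this structure, are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

def zero_inflated_sample(p_nonzero, base_sampler):
    """Zero-inflated draw: 0 with prob. 1 - p_nonzero, else a base-law draw."""
    return base_sampler() if rng.random() < p_nonzero else 0.0

# The mean factorizes as p_nonzero * E[base], so the Bernoulli part and the
# base part can be estimated separately, which is the structure the paper's
# algorithms are described as exploiting.
draws = [zero_inflated_sample(0.3, lambda: rng.normal(2.0, 1.0)) for _ in range(100_000)]
print(np.mean(draws))  # close to 0.3 * 2.0 = 0.6
```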
- Nash Regret Guarantees for Linear Bandits [12.494184403263338]
We develop an algorithm that achieves a Nash regret of $O\left(\sqrt{\frac{d\nu}{T}}\,\log(T|X|)\right)$.
In addition, addressing linear bandit instances in which the set of arms $X$ is not necessarily finite, we obtain a Nash regret upper bound of $O\left(\frac{d^{\frac{5}{4}}\nu^{\frac{1}{2}}}{\sqrt{T}}\,\log(T)\right)$.
arXiv Detail & Related papers (2023-10-03T12:58:10Z)
- STARC: A General Framework For Quantifying Differences Between Reward Functions [52.69620361363209]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics. We show that STARC metrics induce both an upper and a lower bound on worst-case regret. We also identify a number of issues with reward metrics proposed by earlier works.
arXiv Detail & Related papers (2023-09-26T20:31:19Z)
- Contextual bandits with concave rewards, and an application to fair ranking [108.48223948875685]
We present the first algorithm with provably vanishing regret for Contextual Bandits with Concave Rewards (CBCR).
We derive a novel reduction from the CBCR regret to the regret of a scalar-reward problem.
Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives.
arXiv Detail & Related papers (2022-10-18T16:11:55Z)
- Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits [88.21288104408556]
We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution is from a one-dimensional exponential family.
We propose a Thompson sampling algorithm, termed Expulli, which uses a novel sampling distribution to avoid the under-estimation of the optimal arm.
arXiv Detail & Related papers (2022-06-07T18:08:21Z)
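For context on the Thompson-sampling entry above, the following is a vanilla Gaussian Thompson sampling baseline in Python, not the paper's Expulli sampler; the entry's point is precisely that the standard posterior sample below can under-estimate the optimal arm, which the paper's modified sampling distribution is designed to avoid.

```python
import numpy as np

def gaussian_thompson_sampling(means, T, rng=None):
    """Vanilla Thompson sampling for unit-variance Gaussian rewards."""
    rng = np.random.default_rng(rng)
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    for t in range(T):
        if t < K:
            arm = t  # one initial pull per arm
        else:
            # Sample one index per arm from N(empirical mean, 1/counts)
            # and play the argmax; Expulli replaces this sampler.
            theta = rng.normal(sums / counts, 1.0 / np.sqrt(counts))
            arm = int(np.argmax(theta))
        counts[arm] += 1
        sums[arm] += means[arm] + rng.normal()
    return counts  # how often each arm was pulled

print(gaussian_thompson_sampling(np.array([0.5, 0.4]), T=2_000, rng=0))
```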
- Fairness and Welfare Quantification for Regret in Multi-Armed Bandits [13.094997642327371]
We extend the notion of regret with a welfarist perspective.
We study Nash regret, defined as the difference between the a priori unknown optimal mean (among the arms) and the algorithm's performance.
arXiv Detail & Related papers (2022-05-27T12:12:56Z)
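Since both the entry above and the main abstract invoke Nash regret without a displayed formula, here is a LaTeX reconstruction consistent with those abstracts, under assumed notation: $\mu^{*}$ is the optimal mean, $I_t$ the arm pulled in round $t$, and the expectation is over the algorithm's randomness. The papers' exact definitions may differ in where the expectation sits.

```latex
% Nash regret: gap between the optimal mean and the geometric mean
% of the expected rewards accrued over the horizon T.
\mathrm{NR}_T \;=\; \mu^{*} \;-\; \Bigl( \prod_{t=1}^{T} \mathbb{E}\bigl[\mu_{I_t}\bigr] \Bigr)^{1/T}

% p-mean regret: the power-mean generalization; p = 1 gives the usual
% average-reward regret, and p -> 0 recovers the Nash (geometric) case.
\mathrm{R}^{(p)}_T \;=\; \mu^{*} \;-\; \Bigl( \tfrac{1}{T} \sum_{t=1}^{T} \mathbb{E}\bigl[\mu_{I_t}\bigr]^{p} \Bigr)^{1/p}
```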
- Adaptive Discretization against an Adversary: Lipschitz bandits, Dynamic Pricing, and Auction Tuning [56.23358327635815]
Lipschitz bandits is a prominent version of multi-armed bandits that studies large, structured action spaces. A central theme here is the adaptive discretization of the action space, which gradually "zooms in" on the more promising regions. We provide the first algorithm for adaptive discretization in the adversarial version, and derive instance-dependent regret bounds.
arXiv Detail & Related papers (2020-06-22T16:06:25Z)