BanditQ: Fair Bandits with Guaranteed Rewards
- URL: http://arxiv.org/abs/2304.05219v3
- Date: Sun, 12 May 2024 16:05:30 GMT
- Title: BanditQ: Fair Bandits with Guaranteed Rewards
- Authors: Abhishek Sinha
- Abstract summary: Classic no-regret multi-armed bandit algorithms are inherently unfair by design.
We propose a new online policy, called BanditQ, that achieves the target reward rates while conceding a regret and target rate violation penalty.
The proposed policy is efficient and admits a black-box reduction from the fair prediction problem to the standard adversarial MAB problem.
- Score: 10.74025233418392
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Classic no-regret multi-armed bandit algorithms, including the Upper Confidence Bound (UCB), Hedge, and EXP3, are inherently unfair by design. Their unfairness stems from their objective of playing the most rewarding arm as frequently as possible while ignoring the rest. In this paper, we consider a fair prediction problem in the stochastic setting with a guaranteed minimum rate of accrual of rewards for each arm. We study the problem in both full-information and bandit feedback settings. Combining queueing-theoretic techniques with adversarial bandits, we propose a new online policy, called BanditQ, that achieves the target reward rates while conceding a regret and target rate violation penalty of at most $O(T^{\frac{3}{4}}).$ The regret bound in the full-information setting can be further improved to $O(\sqrt{T})$ under either a monotonicity assumption or when considering time-averaged regret. The proposed policy is efficient and admits a black-box reduction from the fair prediction problem to the standard adversarial MAB problem. The analysis of the BanditQ policy involves a new self-bounding inequality, which might be of independent interest.
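To make the queueing-plus-adversarial-bandits idea in the abstract concrete, below is a minimal Python sketch of one way such a reduction could look: each arm keeps a deficit queue measuring how far it is behind its target reward rate, and an EXP3 learner is fed queue-weighted surrogate rewards so that under-served arms are pulled more often. The class name `QueueWeightedEXP3`, the surrogate weighting `(1 + Q_i)`, and all constants are assumptions made for illustration; this is not the BanditQ policy from the paper and carries none of its $O(T^{3/4})$ guarantees.

```python
import numpy as np

class QueueWeightedEXP3:
    """Toy BanditQ-style policy: per-arm deficit queues feeding an EXP3 learner.

    Illustrative sketch only. The queue-weighted surrogate reward, the
    learning-rate/exploration constants, and the class name are assumptions;
    they do not reproduce the paper's exact policy or its guarantees.
    """

    def __init__(self, n_arms, targets, eta=0.05, gamma=0.05, seed=0):
        self.n = n_arms
        self.targets = np.asarray(targets, dtype=float)  # target reward rates lambda_i
        self.eta = eta                                   # EXP3 learning rate (assumed value)
        self.gamma = gamma                               # uniform exploration share (assumed value)
        self.weights = np.ones(n_arms)
        self.queues = np.zeros(n_arms)                   # per-arm deficit queues Q_i
        self.rng = np.random.default_rng(seed)

    def _probs(self):
        w = self.weights / self.weights.sum()
        p = (1.0 - self.gamma) * w + self.gamma / self.n
        return p / p.sum()

    def select_arm(self):
        p = self._probs()
        return int(self.rng.choice(self.n, p=p)), p

    def update(self, arm, reward, probs):
        # Surrogate reward: arms with larger deficit queues look more attractive (assumed form).
        surrogate = (1.0 + self.queues[arm]) * reward
        est = np.zeros(self.n)
        est[arm] = surrogate / probs[arm]          # importance-weighted reward estimate
        self.weights *= np.exp(self.eta * est / self.n)
        self.weights /= self.weights.max()         # rescale for numerical stability
        # Queue dynamics: accrue the target rate, drain by the reward actually earned.
        earned = np.zeros(self.n)
        earned[arm] = reward
        self.queues = np.maximum(self.queues + self.targets - earned, 0.0)


if __name__ == "__main__":
    # Toy instance: arm 1 has the higher mean, but arm 0 must accrue reward at rate 0.2.
    means = [0.3, 0.7]
    policy = QueueWeightedEXP3(n_arms=2, targets=[0.2, 0.0])
    rng = np.random.default_rng(1)
    pulls = np.zeros(2)
    for _ in range(20000):
        arm, p = policy.select_arm()
        r = float(rng.random() < means[arm])       # Bernoulli reward
        policy.update(arm, r, p)
        pulls[arm] += 1
    print("pull fractions:", pulls / pulls.sum())
```

With these numbers, roughly two thirds of the pulls should go to arm 0 (about the minimum fraction needed to meet its 0.2 target rate given a 0.3 mean reward), with the remainder going to the higher-mean arm.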
Related papers
- $\alpha$-Fair Contextual Bandits [10.74025233418392]
Contextual bandit algorithms are at the core of many applications, including recommender systems, clinical trials, and optimal portfolio selection.
One of the most popular problems studied in the contextual bandit literature is to maximize the sum of the rewards in each round.
In this paper, we consider the $\alpha$-Fair Contextual Bandits problem, where the objective is to maximize the global $\alpha$-fair utility function.
arXiv Detail & Related papers (2023-10-22T03:42:59Z) - Budgeted Multi-Armed Bandits with Asymmetric Confidence Intervals [0.9831489366502302]
We study the Budgeted Multi-Armed Bandit (MAB) problem, where a player chooses from $K$ arms with unknown expected rewards and costs.
We propose a new upper confidence bound (UCB) sampling policy, $\omega$-UCB, that uses asymmetric confidence intervals.
These intervals scale with the distance between the sample mean and the bounds of a random variable, yielding a more accurate and tight estimation of the reward-cost ratio.
arXiv Detail & Related papers (2023-06-12T12:35:16Z) - Contextual bandits with concave rewards, and an application to fair ranking [108.48223948875685]
We present the first algorithm with provably vanishing regret for Contextual Bandits with Concave Rewards (CBCR).
We derive a novel reduction from the CBCR regret to the regret of a scalar-reward problem.
Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives.
arXiv Detail & Related papers (2022-10-18T16:11:55Z) - The price of unfairness in linear bandits with biased feedback [62.25313751895011]
We study the problem of sequential decision making with biased linear bandit feedback.
We show that the worst-case regret is higher than the $d\sqrt{T}\log(T)$ regret rate obtained under unbiased feedback.
Interestingly, the gap-dependent rates reveal the existence of non-trivial instances where the problem is no more difficult than its unbiased counterpart.
arXiv Detail & Related papers (2022-03-18T08:03:20Z) - Dare not to Ask: Problem-Dependent Guarantees for Budgeted Bandits [66.02233330016435]
We provide problem-dependent guarantees on both the regret and the amount of feedback requested.
We present a new algorithm called BuFALU for which we derive problem-dependent regret and cumulative feedback bounds.
arXiv Detail & Related papers (2021-10-12T03:24:57Z) - Confidence-Budget Matching for Sequential Budgeted Learning [69.77435313099366]
We formalize decision-making problems with a querying budget.
We consider multi-armed bandits, linear bandits, and reinforcement learning problems.
We show that CBM-based algorithms perform well in the presence of adversity.
arXiv Detail & Related papers (2021-02-05T19:56:31Z) - Nonstationary Stochastic Multiarmed Bandits: UCB Policies and Minimax Regret [5.1398743023989555]
We study the nonstationary Multi-Armed Bandit (MAB) problem, in which the reward distribution associated with each arm is assumed to be time-varying.
We characterize the performance of the proposed policies in terms of the worst-case regret, which is the supremum of the regret over the set of reward distribution sequences satisfying the variation budget.
arXiv Detail & Related papers (2021-01-22T07:34:09Z) - A One-Size-Fits-All Solution to Conservative Bandit Problems [32.907883501286]
We study a family of conservative bandit problems (CBPs) with sample-path reward constraints.
We propose a One-Size-Fits-All solution to CBPs and present its applications to three encompassed problems.
arXiv Detail & Related papers (2020-12-14T08:50:23Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Contextual Blocking Bandits [35.235375147227124]
We study a novel variant of the multi-armed bandit problem, where at each time step, the player observes an independently sampled context that determines the arms' mean rewards.
Playing an arm blocks it (across all contexts) for a fixed and known number of future time steps.
We propose a UCB-based variant of the full-information algorithm that guarantees an $\mathcal{O}(\log T)$-regret w.r.t. an $\alpha$-optimal strategy in $T$ time steps, matching the $\Omega(\log T)$ lower bound.
arXiv Detail & Related papers (2020-03-06T20:34:42Z) - Robustness Guarantees for Mode Estimation with an Application to Bandits [131.21717367564963]
We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions instead of the means.
We show in simulations that our algorithms are robust to perturbation of the arms by adversarial noise sequences.
arXiv Detail & Related papers (2020-03-05T21:29:27Z)