Mitigating Disparity while Maximizing Reward: Tight Anytime Guarantee
for Improving Bandits
- URL: http://arxiv.org/abs/2208.09254v1
- Date: Fri, 19 Aug 2022 10:23:40 GMT
- Title: Mitigating Disparity while Maximizing Reward: Tight Anytime Guarantee
for Improving Bandits
- Authors: Vishakha Patil, Vineet Nair, Ganesh Ghalme, Arindam Khan
- Abstract summary: We study the Improving Multi-Armed Bandit (IMAB) problem, where the reward obtained from an arm increases with the number of pulls it receives.
Our main contribution is an anytime algorithm for the IMAB problem that achieves the best possible cumulative reward.
- Score: 6.537940428724029
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the Improving Multi-Armed Bandit (IMAB) problem, where the reward
obtained from an arm increases with the number of pulls it receives. This model
provides an elegant abstraction for many real-world problems in domains such as
education and employment, where decisions about the distribution of
opportunities can affect the future capabilities of communities and the
disparity between them. A decision-maker in such settings must consider the
impact of her decisions on future rewards in addition to the standard objective
of maximizing her cumulative reward at any time. In many of these applications,
the time horizon is unknown to the decision-maker beforehand, which motivates
the study of the IMAB problem in the technically more challenging
horizon-unaware setting. We study the tension that arises between two seemingly
conflicting objectives in the horizon-unaware setting: a) maximizing the
cumulative reward at any time based on current rewards of the arms, and b)
ensuring that arms with better long-term rewards get sufficient opportunities
even if they initially have low rewards. We show that, surprisingly, the two
objectives are aligned with each other in this setting. Our main contribution
is an anytime algorithm for the IMAB problem that achieves the best possible
cumulative reward while ensuring that the arms reach their true potential given
sufficient time. Our algorithm mitigates the initial disparity due to lack of
opportunity and continues pulling an arm till it stops improving. We prove the
optimality of our algorithm by showing that a) any algorithm for the IMAB
problem, no matter how utilitarian, must suffer $\Omega(T)$ policy regret and
$\Omega(k)$ competitive ratio with respect to the optimal offline policy, and
b) the competitive ratio of our algorithm is $O(k)$.
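To make the setting concrete, the Python sketch below simulates a toy IMAB instance (each arm's mean reward is non-decreasing in its own pull count) together with a naive heuristic in the spirit of the abstract's description: give every arm opportunities and keep pulling an arm until it stops improving. This is only an illustrative sketch under assumed reward curves; the class and function names, the saturating growth model, and the `patience` parameter are inventions of this example, not the paper's actual algorithm or its $O(k)$ guarantee.

```python
# Hedged, illustrative sketch of the IMAB setting; NOT the paper's algorithm.
import random


class ImprovingArm:
    """Toy arm whose mean reward is non-decreasing in the number of pulls."""

    def __init__(self, ceiling, growth_rate, noise=0.05):
        self.ceiling = ceiling          # long-run reward the arm can reach
        self.growth_rate = growth_rate  # how fast it approaches that ceiling
        self.noise = noise
        self.pulls = 0

    def pull(self):
        # Mean reward grows with the pull count and saturates at the ceiling.
        mean = self.ceiling * (1 - (1 - self.growth_rate) ** self.pulls)
        self.pulls += 1
        return mean + random.uniform(-self.noise, self.noise)


def pull_until_stalled(arms, horizon, patience=5):
    """Naive heuristic: give each arm opportunities and keep pulling it until
    its observed reward stops improving for `patience` consecutive pulls,
    then commit to the best arm found."""
    total, t = 0.0, 0
    best_seen = [float("-inf")] * len(arms)
    for i, arm in enumerate(arms):      # phase 1: let every arm improve
        stalled = 0
        while stalled < patience and t < horizon:
            r = arm.pull()
            total += r
            t += 1
            stalled = 0 if r > best_seen[i] else stalled + 1
            best_seen[i] = max(best_seen[i], r)
    champion = arms[best_seen.index(max(best_seen))]
    while t < horizon:                  # phase 2: exploit the best arm found
        total += champion.pull()
        t += 1
    return total


if __name__ == "__main__":
    random.seed(0)
    # The second arm starts weak but has the higher long-run potential,
    # mirroring the opportunity-allocation motivation in the abstract.
    arms = [ImprovingArm(ceiling=0.6, growth_rate=0.5),
            ImprovingArm(ceiling=0.9, growth_rate=0.05)]
    print(f"toy cumulative reward over T=200: {pull_until_stalled(arms, 200):.2f}")
```

In this toy instance the heuristic only discovers the second arm's higher ceiling because it keeps pulling it while it improves, which is the tension between short-term reward and long-term potential that the abstract describes.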
Related papers
- Stochastic Bandits for Egalitarian Assignment [58.33714486693828]
We study EgalMAB, an egalitarian assignment problem in the context of multi-armed bandits.
We design and analyze a UCB-based policy EgalUCB and establish upper bounds on the cumulative regret.
arXiv Detail & Related papers (2024-10-08T09:49:47Z)
- Robust and Performance Incentivizing Algorithms for Multi-Armed Bandits with Strategic Agents [57.627352949446625]
We consider a variant of the multi-armed bandit problem.
Specifically, the arms are strategic agents who can improve their rewards or absorb them.
We identify a class of MAB algorithms which satisfy a collection of properties and show that they lead to mechanisms that incentivize top level performance at equilibrium.
arXiv Detail & Related papers (2023-12-13T06:54:49Z)
- Bandits Meet Mechanism Design to Combat Clickbait in Online Recommendation [50.469872635246176]
We study a strategic variant of the multi-armed bandit problem, which we coin the strategic click-bandit.
This model is motivated by applications in online recommendation where the choice of recommended items depends on both the click-through rates and the post-click rewards.
arXiv Detail & Related papers (2023-11-27T09:19:01Z)
- Budgeted Multi-Armed Bandits with Asymmetric Confidence Intervals [0.9831489366502302]
We study the Budgeted Multi-Armed Bandit (MAB) problem, where a player chooses from $K$ arms with unknown expected rewards and costs.
We propose a new upper confidence bound (UCB) sampling policy, $\omega$-UCB, that uses asymmetric confidence intervals.
These intervals scale with the distance between the sample mean and the bounds of a random variable, yielding a more accurate and tighter estimate of the reward-cost ratio.
arXiv Detail & Related papers (2023-06-12T12:35:16Z)
- Achieving the Pareto Frontier of Regret Minimization and Best Arm Identification in Multi-Armed Bandits [91.8283876874947]
We design and analyze the BoBW-lil'UCB$(\gamma)$ algorithm.
We show that no algorithm can simultaneously perform optimally for both the regret minimization (RM) and best arm identification (BAI) objectives.
We also show that BoBW-lil'UCB$(\gamma)$ outperforms a competitor in terms of time complexity and regret.
arXiv Detail & Related papers (2021-10-16T17:52:32Z)
- Addressing the Long-term Impact of ML Decisions via Policy Regret [49.92903850297013]
We study a setting in which the reward from each arm evolves every time the decision-maker pulls that arm.
We argue that an acceptable sequential allocation of opportunities must take an arm's potential for growth into account.
We present an algorithm with provably sub-linear policy regret for sufficiently long time horizons.
arXiv Detail & Related papers (2021-06-02T17:38:10Z)
- Adaptive Algorithms for Multi-armed Bandit with Composite and Anonymous Feedback [32.62857394584907]
We study the multi-armed bandit (MAB) problem with composite and anonymous feedback.
We propose adaptive algorithms for both the adversarial and non-adversarial cases.
arXiv Detail & Related papers (2020-12-13T12:25:41Z)
- Lenient Regret for Multi-Armed Bandits [72.56064196252498]
We consider the Multi-Armed Bandit (MAB) problem, where an agent sequentially chooses actions and observes rewards for the actions it took.
While the majority of algorithms try to minimize the regret, i.e., the cumulative difference between the reward of the best action and the agent's action, this criterion might lead to undesirable results.
We suggest a new, more lenient regret criterion that ignores suboptimality gaps smaller than some $\epsilon$.
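As a hedged illustration only (the notation here is mine, and the paper may define the criterion more generally), one natural instantiation replaces the standard regret $\sum_{t=1}^{T} \Delta_{a_t}$ with $\sum_{t=1}^{T} \Delta_{a_t}\,\mathbf{1}[\Delta_{a_t} > \epsilon]$, where $\Delta_{a_t}$ is the suboptimality gap of the action pulled at round $t$: pulls whose gap is at most $\epsilon$ contribute nothing to the regret.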
arXiv Detail & Related papers (2020-08-10T08:30:52Z)
- A Farewell to Arms: Sequential Reward Maximization on a Budget with a Giving Up Option [5.1629297054995265]
We consider a sequential decision-making problem where an agent can take one action at a time and each action has a temporal extent.
We introduce an upper-confidence-based algorithm, WAIT-UCB, for which we establish a logarithmic, problem-dependent regret bound.
arXiv Detail & Related papers (2020-03-06T22:16:20Z)