Bandit problems with fidelity rewards
- URL: http://arxiv.org/abs/2111.13026v1
- Date: Thu, 25 Nov 2021 11:09:43 GMT
- Title: Bandit problems with fidelity rewards
- Authors: G\'abor Lugosi, Ciara Pike-Burke, Pierre-Andr\'e Savalle
- Abstract summary: The fidelity bandits problem is a variant of the $K$-armed bandit problem in which the reward of each arm is augmented by a fidelity reward depending on how 'loyal' the player has been to that arm in the past.
In the loyalty-points model the amount of extra reward depends on the number of times the arm has previously been played.
In the subscription model the additional reward depends on the current number of consecutive draws of the arm.
- Score: 7.154621689269006
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The fidelity bandits problem is a variant of the $K$-armed bandit problem in
which the reward of each arm is augmented by a fidelity reward that provides
the player with an additional payoff depending on how 'loyal' the player has
been to that arm in the past. We propose two models for fidelity. In the
loyalty-points model the amount of extra reward depends on the number of times
the arm has previously been played. In the subscription model the additional
reward depends on the current number of consecutive draws of the arm. We
consider both stochastic and adversarial problems. Since single-arm strategies
are not always optimal in stochastic problems, the notion of regret in the
adversarial setting needs careful adjustment. We introduce three possible
notions of regret and investigate which can be bounded sublinearly. We study in
detail the special cases of increasing, decreasing and coupon (where the player
gets an additional reward after every $m$ plays of an arm) fidelity rewards.
For the models which do not necessarily enjoy sublinear regret, we provide a
worst case lower bound. For those models which exhibit sublinear regret, we
provide algorithms and bound their regret.
Related papers
- Imprecise Multi-Armed Bandits [0.0]
We introduce a novel multi-armed bandit framework, where each arm is associated with a fixed unknown credal set over the space of outcomes.
We then define a notion of regret corresponding to the lower prevision defined by these credal sets.
arXiv Detail & Related papers (2024-05-09T10:58:40Z) - Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate
Reward Hacking [63.666119126351965]
Reward models play a key role in aligning language model applications towards human preferences.
A natural mitigation is to train an ensemble of reward models, aggregating over model outputs to obtain a more robust reward estimate.
We show that reward ensembles do not eliminate reward hacking because all reward models in the ensemble exhibit similar error patterns.
arXiv Detail & Related papers (2023-12-14T18:59:04Z) - Bandits Meet Mechanism Design to Combat Clickbait in Online
Recommendation [50.469872635246176]
We study a strategic variant of the multi-armed bandit problem, which we coin the strategic click-bandit.
This model is motivated by applications in online recommendation where the choice of recommended items depends on both the click-through rates and the post-click rewards.
arXiv Detail & Related papers (2023-11-27T09:19:01Z) - BanditQ: Fair Bandits with Guaranteed Rewards [10.74025233418392]
Classic no-regret multi-armed bandit algorithms are inherently unfair by design.
We propose a new online policy, called BanditQ, that achieves the target reward rates while conceding a regret and target rate violation penalty.
The proposed policy is efficient and admits a black-box reduction from the fair prediction problem to the standard adversarial MAB problem.
arXiv Detail & Related papers (2023-04-11T13:39:47Z) - Multiple-Play Stochastic Bandits with Shareable Finite-Capacity Arms [32.313813562222066]
We generalize the multiple-play multi-armed bandits (MP-MAB) problem with a shareable arm setting.
Each shareable arm has a finite reward capacity and a ''per-load'' reward distribution.
We propose an online learning algorithm to address the problem and prove its regret upper bound.
arXiv Detail & Related papers (2022-06-17T13:47:27Z) - The Pareto Frontier of Instance-Dependent Guarantees in Multi-Player
Multi-Armed Bandits with no Communication [10.446001329147112]
We study the multi-player multi-armed bandit problem.
In this problem, $m$ players cooperate to maximize their total reward from $K > m$ arms.
We ask whether it is possible to obtain optimal instance-dependent regret $tildeO (1/Delta)$ where $Delta$ is the gap between the $m$-th and $m+1$-st best arms.
arXiv Detail & Related papers (2022-02-19T18:19:36Z) - Combinatorial Bandits without Total Order for Arms [52.93972547896022]
We present a reward model that captures set-dependent reward distribution and assumes no total order for arms.
We develop a novel regret analysis and show an $Oleft(frack2 n log Tepsilonright)$ gap-dependent regret bound as well as an $Oleft(k2sqrtn T log Tright)$ gap-independent regret bound.
arXiv Detail & Related papers (2021-03-03T23:08:59Z) - Online Model Selection: a Rested Bandit Formulation [49.69377391589057]
We introduce and analyze a best arm identification problem in the rested bandit setting.
We define a novel notion of regret for this problem, where we compare to the policy that always plays the arm having the smallest expected loss at the end of the game.
Unlike known model selection efforts in the recent bandit literature, our algorithm exploits the specific structure of the problem to learn the unknown parameters of the expected loss function.
arXiv Detail & Related papers (2020-12-07T08:23:08Z) - Robustness Guarantees for Mode Estimation with an Application to Bandits [131.21717367564963]
We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions instead of the mean.
We show in simulations that our algorithms are robust to perturbation of the arms by adversarial noise sequences.
arXiv Detail & Related papers (2020-03-05T21:29:27Z) - Tight Lower Bounds for Combinatorial Multi-Armed Bandits [72.56064196252498]
The Combinatorial Multi-Armed Bandit problem is a sequential decision-making problem in which an agent selects a set of arms on each round.
We show that the recently proposed Gini-weighted smoothness parameter determines the lower bounds for monotone reward functions.
arXiv Detail & Related papers (2020-02-13T08:53:43Z) - Selfish Robustness and Equilibria in Multi-Player Bandits [25.67398941667429]
In a game, several players simultaneously pull arms and encounter a collision - with 0 reward - if some of them pull the same arm at the same time.
While the cooperative case where players maximize the collective reward has been mostly considered, to malicious players is a crucial and challenging concern.
We shall consider instead the more natural class of selfish players whose incentives are to maximize their individual rewards, potentially at the expense of the social welfare.
arXiv Detail & Related papers (2020-02-04T09:50:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.