Bandit problems with fidelity rewards
- URL: http://arxiv.org/abs/2111.13026v1
- Date: Thu, 25 Nov 2021 11:09:43 GMT
- Title: Bandit problems with fidelity rewards
- Authors: Gábor Lugosi, Ciara Pike-Burke, Pierre-André Savalle
- Abstract summary: The fidelity bandits problem is a variant of the $K$-armed bandit problem in which the reward of each arm is augmented by a fidelity reward depending on how 'loyal' the player has been to that arm in the past.
In the loyalty-points model the amount of extra reward depends on the number of times the arm has previously been played.
In the subscription model the additional reward depends on the current number of consecutive draws of the arm.
- Score: 7.154621689269006
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The fidelity bandits problem is a variant of the $K$-armed bandit problem in
which the reward of each arm is augmented by a fidelity reward that provides
the player with an additional payoff depending on how 'loyal' the player has
been to that arm in the past. We propose two models for fidelity. In the
loyalty-points model the amount of extra reward depends on the number of times
the arm has previously been played. In the subscription model the additional
reward depends on the current number of consecutive draws of the arm. We
consider both stochastic and adversarial problems. Since single-arm strategies
are not always optimal in stochastic problems, the notion of regret in the
adversarial setting needs careful adjustment. We introduce three possible
notions of regret and investigate which can be bounded sublinearly. We study in
detail the special cases of increasing, decreasing and coupon (where the player
gets an additional reward after every $m$ plays of an arm) fidelity rewards.
For the models which do not necessarily enjoy sublinear regret, we provide a
worst case lower bound. For those models which exhibit sublinear regret, we
provide algorithms and bound their regret.
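The two fidelity models differ only in which play count feeds the fidelity reward: the total number of past plays of the arm (loyalty points) or the length of the current run of consecutive plays (subscription). Below is a minimal Python sketch, not taken from the paper, simulating a single pull under either model with a coupon-style fidelity reward; the base reward distribution, the helper names (coupon_fidelity, play) and the coupon period m are illustrative assumptions.

    import random

    def coupon_fidelity(count, m=5, bonus=1.0):
        # Coupon-style fidelity: an extra payoff after every m plays of an arm.
        return bonus if count > 0 and count % m == 0 else 0.0

    def play(arm, history, model="loyalty", fidelity=coupon_fidelity):
        # Return base reward + fidelity reward for pulling `arm`.
        # history: arms pulled so far, most recent last.
        base = random.random()  # stand-in for the arm's stochastic base reward
        if model == "loyalty":
            # Loyalty points: count how often this arm has previously been played.
            count = sum(1 for a in history if a == arm)
        else:
            # Subscription: count the current run of consecutive draws, including this one.
            count = 1
            for a in reversed(history):
                if a != arm:
                    break
                count += 1
        return base + fidelity(count)

    # Usage: a player who stays loyal to arm 0 collects a coupon bonus every m-th pull.
    history, total = [], 0.0
    for t in range(20):
        total += play(0, history, model="loyalty")
        history.append(0)
    print(total)

In the sketch a single-arm strategy accumulates fidelity bonuses that switching would forfeit, which is why, as the abstract notes, single-arm strategies need not be optimal and the notion of regret requires care.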
Related papers
- Bridging Rested and Restless Bandits with Graph-Triggering: Rising and Rotting [67.1631453378926]
Graph-Triggered Bandits is a framework to generalize rested and restless bandits.
In this work, we focus on two specific types of monotonic bandits: rising, where the expected reward of an arm grows as the number of triggers increases, and rotting, where the opposite behavior occurs.
arXiv Detail & Related papers (2024-09-09T18:23:07Z) - Multi-agent Multi-armed Bandits with Stochastic Sharable Arm Capacities [69.34646544774161]
We formulate a new variant of multi-player multi-armed bandit (MAB) model, which captures arrival of requests to each arm and the policy of allocating requests to players.
The challenge is how to design a distributed learning algorithm such that players select arms according to the optimal arm pulling profile.
We design an iterative distributed algorithm, which guarantees that players can arrive at a consensus on the optimal arm pulling profile in only M rounds.
arXiv Detail & Related papers (2024-08-20T13:57:00Z) - Honor Among Bandits: No-Regret Learning for Online Fair Division [20.38824614301761]
We consider the problem of online fair division of indivisible goods to players when there are a finite number of types of goods and player values are drawn from distributions with unknown means.
Our main result is the design of an explore-then-commit algorithm that achieves $\tilde{O}(T^{2/3})$ regret while maintaining either fairness constraint.
arXiv Detail & Related papers (2024-07-01T20:44:52Z) - Imprecise Multi-Armed Bandits [0.0]
We introduce a novel multi-armed bandit framework, where each arm is associated with a fixed unknown credal set over the space of outcomes.
We then define a notion of regret corresponding to the lower prevision defined by these credal sets.
arXiv Detail & Related papers (2024-05-09T10:58:40Z) - Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking [62.146953368613815]
Reward models play a key role in aligning language model applications towards human preferences.
A natural mitigation is to train an ensemble of reward models, aggregating over model outputs to obtain a more robust reward estimate.
We show that reward ensembles do not eliminate reward hacking because all reward models in the ensemble exhibit similar error patterns.
arXiv Detail & Related papers (2023-12-14T18:59:04Z) - BanditQ: Fair Bandits with Guaranteed Rewards [10.74025233418392]
Classic no-regret multi-armed bandit algorithms are inherently unfair by design.
We propose a new online policy, called BanditQ, that achieves the target reward rates while conceding a regret and target rate violation penalty.
The proposed policy is efficient and admits a black-box reduction from the fair prediction problem to the standard adversarial MAB problem.
arXiv Detail & Related papers (2023-04-11T13:39:47Z) - Multiple-Play Stochastic Bandits with Shareable Finite-Capacity Arms [32.313813562222066]
We generalize the multiple-play multi-armed bandits (MP-MAB) problem with a shareable arm setting.
Each shareable arm has a finite reward capacity and a 'per-load' reward distribution.
We propose an online learning algorithm to address the problem and prove its regret upper bound.
arXiv Detail & Related papers (2022-06-17T13:47:27Z) - Combinatorial Bandits without Total Order for Arms [52.93972547896022]
We present a reward model that captures set-dependent reward distribution and assumes no total order for arms.
We develop a novel regret analysis and show an $O\left(\frac{k^2 n \log T}{\epsilon}\right)$ gap-dependent regret bound as well as an $O\left(k^2\sqrt{n T \log T}\right)$ gap-independent regret bound.
arXiv Detail & Related papers (2021-03-03T23:08:59Z) - Robustness Guarantees for Mode Estimation with an Application to Bandits [131.21717367564963]
We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions instead of the mean.
We show in simulations that our algorithms are robust to perturbation of the arms by adversarial noise sequences.
arXiv Detail & Related papers (2020-03-05T21:29:27Z) - Tight Lower Bounds for Combinatorial Multi-Armed Bandits [72.56064196252498]
The Combinatorial Multi-Armed Bandit problem is a sequential decision-making problem in which an agent selects a set of arms on each round.
We show that the recently proposed Gini-weighted smoothness parameter determines the lower bounds for monotone reward functions.
arXiv Detail & Related papers (2020-02-13T08:53:43Z) - Selfish Robustness and Equilibria in Multi-Player Bandits [25.67398941667429]
In a game, several players simultaneously pull arms and encounter a collision - with 0 reward - if some of them pull the same arm at the same time.
While the cooperative case where players maximize the collective reward has been mostly considered, robustness to malicious players is a crucial and challenging concern.
We shall consider instead the more natural class of selfish players whose incentives are to maximize their individual rewards, potentially at the expense of the social welfare.
arXiv Detail & Related papers (2020-02-04T09:50:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.