Bandits Meet Mechanism Design to Combat Clickbait in Online
Recommendation
- URL: http://arxiv.org/abs/2311.15647v1
- Date: Mon, 27 Nov 2023 09:19:01 GMT
- Title: Bandits Meet Mechanism Design to Combat Clickbait in Online
Recommendation
- Authors: Thomas Kleine Buening and Aadirupa Saha and Christos Dimitrakakis and
Haifeng Xu
- Abstract summary: We study a strategic variant of the multi-armed bandit problem, which we coin the strategic click-bandit.
This model is motivated by applications in online recommendation where the choice of recommended items depends on both the click-through rates and the post-click rewards.
- Score: 50.469872635246176
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study a strategic variant of the multi-armed bandit problem, which we coin
the strategic click-bandit. This model is motivated by applications in online
recommendation where the choice of recommended items depends on both the
click-through rates and the post-click rewards. Like in classical bandits,
rewards follow a fixed unknown distribution. However, we assume that the
click-rate of each arm is chosen strategically by the arm (e.g., a host on
Airbnb) in order to maximize the number of times it gets clicked. The algorithm
designer knows neither the post-click rewards nor the arms' actions (i.e., the
strategically chosen click-rates) in advance, and must learn both values over
time. To solve this problem, we design an incentive-aware learning algorithm,
UCB-S, which achieves two goals simultaneously: (a) incentivizing desirable arm
behavior under uncertainty; (b) minimizing regret by learning unknown
parameters. We characterize all approximate Nash equilibria among arms under
UCB-S and show a $\tilde{\mathcal{O}} (\sqrt{KT})$ regret bound uniformly in
every equilibrium. We also show that incentive-unaware algorithms generally
fail to achieve low regret in the strategic click-bandit. Finally, we support
our theoretical results by simulations of strategic arm behavior which confirm
the effectiveness and robustness of our proposed incentive design.
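As a rough illustration of the two goals above, the following is a minimal, self-contained Python sketch of an optimistic index policy combined with a confidence-based screening rule. It is not the paper's UCB-S mechanism: the click-rate target, the confidence radius, the Bernoulli reward model, and the fixed arm strategies are all illustrative assumptions. The point is only the shape of the idea: recommendations are optimistic with respect to the estimated per-round utility (click-rate times post-click reward), while an arm whose behavior is confidently below the desired level is screened out, and it is this screening threat that shapes the arms' incentives.

```python
import numpy as np

# Simulation of a stylized strategic click-bandit instance (all numbers are illustrative).
rng = np.random.default_rng(0)
K, T = 5, 20_000
post_click_mean = rng.uniform(0.2, 0.9, size=K)   # unknown to the learner
click_rate = rng.uniform(0.4, 1.0, size=K)        # chosen by the arms; held fixed in this sketch

counts = np.zeros(K)                # times each arm was recommended
clicks = np.zeros(K)                # observed clicks per arm
rewards = np.zeros(K)               # observed post-click rewards per arm
active = np.ones(K, dtype=bool)     # arms still eligible for recommendation

def radius(n, t):
    """Hoeffding-style confidence radius; the constant is an illustrative choice."""
    return np.sqrt(2.0 * np.log(max(t, 2)) / np.maximum(n, 1))

CLICK_TARGET = 0.6                  # hypothetical click-rate level the screening rule enforces

total_reward = 0.0
for t in range(1, T + 1):
    click_hat = clicks / np.maximum(counts, 1)
    reward_hat = rewards / np.maximum(clicks, 1)

    # Screening: once an arm's click-rate is confidently below the target,
    # stop recommending it.  This "threat" is what creates the incentive.
    active &= (click_hat + radius(counts, t)) >= CLICK_TARGET
    if not active.any():            # fallback for the sketch only
        active[:] = True

    # Optimistic index on the learner's per-round utility (click-rate x post-click reward).
    index = click_hat * reward_hat + radius(counts, t)
    index[~active] = -np.inf
    arm = int(np.argmax(index))

    clicked = rng.random() < click_rate[arm]
    r = float(rng.random() < post_click_mean[arm]) if clicked else 0.0
    counts[arm] += 1
    clicks[arm] += clicked
    rewards[arm] += r
    total_reward += r

print("average reward:", total_reward / T)
print("recommendation counts per arm:", counts.astype(int))
```

In the actual model the arms would best-respond to the screening rule rather than keep fixed click-rates; the simulation above only exercises the learner's side of the interaction.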
Related papers
- Bridging Rested and Restless Bandits with Graph-Triggering: Rising and Rotting [67.1631453378926]
Graph-Triggered Bandits is a framework to generalize rested and restless bandits.
In this work, we focus on two specific types of monotonic bandits: rising, where the expected reward of an arm grows as the number of triggers increases, and rotting, where the opposite behavior occurs.
arXiv Detail & Related papers (2024-09-09T18:23:07Z) - Multi-agent Multi-armed Bandits with Stochastic Sharable Arm Capacities [69.34646544774161]
We formulate a new variant of the multi-player multi-armed bandit (MAB) model that captures the arrival of requests to each arm and the policy for allocating requests to players.
The challenge is how to design a distributed learning algorithm such that players select arms according to the optimal arm pulling profile.
We design an iterative distributed algorithm, which guarantees that players can arrive at a consensus on the optimal arm pulling profile in only M rounds.
arXiv Detail & Related papers (2024-08-20T13:57:00Z) - Multi-Armed Bandits with Abstention [62.749500564313834]
We introduce a novel extension of the canonical multi-armed bandit problem that incorporates an additional strategic element: abstention.
In this enhanced framework, the agent is not only tasked with selecting an arm at each time step, but also has the option to abstain from accepting the instantaneous reward before observing it.
arXiv Detail & Related papers (2024-02-23T06:27:12Z) - A Minimaximalist Approach to Reinforcement Learning from Human Feedback [49.45285664482369]
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement learning from human feedback.
Our approach is minimalist in that it requires neither training a reward model nor unstable adversarial training.
We demonstrate that on a suite of continuous control tasks, we are able to learn significantly more efficiently than reward-model based approaches.
arXiv Detail & Related papers (2024-01-08T17:55:02Z) - Online Learning in Contextual Second-Price Pay-Per-Click Auctions [47.06746975822902]
We study online learning in contextual pay-per-click auctions where at each of the $T$ rounds, the learner receives some context along with a set of ads.
The learner's goal is to minimize her regret, defined as the gap between her total revenue and that of an oracle strategy.
arXiv Detail & Related papers (2023-10-08T07:04:22Z) - Best arm identification in rare events [0.43012765978447565]
A key application of this framework is online advertising, where the click rate of an advertisement may be a fraction of a percent, and the final conversion to a sale, while highly profitable, may again be only a small fraction of the clicks.
Lately, algorithms for BAI problems have been developed that minimise sample complexity while providing statistical guarantees on the correct arm selection.
We exploit the fact that the reward process for each arm is well approximated by a Compound Poisson process to arrive at algorithms that are faster, with a small increase in sample complexity.
arXiv Detail & Related papers (2023-03-14T04:51:24Z) - Incentivized Bandit Learning with Self-Reinforcing User Preferences [9.233886766950054]
We investigate a new multi-armed bandit (MAB) online learning model that considers real-world phenomena in many recommender systems.
We propose two MAB policies termed "At-Least-$n$ Explore-Then-Commit" and "UCB-List".
We prove that both policies achieve $O(\log T)$ expected regret with $O(\log T)$ expected payment over a time horizon $T$ (see the explore-then-commit sketch after this list).
arXiv Detail & Related papers (2021-05-19T01:06:32Z) - Combinatorial Bandits without Total Order for Arms [52.93972547896022]
We present a reward model that captures set-dependent reward distribution and assumes no total order for arms.
We develop a novel regret analysis and show an $O\left(\frac{k^2 n \log T}{\epsilon}\right)$ gap-dependent regret bound as well as an $O\left(k^2 \sqrt{n T \log T}\right)$ gap-independent regret bound.
arXiv Detail & Related papers (2021-03-03T23:08:59Z) - Combinatorial Bandits under Strategic Manipulations [25.882801456026584]
We study the problem of combinatorial multi-armed bandits (CMAB) under strategic manipulations of rewards, where each arm can modify the emitted reward signals for its own interest.
Our setting elaborates a more realistic model of adaptive arms that imposes relaxed assumptions compared to adversarial corruptions and adversarial attacks.
arXiv Detail & Related papers (2021-02-25T07:57:27Z) - Near Optimal Adversarial Attacks on Stochastic Bandits and Defenses with Smoothed Responses [2.296475290901356]
Two sets of results are presented in this paper.
I design attack strategies against UCB and Thompson Sampling that only spend $\widehat{O}(\sqrt{\log T})$ cost.
Inspired by literature on smoothed analysis and behavioral economics, I present two simple algorithms that achieve a competitive ratio arbitrarily close to 1.
arXiv Detail & Related papers (2020-08-21T05:23:47Z)
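The "At-Least-$n$ Explore-Then-Commit" entry above can be pictured with the following generic explore-then-commit skeleton. It is a sketch under simplifying assumptions (no self-reinforcing preference dynamics, a hypothetical unit payment per forced pull, and an illustrative constant in place of a gap-aware choice of $n$), not the policy from that paper. Because each of the $K$ arms is forced only $n = O(\log T)$ times, the total payment is $Kn = O(\log T)$, matching the flavor of the stated guarantee.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

K, T = 4, 50_000
mu = rng.uniform(0.3, 0.8, size=K)          # unknown mean rewards (simulation only)

# Explore each arm at least n = O(log T) times; the constant 4 is illustrative
# (a gap-aware choice on the order of log(T)/gap^2 is what yields O(log T) regret).
n_min = max(1, math.ceil(4 * math.log(T)))
counts = np.zeros(K)
sums = np.zeros(K)
payment = 0.0                               # hypothetical unit payment per forced exploration pull

for t in range(T):
    under_explored = np.flatnonzero(counts < n_min)
    if under_explored.size > 0:
        arm = int(under_explored[0])        # forced exploration phase
        payment += 1.0
    else:
        arm = int(np.argmax(sums / counts)) # commit to the empirically best arm
    reward = float(rng.random() < mu[arm])
    counts[arm] += 1
    sums[arm] += reward

print("total incentive payment:", payment, "=", K, "x", n_min)
print("empirical best arm:", int(np.argmax(sums / counts)), "| true best arm:", int(np.argmax(mu)))
```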
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.