Learning by Repetition: Stochastic Multi-armed Bandits under Priming Effect
- URL: http://arxiv.org/abs/2006.10356v1
- Date: Thu, 18 Jun 2020 08:27:23 GMT
- Title: Learning by Repetition: Stochastic Multi-armed Bandits under Priming Effect
- Authors: Priyank Agrawal and Theja Tulabandhula
- Abstract summary: We study the effect of persistence of engagement on learning in a multi-armed bandit setting.
We provide novel algorithms that achieve regret sublinear in the time horizon and in the relevant wear-in/wear-out parameters.
- Score: 2.5966580648312223
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the effect of persistence of engagement on learning in a stochastic
multi-armed bandit setting. In advertising and recommendation systems,
the repetition effect includes a wear-in period, where the user's propensity to
reward the platform via a click or purchase depends on how frequently they see
the recommendation in the recent past. It also includes a counteracting
wear-out period, where the user's propensity to respond positively is dampened
if the recommendation was shown too many times recently. The priming effect can be
naturally modelled as a temporal constraint on the strategy space, since the
reward for the current action depends on historical actions taken by the
platform. We provide novel algorithms that achieve regret sublinear in the time
horizon and in the relevant wear-in/wear-out parameters. The effect of priming
on the regret upper bound is additive, and we recover a guarantee that matches
popular algorithms such as UCB1 and Thompson Sampling when there is no priming
effect. Our work complements recent work on modeling time-varying rewards,
delays, and corruptions in bandits, and extends the use of rich behavior models
in sequential decision-making settings.
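The wear-in/wear-out dynamics described above can be made concrete with a small simulation. The following is a minimal sketch, not the paper's algorithm: it assumes a hypothetical reward model in which an arm's click probability receives a fixed boost only while its recent play count lies inside a wear-in/wear-out window, and it runs plain UCB1 on top of that model. The parameter values (base_p, d_in, d_out, window, boost) and the helper primed_prob are illustrative assumptions.

```python
import numpy as np

# A minimal sketch (not the paper's algorithm): a stylized priming model in which
# an arm's click probability is boosted only while its recent play count lies
# inside a hypothetical wear-in/wear-out window [d_in, d_out].
rng = np.random.default_rng(0)

K = 3                                    # number of arms
base_p = np.array([0.30, 0.45, 0.40])    # baseline click probabilities (assumed)
d_in, d_out = 2, 6                       # hypothetical wear-in / wear-out thresholds
window = 10                              # look-back horizon defining "recent" plays
boost = 0.15                             # extra click probability inside the window
T = 5000                                 # number of rounds

def primed_prob(arm, history):
    """Click probability given how often `arm` was shown in the last `window` rounds."""
    recent = sum(1 for a in history[-window:] if a == arm)
    if d_in <= recent <= d_out:          # wear-in satisfied, wear-out not yet triggered
        return min(1.0, base_p[arm] + boost)
    return base_p[arm]

# Plain UCB1 run on top of the primed reward model: the reward for the current
# action depends on the platform's own recent history of actions.
counts = np.zeros(K)
means = np.zeros(K)
history = []
for t in range(1, T + 1):
    if t <= K:
        arm = t - 1                      # play each arm once to initialize
    else:
        ucb = means + np.sqrt(2.0 * np.log(t) / counts)
        arm = int(np.argmax(ucb))
    reward = float(rng.random() < primed_prob(arm, history))
    history.append(arm)
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]

print("empirical means:", np.round(means, 3), "pulls:", counts.astype(int))
```

The sketch only illustrates how the current reward can depend on the platform's historical actions; the algorithms in the paper additionally exploit this temporal structure to obtain the stated sublinear-regret guarantees.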
Related papers
- Contextual Bandit with Herding Effects: Algorithms and Recommendation Applications [17.865143559133994]
"Herding effects" bias user feedback toward historical ratings, breaking down the assumption of unbiased feedback inherent in contextual bandits.
This paper develops a novel variant of the contextual bandit that is tailored to address the feedback bias caused by the herding effects.
We show that TS-Conf effectively mitigates the negative impact of herding effects, resulting in faster learning and improved recommendation accuracy.
arXiv Detail & Related papers (2024-08-26T17:20:34Z) - Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
arXiv Detail & Related papers (2024-07-24T09:23:22Z) - Sparsity-Agnostic Linear Bandits with Adaptive Adversaries [19.84322270472381]
We study linear bandits where, in each round, the learner receives a set of actions (i.e., feature vectors) from which it chooses an element and obtains a reward.
The expected reward is a fixed but unknown linear function of the chosen action.
We study sparse regret bounds that depend on the number $S$ of non-zero coefficients in the linear reward function.
arXiv Detail & Related papers (2024-06-03T10:54:58Z) - Last Switch Dependent Bandits with Monotone Payoff Functions [8.860629791560198]
We take a step towards understanding the approximability of planning LSD bandits, namely, the (NP-hard) problem of computing an optimal arm-pulling strategy.
In particular, we design the first efficient constant approximation algorithm for the problem and show that, under a natural monotonicity assumption on payoffs, its approximation guarantee almost matches the state-of-the-art.
We develop new tools and insights for this class of problems, including a novel higher-dimensional relaxation and the technique of mirroring the evolution of virtual states.
arXiv Detail & Related papers (2023-06-01T04:38:32Z) - Generative Slate Recommendation with Reinforcement Learning [49.75985313698214]
Reinforcement learning algorithms can be used to optimize user engagement in recommender systems.
However, RL approaches are intractable in the slate recommendation scenario.
In that setting, an action corresponds to a slate that may contain any combination of items.
In this work we propose to encode slates in a continuous, low-dimensional latent space learned by a variational auto-encoder.
We are able to (i) relax assumptions required by previous work, and (ii) improve the quality of the action selection by modeling full slates.
arXiv Detail & Related papers (2023-01-20T15:28:09Z) - Reward Imputation with Sketching for Contextual Batched Bandits [48.80803376405073]
Contextual batched bandit (CBB) is a setting where a batch of rewards is observed from the environment at the end of each episode.
Existing approaches for CBB often ignore the rewards of the non-executed actions, leading to underutilization of feedback information.
We propose Sketched Policy Updating with Imputed Rewards (SPUIR) that completes the unobserved rewards using sketching.
arXiv Detail & Related papers (2022-10-13T04:26:06Z) - Effective and Efficient Training for Sequential Recommendation using
Recency Sampling [91.02268704681124]
We propose a novel Recency-based Sampling of Sequences training objective.
We show that models enhanced with our method achieve performance exceeding or very close to that of the state-of-the-art BERT4Rec.
arXiv Detail & Related papers (2022-07-06T13:06:31Z) - Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When
Partial Feedback Counts [53.579515853222986]
We study a novel bandit setting, namely the Multi-Armed Bandit with Temporally-Partitioned Rewards (TP-MAB).
This setting is a natural extension of delayed-feedback bandits to the case in which rewards may be dilated over a finite-time span after the pull.
We provide two algorithms to address TP-MAB problems, namely, TP-UCB-FR and TP-UCB-EW.
arXiv Detail & Related papers (2022-06-01T15:56:59Z) - Non-Stationary Latent Bandits [68.21614490603758]
We propose a practical approach for fast personalization to non-stationary users.
The key idea is to frame this problem as a latent bandit, where prototypical models of user behavior are learned offline and the latent state of the user is inferred online.
We propose Thompson sampling algorithms for regret minimization in non-stationary latent bandits, analyze them, and evaluate them on a real-world dataset.
arXiv Detail & Related papers (2020-12-01T10:31:57Z) - Rebounding Bandits for Modeling Satiation Effects [22.92512152419544]
We introduce rebounding bandits, a multi-armed bandit setup, where satiation dynamics are modeled as time-invariant linear dynamical systems.
We characterize the planning problem showing that the greedy policy is optimal when arms exhibit identical dynamics.
arXiv Detail & Related papers (2020-11-13T03:17:29Z)