Non-Stationary Latent Bandits
- URL: http://arxiv.org/abs/2012.00386v1
- Date: Tue, 1 Dec 2020 10:31:57 GMT
- Title: Non-Stationary Latent Bandits
- Authors: Joey Hong, Branislav Kveton, Manzil Zaheer, Yinlam Chow, Amr Ahmed,
Mohammad Ghavamzadeh, Craig Boutilier
- Abstract summary: We propose a practical approach for fast personalization to non-stationary users.
The key idea is to frame this problem as a latent bandit, where prototypical models of user behavior are learned offline and the latent state of the user is inferred online.
We propose Thompson sampling algorithms for regret minimization in non-stationary latent bandits, analyze them, and evaluate them on a real-world dataset.
- Score: 68.21614490603758
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Users of recommender systems often behave in a non-stationary fashion, due to
their evolving preferences and tastes over time. In this work, we propose a
practical approach for fast personalization to non-stationary users. The key
idea is to frame this problem as a latent bandit, where the prototypical models
of user behavior are learned offline and the latent state of the user is
inferred online from their interactions with the models. We call this problem a
non-stationary latent bandit. We propose Thompson sampling algorithms for
regret minimization in non-stationary latent bandits, analyze them, and
evaluate them on a real-world dataset. The main strength of our approach is
that it can be combined with rich offline-learned models, which can be
misspecified, and are subsequently fine-tuned online using posterior sampling.
In this way, we naturally combine the strengths of offline and online learning.
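To make the setup concrete, here is a minimal sketch of a Thompson sampling loop for a non-stationary latent bandit. It reduces the offline-learned prototypical models to a table of per-state mean rewards, and assumes Gaussian reward noise and a known uniform switching probability for the latent state; all names and the specific belief update are illustrative, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

K, A = 3, 5                            # latent states, arms (illustrative sizes)
mu = rng.uniform(0, 1, size=(K, A))    # offline-learned mean reward per (state, arm)
sigma = 0.1                            # assumed Gaussian reward noise
switch_p = 0.05                        # assumed per-step latent switching probability

belief = np.full(K, 1.0 / K)           # posterior over the user's latent state
true_state = int(rng.integers(K))

for t in range(1000):
    # Non-stationarity: diffuse the belief through the latent transition kernel.
    belief = (1 - switch_p) * belief + switch_p * (1 - belief) / (K - 1)

    # Thompson sampling: sample a latent state, act greedily for it.
    s = rng.choice(K, p=belief)
    a = int(np.argmax(mu[s]))

    # Environment: the latent state drifts; the reward is noisy around its mean.
    if rng.random() < switch_p:
        true_state = int(rng.integers(K))
    r = mu[true_state, a] + sigma * rng.normal()

    # Bayesian update: reweight each state by the likelihood of the reward.
    belief *= np.exp(-0.5 * ((r - mu[:, a]) / sigma) ** 2)
    belief /= belief.sum()
```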
Related papers
- Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
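The core estimation step can be illustrated with a Bradley-Terry preference likelihood. The sketch below fits a linear reward model in place of the paper's neural network; all sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

d = 4
theta_star = rng.normal(size=d)              # unknown "true" reward parameters
X = rng.normal(size=(200, 2, d))             # 200 duels: features of arms (a, b)

# Simulated preference feedback: P(a beats b) follows a Bradley-Terry model.
p = 1 / (1 + np.exp(-(X[:, 0] - X[:, 1]) @ theta_star))
y = (rng.random(200) < p).astype(float)      # 1 if arm a was preferred

# Fit the reward model by gradient ascent on the Bradley-Terry log-likelihood.
theta = np.zeros(d)
for _ in range(500):
    z = (X[:, 0] - X[:, 1]) @ theta
    grad = ((y - 1 / (1 + np.exp(-z)))[:, None] * (X[:, 0] - X[:, 1])).mean(axis=0)
    theta += 0.5 * grad
```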
arXiv Detail & Related papers (2024-07-24T09:23:22Z)
- Online Bandit Learning with Offline Preference Data [15.799929216215672]
We propose a posterior sampling algorithm for online learning that can be warm-started with an offline dataset with noisy preference feedback.
We show that by modeling the 'competence' of the expert that generated it, we are able to use such a dataset most effectively.
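As a rough illustration of warm-starting posterior sampling from noisy preference data, the sketch below uses Beta-Bernoulli arms and scales offline pseudo-counts by a competence parameter `rho`; this particular weighting is a guess at the idea, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(2)

A = 5
ctr = rng.uniform(0.1, 0.9, size=A)      # unknown Bernoulli reward means
alpha, beta = np.ones(A), np.ones(A)      # Beta(1, 1) prior per arm

# Warm start: offline pairwise preferences from an expert of competence rho.
# rho in [0, 1] scales how strongly each record shifts the prior
# (rho = 0 discards the offline data entirely).
rho = 0.7
offline = [(int(np.argmax(ctr)), int(rng.integers(A))) for _ in range(50)]
for winner, loser in offline:
    alpha[winner] += rho
    beta[loser] += rho

# Online phase: standard Thompson sampling from the warm-started posterior.
for t in range(1000):
    a = int(np.argmax(rng.beta(alpha, beta)))
    r = float(rng.random() < ctr[a])
    alpha[a] += r
    beta[a] += 1 - r
```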
arXiv Detail & Related papers (2024-06-13T20:25:52Z)
- MOTO: Offline Pre-training to Online Fine-tuning for Model-based Robot Learning [52.101643259906915]
We study the problem of offline pre-training and online fine-tuning for reinforcement learning from high-dimensional observations.
Existing model-based offline RL methods are not suitable for offline-to-online fine-tuning in high-dimensional domains.
We propose an on-policy model-based method that can efficiently reuse prior data through model-based value expansion and policy regularization.
arXiv Detail & Related papers (2024-01-06T21:04:31Z)
- Online Matching: A Real-time Bandit System for Large-scale Recommendations [23.954049092470548]
Online Matching is a scalable closed-loop bandit system learning from users' direct feedback on items in real time.
Diag-LinUCB is a novel extension of the LinUCB algorithm that enables distributed updates of bandit parameters in a scalable and timely manner.
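One plausible reading of the diagonal idea: keep only the diagonal of LinUCB's design matrix, so each update is O(d) and per-worker state can be merged by summation. The sketch below is an assumption-laden illustration, not the paper's exact Diag-LinUCB.

```python
import numpy as np

rng = np.random.default_rng(3)

d, alpha = 8, 1.0
theta_star = rng.normal(size=d) / np.sqrt(d)

# Diagonal approximation of LinUCB: keep only diag(A) and b, so each update
# is O(d) and state from distributed workers merges by elementwise addition.
diag_A = np.ones(d)                 # stands in for A = I + sum of x x^T
b = np.zeros(d)

for t in range(1000):
    X = rng.normal(size=(10, d))    # candidate items' feature vectors
    theta = b / diag_A              # ridge-like point estimate
    ucb = X @ theta + alpha * np.sqrt((X ** 2 / diag_A).sum(axis=1))
    a = int(np.argmax(ucb))
    r = X[a] @ theta_star + 0.1 * rng.normal()
    diag_A += X[a] ** 2             # diagonal-only design-matrix update
    b += r * X[a]
```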
arXiv Detail & Related papers (2023-07-29T05:46:27Z)
- Anytime Model Selection in Linear Bandits [61.97047189786905]
We develop ALEXP, whose regret has an exponentially improved dependence on the number of candidate models $M$.
Our approach utilizes a novel time-uniform analysis of the Lasso, establishing a new connection between online learning and high-dimensional statistics.
arXiv Detail & Related papers (2023-07-24T15:44:30Z)
- Adaptive Endpointing with Deep Contextual Multi-armed Bandits [30.13188582607401]
We propose an efficient method for choosing an optimal endpointing configuration given utterance-level audio features in an online setting.
Our method requires no ground-truth annotations and learns online from reward signals alone.
arXiv Detail & Related papers (2023-03-23T16:28:26Z)
- Offline-to-Online Reinforcement Learning via Balanced Replay and Pessimistic Q-Ensemble [135.6115462399788]
Deep offline reinforcement learning has made it possible to train strong robotic agents from offline datasets.
State-action distribution shift may lead to severe bootstrap error during fine-tuning.
We propose a balanced replay scheme that prioritizes samples encountered online while also encouraging the use of near-on-policy samples.
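A minimal sketch of the mixing idea follows, with the paper's prioritization of near-on-policy samples reduced to a fixed online fraction; the function and buffer names are illustrative.

```python
import random

def sample_balanced(offline, online, batch_size, online_frac=0.5):
    """Mix offline and online transitions in each batch, favoring online data.

    Simplified stand-in for balanced replay: the actual scheme prioritizes
    samples by an estimated "on-policyness" (a density ratio), which is
    reduced here to a fixed mixing fraction.
    """
    n_online = min(int(batch_size * online_frac), len(online))
    batch = random.sample(online, n_online)
    batch += random.sample(offline, batch_size - n_online)
    random.shuffle(batch)
    return batch

offline_buffer = [("s", "a", 0.0, "s2")] * 1000   # placeholder transitions
online_buffer = [("s", "a", 1.0, "s2")] * 100
batch = sample_balanced(offline_buffer, online_buffer, batch_size=64)
```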
arXiv Detail & Related papers (2021-07-01T16:26:54Z)
- Modeling Online Behavior in Recommender Systems: The Importance of Temporal Context [30.894950420437926]
We show how omitting temporal context when evaluating recommender system performance leads to false confidence.
We propose a training procedure to further embed the temporal context in existing models.
Results show that including our temporal objective can improve recall@20 by up to 20%.
arXiv Detail & Related papers (2020-09-19T19:36:43Z)
- Deep Bayesian Bandits: Exploring in Online Personalized Recommendations [4.845576821204241]
We formulate a display advertising recommender as a contextual bandit.
We implement exploration techniques that require sampling from the posterior distribution of click-through-rates.
We test our proposed deep Bayesian bandits algorithm in offline simulation and in an online A/B setting.
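One common way to sample from a posterior over click-through rates is a bootstrapped ensemble; the sketch below uses linear CTR models with Poisson-weighted online bootstrap updates as a stand-in for the paper's deep models, and all names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

d, n_models = 6, 10
w_star = rng.normal(size=d)                    # unknown "true" CTR model

def ctr(w, x):
    return 1 / (1 + np.exp(-w @ x))

# Approximate the posterior over CTR models with a bootstrapped ensemble.
ensemble = [np.zeros(d) for _ in range(n_models)]

for t in range(2000):
    X = rng.normal(size=(5, d))                # candidate ads' context features
    k = int(rng.integers(n_models))            # Thompson step: sample one member
    a = int(np.argmax([ctr(ensemble[k], x) for x in X]))
    click = float(rng.random() < ctr(w_star, X[a]))
    # Online bootstrap: each member sees the example a Poisson(1) number of
    # times, approximating training on its own bootstrap resample.
    for w in ensemble:
        w += 0.1 * rng.poisson(1.0) * (click - ctr(w, X[a])) * X[a]
```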
arXiv Detail & Related papers (2020-08-03T08:58:18Z)
- Latent Bandits Revisited [55.88616813182679]
A latent bandit problem is one in which the learning agent knows the arm reward distributions conditioned on an unknown discrete latent state.
We propose general algorithms for this setting, based on both upper confidence bounds (UCBs) and Thompson sampling.
We provide a unified theoretical analysis of our algorithms, which have lower regret than classic bandit policies when the number of latent states is smaller than the number of actions.
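A UCB-flavored sketch of this setting: known per-state reward means, a plausibility test that eliminates latent states inconsistent with observed arm means, and optimistic play among the survivors. This is illustrative, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(5)

K, A = 4, 6
mu = rng.uniform(0, 1, size=(K, A))    # known reward means per (state, arm)
true_state = int(rng.integers(K))

counts = np.zeros(A)
sums = np.zeros(A)

for t in range(1, 2001):
    means = np.divide(sums, counts, out=np.zeros(A), where=counts > 0)
    width = np.sqrt(2 * np.log(t) / np.maximum(counts, 1))
    # A latent state is plausible if its model fits the confidence interval
    # of every arm we have actually pulled.
    ok = np.all((np.abs(mu - means) <= width) | (counts == 0), axis=1)
    if not ok.any():
        ok[:] = True                    # fallback so the set is never empty
    # Optimism: play the best arm of the most optimistic plausible state.
    a = int(np.argmax(mu[ok].max(axis=0)))
    r = float(rng.random() < mu[true_state, a])
    counts[a] += 1
    sums[a] += r
```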
arXiv Detail & Related papers (2020-06-15T19:24:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.