Evaluating Online Bandit Exploration In Large-Scale Recommender System
- URL: http://arxiv.org/abs/2304.02572v3
- Date: Sun, 30 Jul 2023 08:29:55 GMT
- Title: Evaluating Online Bandit Exploration In Large-Scale Recommender System
- Authors: Hongbo Guo, Ruben Naeff, Alex Nikulkov, Zheqing Zhu
- Abstract summary: Bandit learning has been an increasingly popular design choice for recommender systems.
One major bottleneck is how to test the effectiveness of a bandit algorithm fairly and without data leakage.
- Score: 0.7981257687111937
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bandit learning has been an increasingly popular design choice for
recommender systems. Despite the strong interest in bandit learning from the
community, multiple bottlenecks remain that prevent many bandit learning
approaches from reaching production. One major bottleneck is how to test the
effectiveness of a bandit algorithm fairly and without data leakage. Unlike
supervised learning algorithms, bandit learning algorithms place great emphasis
on the data collection process through their explorative nature. Such
explorative behavior may induce unfair evaluation in a classic A/B test
setting. In this work, we apply the upper confidence bound (UCB) algorithm to
our large-scale short video recommender system and present a test framework
for the production bandit learning life-cycle with a new set of metrics.
Extensive experiment results show that our experiment design fairly evaluates
the performance of bandit learning in the recommender system.
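The abstract names UCB as the exploration algorithm deployed in the short video recommender but gives no implementation details. Below is a minimal, illustrative sketch of classic UCB1 scoring adapted to ranking candidate items; the class name, method names, and exploration coefficient are assumptions for illustration, not the authors' production system.

```python
import math
from collections import defaultdict

class UCB1Ranker:
    """Minimal UCB1 scorer: rank candidate items by their mean observed reward
    plus an exploration bonus that shrinks as an item accumulates impressions."""

    def __init__(self, exploration_coef: float = 1.0):
        self.exploration_coef = exploration_coef
        self.impressions = defaultdict(int)   # times each item was shown
        self.reward_sum = defaultdict(float)  # cumulative reward per item
        self.total = 0                        # total impressions so far

    def score(self, item_id: str) -> float:
        n = self.impressions[item_id]
        if n == 0:
            return float("inf")  # force at least one impression per item
        mean = self.reward_sum[item_id] / n
        bonus = self.exploration_coef * math.sqrt(2 * math.log(self.total) / n)
        return mean + bonus

    def select(self, candidates: list[str]) -> str:
        # Pick the candidate with the highest optimistic (mean + bonus) score.
        return max(candidates, key=self.score)

    def update(self, item_id: str, reward: float) -> None:
        self.impressions[item_id] += 1
        self.reward_sum[item_id] += reward
        self.total += 1
```

The exploration bonus deliberately surfaces under-served items, which is the explorative behavior the abstract argues can make a naive A/B comparison between a bandit arm and a non-exploring control arm unfair.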
Related papers
- Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
arXiv Detail & Related papers (2024-07-24T09:23:22Z)
- Jump Starting Bandits with LLM-Generated Prior Knowledge [5.344012058238259]
We show that Large Language Models can jump-start contextual multi-armed bandits to reduce online learning regret.
We propose an algorithm for contextual bandits by prompting LLMs to produce a pre-training dataset of approximate human preferences for the bandit.
arXiv Detail & Related papers (2024-06-27T16:52:19Z)
- Deep Exploration for Recommendation Systems [14.937000494745861]
We develop deep exploration methods for recommendation systems.
In particular, we formulate recommendation as a sequential decision problem.
Our experiments are carried out with high-fidelity industrial-grade simulators.
arXiv Detail & Related papers (2021-09-26T06:54:26Z)
- Multiclass Classification using dilute bandit feedback [8.452237741722726]
We propose an algorithm for multiclass classification using dilute bandit feedback (MC-DBF)
We show that the proposed algorithm achieves an $O(T^{1-\frac{1}{m+2}})$ mistake bound if the candidate label set size (in each step) is m.
arXiv Detail & Related papers (2021-05-17T18:05:34Z)
- An empirical evaluation of active inference in multi-armed bandits [0.0]
The active inference framework is distinguished by its sophisticated strategy for resolving the exploration-exploitation trade-off.
We derive an efficient and approximate scalable active inference algorithm and compare it to two state-of-the-art bandit algorithms.
arXiv Detail & Related papers (2021-01-21T16:20:06Z)
- Lifelong Learning in Multi-Armed Bandits [22.301793734117805]
We study the problem in the multi-armed bandit framework with the objective to minimize the total regret incurred over a series of tasks.
While most bandit algorithms are designed to have a low worst-case regret, we examine here the average regret over bandit instances drawn from some prior distribution which may change over time.
arXiv Detail & Related papers (2020-12-28T15:13:31Z)
- Instance-Dependent Complexity of Contextual Bandits and Reinforcement Learning: A Disagreement-Based Perspective [104.67295710363679]
In the classical multi-armed bandit problem, instance-dependent algorithms attain improved performance on "easy" problems with a gap between the best and second-best arm.
We introduce a family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds.
We then introduce new oracle-efficient algorithms which adapt to the gap whenever possible, while also attaining the minimax rate in the worst case.
arXiv Detail & Related papers (2020-10-07T01:33:06Z)
- Learning from eXtreme Bandit Feedback [105.0383130431503]
We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces.
In this paper, we introduce a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime.
We employ this estimator in a novel algorithmic procedure -- named Policy Optimization for eXtreme Models (POXM) -- for learning from bandit feedback on XMC tasks.
arXiv Detail & Related papers (2020-09-27T20:47:25Z)
- Partial Bandit and Semi-Bandit: Making the Most Out of Scarce Users' Feedback [62.997667081978825]
We present a novel approach for considering user feedback and evaluate it using three distinct strategies.
Despite the limited amount of feedback returned by users (as low as 20% of the total), our approach obtains results similar to those of state-of-the-art approaches.
arXiv Detail & Related papers (2020-09-16T07:32:51Z)
- Meta-learning with Stochastic Linear Bandits [120.43000970418939]
We consider a class of bandit algorithms that implement a regularized version of the well-known OFUL algorithm, where the regularization is a squared Euclidean distance to a bias vector (see the sketch after this list).
We show both theoretically and experimentally, that when the number of tasks grows and the variance of the task-distribution is small, our strategies have a significant advantage over learning the tasks in isolation.
arXiv Detail & Related papers (2020-05-18T08:41:39Z)
- Reward-Conditioned Policies [100.64167842905069]
Imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search.
arXiv Detail & Related papers (2019-12-31T18:07:43Z)
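To make the regularization mentioned in the Meta-learning with Stochastic Linear Bandits entry above concrete, one plausible form of the biased ridge-regression objective is sketched below; the symbols (reward $r_s$, context $x_s$, regularization weight $\lambda$, bias vector $h$) are assumptions based on the entry's wording, not the paper's exact notation.

```latex
% A sketch of "OFUL regularized by a squared Euclidean distance to a bias
% vector h" (notation assumed): biased regularized least squares at round t.
\[
  \hat{\theta}_t \;=\; \arg\min_{\theta}
  \sum_{s=1}^{t-1} \bigl(r_s - \langle x_s, \theta \rangle\bigr)^2
  \;+\; \lambda \,\lVert \theta - h \rVert_2^2 .
\]
```

Shrinking the estimate toward a shared bias vector h rather than toward zero is what can give an advantage over learning each task in isolation when tasks are drawn from a low-variance distribution, as the entry above describes.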