Evaluating Online Bandit Exploration In Large-Scale Recommender System
- URL: http://arxiv.org/abs/2304.02572v3
- Date: Sun, 30 Jul 2023 08:29:55 GMT
- Title: Evaluating Online Bandit Exploration In Large-Scale Recommender System
- Authors: Hongbo Guo, Ruben Naeff, Alex Nikulkov, Zheqing Zhu
- Abstract summary: Bandit learning has been an increasingly popular design choice for recommender systems.
One major bottleneck is how to test the effectiveness of a bandit algorithm fairly and without data leakage.
- Score: 0.7981257687111937
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bandit learning has been an increasingly popular design choice for
recommender systems. Despite the strong interest in bandit learning from the
community, multiple bottlenecks remain that prevent many bandit learning
approaches from reaching production. One major bottleneck is how to test the
effectiveness of a bandit algorithm fairly and without data leakage. Unlike
supervised learning algorithms, bandit learning algorithms place great emphasis
on the data collection process through their explorative nature. Such
explorative behavior may induce unfair evaluation in a classic A/B test
setting. In this work, we apply the upper confidence bound (UCB) algorithm to
our large-scale short video recommender system and present a test framework
for the production bandit learning life-cycle with a new set of metrics.
Extensive experiment results show that our experiment design fairly evaluates
the performance of bandit learning in the recommender system.
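The abstract names UCB as the exploration algorithm deployed in the short video recommender but gives no implementation details. Below is a minimal, illustrative sketch of classic UCB1 scoring adapted to ranking candidate items; the class name, method names, and exploration coefficient are assumptions for illustration, not the authors' production system.

```python
import math
from collections import defaultdict

class UCB1Ranker:
    """Minimal UCB1 scorer: rank candidate items by their mean observed reward
    plus an exploration bonus that shrinks as an item accumulates impressions."""

    def __init__(self, exploration_coef: float = 1.0):
        self.exploration_coef = exploration_coef
        self.impressions = defaultdict(int)   # times each item was shown
        self.reward_sum = defaultdict(float)  # cumulative reward per item
        self.total = 0                        # total impressions so far

    def score(self, item_id: str) -> float:
        n = self.impressions[item_id]
        if n == 0:
            return float("inf")  # force at least one impression per item
        mean = self.reward_sum[item_id] / n
        bonus = self.exploration_coef * math.sqrt(2 * math.log(self.total) / n)
        return mean + bonus

    def select(self, candidates: list[str]) -> str:
        # Pick the candidate with the highest optimistic (mean + bonus) score.
        return max(candidates, key=self.score)

    def update(self, item_id: str, reward: float) -> None:
        self.impressions[item_id] += 1
        self.reward_sum[item_id] += reward
        self.total += 1
```

The exploration bonus deliberately surfaces under-served items, which is the explorative behavior the abstract argues can make a naive A/B comparison between a bandit arm and a non-exploring control arm unfair.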
Related papers
- Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
arXiv Detail & Related papers (2024-07-24T09:23:22Z)
- Jump Starting Bandits with LLM-Generated Prior Knowledge [5.344012058238259]
We show that Large Language Models can jump-start contextual multi-armed bandits to reduce online learning regret.
We propose an algorithm for contextual bandits by prompting LLMs to produce a pre-training dataset of approximate human preferences for the bandit.
arXiv Detail & Related papers (2024-06-27T16:52:19Z)
- Deep Exploration for Recommendation Systems [14.937000494745861]
We develop deep exploration methods for recommendation systems.
In particular, we formulate recommendation as a sequential decision problem.
Our experiments are carried out with high-fidelity industrial-grade simulators.
arXiv Detail & Related papers (2021-09-26T06:54:26Z)
- Multiclass Classification using dilute bandit feedback [8.452237741722726]
We propose an algorithm for multiclass classification using dilute bandit feedback (MC-DBF)
We show that the proposed algorithm achieves an $O(T^{1-\frac{1}{m+2}})$ mistake bound if the candidate label set size (in each step) is m.
arXiv Detail & Related papers (2021-05-17T18:05:34Z)
- An empirical evaluation of active inference in multi-armed bandits [0.0]
The active inference framework is distinguished by its sophisticated strategy for resolving the exploration-exploitation trade-off.
We derive an efficient and approximate scalable active inference algorithm and compare it to two state-of-the-art bandit algorithms.
arXiv Detail & Related papers (2021-01-21T16:20:06Z)
- Lifelong Learning in Multi-Armed Bandits [22.301793734117805]
We study the problem in the multi-armed bandit framework with the objective to minimize the total regret incurred over a series of tasks.
While most bandit algorithms are designed to have a low worst-case regret, we examine here the average regret over bandit instances drawn from some prior distribution which may change over time.
arXiv Detail & Related papers (2020-12-28T15:13:31Z)
- Instance-Dependent Complexity of Contextual Bandits and Reinforcement Learning: A Disagreement-Based Perspective [104.67295710363679]
In the classical multi-armed bandit problem, instance-dependent algorithms attain improved performance on "easy" problems with a gap between the best and second-best arm.
We introduce a family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds.
We then introduce new oracle-efficient algorithms which adapt to the gap whenever possible, while also attaining the minimax rate in the worst case.
arXiv Detail & Related papers (2020-10-07T01:33:06Z)
- Learning from eXtreme Bandit Feedback [105.0383130431503]
We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces.
In this paper, we introduce a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime.
We employ this estimator in a novel algorithmic procedure -- named Policy Optimization for eXtreme Models (POXM) -- for learning from bandit feedback on XMC tasks.
arXiv Detail & Related papers (2020-09-27T20:47:25Z)
- Partial Bandit and Semi-Bandit: Making the Most Out of Scarce Users' Feedback [62.997667081978825]
We present a novel approach for considering user feedback and evaluate it using three distinct strategies.
Despite the limited amount of feedback returned by users (as low as 20% of the total), our approach obtains results similar to those of state-of-the-art approaches.
arXiv Detail & Related papers (2020-09-16T07:32:51Z)
- Meta-learning with Stochastic Linear Bandits [120.43000970418939]
We consider a class of bandit algorithms that implement a regularized version of the well-known OFUL algorithm, where the regularization is a squared Euclidean distance to a bias vector (see the sketch after this list).
We show both theoretically and experimentally, that when the number of tasks grows and the variance of the task-distribution is small, our strategies have a significant advantage over learning the tasks in isolation.
arXiv Detail & Related papers (2020-05-18T08:41:39Z)
- Reward-Conditioned Policies [100.64167842905069]
Imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search.
arXiv Detail & Related papers (2019-12-31T18:07:43Z)
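To make the regularization mentioned in the Meta-learning with Stochastic Linear Bandits entry above concrete, one plausible form of the biased ridge-regression objective is sketched below; the symbols (reward $r_s$, context $x_s$, regularization weight $\lambda$, bias vector $h$) are assumptions based on the entry's wording, not the paper's exact notation.

```latex
% A sketch of "OFUL regularized by a squared Euclidean distance to a bias
% vector h" (notation assumed): biased regularized least squares at round t.
\[
  \hat{\theta}_t \;=\; \arg\min_{\theta}
  \sum_{s=1}^{t-1} \bigl(r_s - \langle x_s, \theta \rangle\bigr)^2
  \;+\; \lambda \,\lVert \theta - h \rVert_2^2 .
\]
```

Shrinking the estimate toward a shared bias vector h rather than toward zero is what can give an advantage over learning each task in isolation when tasks are drawn from a low-variance distribution, as the entry above describes.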