Offline Evaluation of Reward-Optimizing Recommender Systems: The Case of Simulation
- URL: http://arxiv.org/abs/2209.08642v1
- Date: Sun, 18 Sep 2022 20:03:32 GMT
- Title: Offline Evaluation of Reward-Optimizing Recommender Systems: The Case of Simulation
- Authors: Imad Aouali, Amine Benhalloum, Martin Bompaire, Benjamin Heymann, Olivier Jeunen, David Rohde, Otmane Sakhi and Flavian Vasile
- Abstract summary: In both academic and industry-based research, online evaluation methods are seen as the gold standard for interactive applications like recommender systems.
Online evaluation methods are costly for a number of reasons, and a clear need remains for reliable offline evaluation procedures.
In academic work, limited access to online systems makes offline metrics the de facto approach to validating novel methods.
- Score: 11.940733431087102
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Both in academic and industry-based research, online evaluation methods are seen as the gold standard for interactive applications like recommender systems. Naturally, the reason for this is that we can directly measure utility metrics that rely on interventions: the recommendations that are actually shown to users. Nevertheless, online evaluation methods are costly for a number of reasons, and a clear need remains for reliable offline evaluation procedures. In industry, offline metrics are often used as a first-line evaluation to generate promising candidate models to evaluate online. In academic work, limited access to online systems makes offline metrics the de facto approach to validating novel methods. Two classes of offline metrics exist: proxy-based methods and counterfactual methods. The former are often poorly correlated with the online metrics we care about, and the latter only provide theoretical guarantees under assumptions that cannot be fulfilled in real-world environments. Here, we make the case that simulation-based comparisons provide a way forward beyond offline metrics, and argue that they are a preferable means of evaluation.
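To make the contrast concrete, the sketch below sets up a toy click model that doubles as a stand-in simulator, then evaluates a target policy two ways: by simply running it in the simulator, and via an inverse propensity scoring (IPS) estimate over logs from a different policy, one representative of the counterfactual class. The user model and all names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ITEMS, N_LOGS = 10, 100_000

# Toy click model: each item has a hidden click probability. This is
# the kind of user model a simulator would expose.
click_prob = rng.uniform(0.01, 0.2, size=N_ITEMS)

# Logging policy: uniform random, so propensities are known exactly.
log_propensity = 1.0 / N_ITEMS
logged_actions = rng.integers(N_ITEMS, size=N_LOGS)
logged_rewards = (rng.random(N_LOGS) < click_prob[logged_actions]).astype(float)

# Target policy: softmax over a noisy estimate of item quality.
scores = click_prob + rng.normal(0.0, 0.02, size=N_ITEMS)
exp_s = np.exp(scores / 0.05)
pi = exp_s / exp_s.sum()

# Counterfactual (offline) route: inverse propensity scoring (IPS).
ips_estimate = np.mean(pi[logged_actions] / log_propensity * logged_rewards)

# Simulation-based route: just run the target policy in the simulator.
sim_actions = rng.choice(N_ITEMS, size=N_LOGS, p=pi)
sim_ctr = np.mean(rng.random(N_LOGS) < click_prob[sim_actions])

print(f"IPS estimate of target-policy CTR: {ips_estimate:.4f}")
print(f"Simulated target-policy CTR:       {sim_ctr:.4f}")
print(f"True expected CTR:                 {pi @ click_prob:.4f}")
```

Note that the IPS route is only unbiased here because the logging propensities are known exactly and cover every item the target policy can choose; these are precisely the assumptions the abstract flags as unrealistic in production. The simulated route needs no propensities at all, only a user model one is willing to trust, which is the trade-off the paper argues for.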
Related papers
- Online and Offline Evaluations of Collaborative Filtering and Content Based Recommender Systems [0.0]
This study provides a comparative analysis of a large-scale recommender system operating in Iran.
The system serves user-based and item-based recommendations via content-based, collaborative-filtering, trend-based, and hybrid approaches.
Our methods of evaluation include manual evaluation, offline tests with accuracy and ranking metrics such as hit-rate@k and nDCG, and online tests measuring click-through rate (CTR); a short sketch of the two ranking metrics follows this entry.
arXiv Detail & Related papers (2024-11-02T20:05:31Z)
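For reference, the two ranking metrics named above are short to state in code. This is a minimal sketch under a common simplification (binary relevance, a known set of held-out relevant items per user); names are ours:

```python
import numpy as np

def hit_rate_at_k(ranked_items, held_out_item, k=10):
    """1 if the user's held-out item appears in the top-k, else 0."""
    return float(held_out_item in ranked_items[:k])

def ndcg_at_k(ranked_items, relevant_items, k=10):
    """Binary-relevance nDCG@k: DCG normalized by the ideal DCG."""
    gains = [1.0 if item in relevant_items else 0.0
             for item in ranked_items[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal = sum(1.0 / np.log2(i + 2)
                for i in range(min(len(relevant_items), k)))
    return dcg / ideal if ideal > 0 else 0.0

# Example: the single relevant item sits at rank 3 of the top-3 list.
print(hit_rate_at_k([5, 9, 2, 7], held_out_item=2, k=3))  # 1.0
print(ndcg_at_k([5, 9, 2, 7], relevant_items={2}, k=3))   # 1/log2(4) = 0.5
```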
- Bayesian Design Principles for Offline-to-Online Reinforcement Learning [50.97583504192167]
Offline-to-online fine-tuning is crucial for real-world applications where exploration can be costly or unsafe.
In this paper, we tackle the dilemma of offline-to-online fine-tuning: if the agent remains pessimistic, it may fail to learn a better policy, whereas if it becomes optimistic directly, performance may suffer a sudden drop.
We show that Bayesian design principles are crucial in solving such a dilemma.
arXiv Detail & Related papers (2024-05-31T16:31:07Z)
- Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on their equivalence in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it; a hedged sketch of the underlying control-variate idea follows this entry.
arXiv Detail & Related papers (2024-05-09T12:52:22Z)
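The paper's exact closed form is not reproduced here; the sketch below shows the generic control-variate construction behind such baseline corrections: shift rewards by a baseline b, add b back, and choose b to minimize empirical variance. The estimator stays unbiased for any fixed b whenever the importance weights have unit mean. Function names and the toy data are ours, not the paper's.

```python
import numpy as np

def baseline_corrected_ips(w, r):
    """IPS with an empirically variance-minimizing additive baseline.

    Estimator: mean(w * (r - b)) + b. Unbiased for any fixed b when
    E[w] = 1, since E[w*(r-b)] + b = E[w*r]. Here b is chosen to
    minimize the empirical variance of w*r - b*(w - 1), giving
    b* = Cov(w*r, w - 1) / Var(w - 1). This is an illustrative
    control variate, not necessarily the paper's exact closed form.
    """
    w, r = np.asarray(w, float), np.asarray(r, float)
    wr, c = w * r, w - 1.0
    var_c = c.var(ddof=1)
    b = np.cov(wr, c)[0, 1] / var_c if var_c > 0 else 0.0
    return np.mean(w * (r - b)) + b, b

rng = np.random.default_rng(1)
w = rng.lognormal(0.0, 0.5, 10_000)
w /= w.mean()                      # toy importance weights with mean 1
r = rng.binomial(1, 0.1, 10_000)   # toy logged binary rewards
est, b = baseline_corrected_ips(w, r)
print(f"estimate={est:.4f}, baseline={b:.4f}")
```

Estimating b from the same sample technically introduces a small finite-sample bias, which is one reason a principled closed-form treatment matters.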
- Bridging Offline-Online Evaluation with a Time-dependent and Popularity Bias-free Offline Metric for Recommenders [3.130722489512822]
We show that penalizing popular items and considering the time of transactions significantly improves our ability to choose the best recommendation model for a live recommender system; a sketch of one such popularity penalty follows this entry.
Our results aim to help the academic community better understand offline evaluation and the optimization criteria that are most relevant for real applications of recommender systems.
arXiv Detail & Related papers (2023-08-14T01:37:02Z)
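One simple way to realize the "penalize popular items" idea, shown as an illustration only (the paper's metric may differ), is to down-weight each hit by the training-set popularity of the held-out item:

```python
import numpy as np
from collections import Counter

def popularity_weighted_hit_rate(recommendations, held_out,
                                 train_interactions, k=10):
    """Hit-rate@k where each hit is down-weighted by item popularity.

    A hit on a rarely-interacted item counts more than a hit on a
    blockbuster; weights are 1 / log2(2 + popularity). Illustrative
    penalty, not necessarily the paper's exact metric.
    """
    pop = Counter(train_interactions)
    hits, total = 0.0, 0.0
    for user, target in held_out.items():
        weight = 1.0 / np.log2(2 + pop[target])
        total += weight
        if target in recommendations[user][:k]:
            hits += weight
    return hits / total if total > 0 else 0.0

# Toy usage: item 1 is popular, item 7 is niche; the niche hit for u1
# counts more than the popular miss for u2 costs.
train = [1, 1, 1, 1, 7]
recs = {"u1": [7, 3, 2], "u2": [4, 5, 6]}
held = {"u1": 7, "u2": 1}
print(popularity_weighted_hit_rate(recs, held, train, k=3))
```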
- Real-Time Evaluation in Online Continual Learning: A New Hope [104.53052316526546]
We evaluate current Continual Learning (CL) methods with respect to their computational costs.
A simple baseline outperforms state-of-the-art CL methods under this evaluation.
Surprisingly, this suggests that the majority of the existing CL literature is tailored to a specific class of streams that is not practical.
arXiv Detail & Related papers (2023-02-02T12:21:10Z)
- Do Offline Metrics Predict Online Performance in Recommender Systems? [79.48653445643865]
We investigate the extent to which offline metrics predict online performance by evaluating recommenders across six simulated environments.
We observe that offline metrics are correlated with online performance across a range of environments; the kind of rank-correlation check involved is sketched after this entry.
We study the impact of adding exploration strategies, and observe that their effectiveness, when compared to greedy recommendation, is highly dependent on the recommendation algorithm.
arXiv Detail & Related papers (2020-11-07T01:41:13Z)
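How well an offline metric predicts online performance is itself measurable: rank the candidate models by each metric and compare the orderings. A minimal sketch with made-up numbers (the study uses six simulated environments, not this toy table):

```python
import numpy as np
from scipy.stats import spearmanr

# Offline metric vs. online metric for five hypothetical candidate
# models; the values below are invented for illustration.
offline_ndcg = np.array([0.21, 0.25, 0.30, 0.28, 0.35])
online_ctr   = np.array([0.011, 0.013, 0.015, 0.012, 0.016])

rho, pval = spearmanr(offline_ndcg, online_ctr)
print(f"Spearman rank correlation: {rho:.2f} (p={pval:.3f})")
# A high rho means the offline metric would pick nearly the same
# model ordering as the online test -- the property being studied.
```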
- Modeling Online Behavior in Recommender Systems: The Importance of Temporal Context [30.894950420437926]
We show how omitting temporal context when evaluating recommender system performance leads to false confidence; a sketch of a temporal train/test split follows this entry.
We propose a training procedure to further embed the temporal context in existing models.
Results show that including our temporal objective can improve recall@20 by up to 20%.
arXiv Detail & Related papers (2020-09-19T19:36:43Z)
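The evaluation-side fix is to split logged interactions by time rather than at random, so that models are tested only on future behaviour. A self-contained sketch (the data layout and names are our own, not the paper's):

```python
from collections import defaultdict

def temporal_split(interactions, cutoff):
    """Split (user, item, timestamp) logs at a time cutoff.

    Training on the past and testing on the future respects temporal
    context; a random split would leak future behaviour into training.
    """
    train = [x for x in interactions if x[2] < cutoff]
    test = [x for x in interactions if x[2] >= cutoff]
    return train, test

def recall_at_k(recommendations, test, k=20):
    """Average per-user recall@k over future (test) interactions."""
    truth = defaultdict(set)
    for user, item, _ in test:
        truth[user].add(item)
    scores = [len(set(recommendations[u][:k]) & items) / len(items)
              for u, items in truth.items() if u in recommendations]
    return sum(scores) / len(scores) if scores else 0.0

# Toy usage with hypothetical logs and a fixed cutoff timestamp.
logs = [("u1", 1, 10), ("u1", 2, 30), ("u2", 3, 5), ("u2", 4, 40)]
train, test = temporal_split(logs, cutoff=20)
recs = {"u1": [2, 5, 6], "u2": [7, 4, 8]}
print(recall_at_k(recs, test, k=20))  # both future items recovered -> 1.0
```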
- PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems [48.99561874529323]
There are three kinds of automatic methods for evaluating open-domain generative dialogue systems.
Due to the lack of a systematic comparison, it is not clear which kind of metric is more effective.
We propose a novel and feasible learning-based metric that can significantly improve the correlation with human judgments.
arXiv Detail & Related papers (2020-04-06T04:36:33Z)
- AliExpress Learning-To-Rank: Maximizing Online Model Performance without Going Online [60.887637616379926]
This paper proposes an evaluator-generator framework for learning-to-rank.
It consists of an evaluator that generalizes to score recommendations in context, and a generator that maximizes the evaluator score via reinforcement learning.
Our method achieves a significant improvement in terms of Conversion Rate (CR) over the industrial-level fine-tuned model in online A/B tests.
arXiv Detail & Related papers (2020-03-25T10:27:44Z)