Bridging Offline-Online Evaluation with a Time-dependent and Popularity
Bias-free Offline Metric for Recommenders
- URL: http://arxiv.org/abs/2308.06885v1
- Date: Mon, 14 Aug 2023 01:37:02 GMT
- Title: Bridging Offline-Online Evaluation with a Time-dependent and Popularity
Bias-free Offline Metric for Recommenders
- Authors: Petr Kasalický, Rodrigo Alves, Pavel Kordík
- Abstract summary: We show that penalizing popular items and considering the time of transactions significantly improves our ability to choose the best recommendation model for a live recommender system.
Our results aim to help the academic community better understand offline evaluation and the optimization criteria that are most relevant to real applications of recommender systems.
- Score: 3.130722489512822
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The evaluation of recommendation systems is a complex task. The offline and
online evaluation metrics for recommender systems are ambiguous about their true
objectives. The majority of recently published papers benchmark their methods
using ill-posed offline evaluation methodology that often fails to predict true
online performance. Because of this, the impact that academic research has on
the industry is reduced. The aim of our research is to investigate and compare
the online performance of offline evaluation metrics. We show that penalizing
popular items and considering the time of transactions during the evaluation
significantly improves our ability to choose the best recommendation model for
a live recommender system. Our results, averaged over five large real-world
datasets procured from live recommenders, aim to help the academic community
better understand the offline evaluation and optimization criteria that are
most relevant for real applications of recommender systems.
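To make the abstract's idea concrete, below is a minimal sketch of a popularity-penalized, time-aware hit rate in Python. It illustrates the general technique rather than the authors' exact metric; the inverse log-popularity weighting, the exponential recency decay with scale `tau`, and all function names are assumptions:

```python
import numpy as np

def debiased_time_aware_hit_rate(test_events, recommend, item_counts, k=20, tau=30.0):
    """Hit rate@k where hits on rare items count more (inverse log-popularity)
    and recent test transactions count more (exponential decay over `tau` days).
    Illustrative only; the paper's exact weighting may differ."""
    t_max = max(t for _, _, t in test_events)          # newest test timestamp (days)
    hit_weight = total_weight = 0.0
    for user, item, t in test_events:                  # (user_id, item_id, day)
        pop = item_counts.get(item, 1)                 # item's training popularity
        w = np.exp(-(t_max - t) / tau) / np.log1p(pop)
        total_weight += w
        if item in recommend(user, k):                 # model's top-k list for user
            hit_weight += w
    return hit_weight / total_weight if total_weight else 0.0
```

With uniform weights (w = 1) this reduces to the plain hit rate; down-weighting popular items keeps a model from scoring well merely by recommending bestsellers, and the recency decay emphasizes transactions closest to deployment time.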
Related papers
- Online and Offline Evaluations of Collaborative Filtering and Content Based Recommender Systems [0.0]
This study provides a comparative analysis of a large-scale recommender system operating in Iran.
The system employs user-based and item-based recommendations using content-based, collaborative filtering, trend-based methods, and hybrid approaches.
Our methods of evaluation include manual evaluation, offline tests with accuracy and ranking metrics such as hit-rate@k and nDCG, and online tests measuring click-through rate (CTR).
arXiv Detail & Related papers (2024-11-02T20:05:31Z)
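For reference, the hit-rate@k and nDCG@k metrics mentioned above have standard definitions; a minimal binary-relevance sketch (function names are illustrative):

```python
import numpy as np

def hit_rate_at_k(ranked_items, relevant, k=10):
    """1.0 if any relevant item appears in the top-k, else 0.0."""
    return float(any(item in relevant for item in ranked_items[:k]))

def ndcg_at_k(ranked_items, relevant, k=10):
    """Binary-relevance nDCG@k: discounted gain normalized by the ideal ordering."""
    gains = [1.0 if item in relevant else 0.0 for item in ranked_items[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0
```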
- Towards Off-Policy Reinforcement Learning for Ranking Policies with Human Feedback [47.03475305565384]
We propose a new off-policy value ranking (VR) algorithm that can simultaneously maximize user long-term rewards and optimize the ranking metric offline.
We show that the EM process guides the learned policy to combine the benefit of the future reward with the ranking metric, and to learn without any online interactions.
arXiv Detail & Related papers (2024-01-17T04:19:33Z)
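The paper's value-ranking algorithm is more involved, but the off-policy evaluation idea it builds on can be sketched with a basic inverse-propensity-scoring (IPS) estimator; the data layout and names here are assumptions for illustration:

```python
import numpy as np

def ips_value_estimate(logs, target_prob):
    """Estimate a target policy's expected reward from logs collected by a
    different (logging) policy. Each entry: (context, action, reward, log_prob),
    where log_prob is the logging policy's probability of the taken action."""
    weights = np.array([target_prob(c, a) / p for c, a, _, p in logs])
    rewards = np.array([r for _, _, r, _ in logs])
    return float(np.mean(weights * rewards))  # unbiased when log_prob > 0 everywhere
```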
- A Comprehensive Survey of Evaluation Techniques for Recommendation Systems [0.0]
This paper introduces a comprehensive suite of metrics, each tailored to capture a distinct aspect of system performance.
We identify the strengths and limitations of current evaluation practices and highlight the nuanced trade-offs that emerge when optimizing recommendation systems across different metrics.
arXiv Detail & Related papers (2023-12-26T11:57:01Z)
- Offline Evaluation of Reward-Optimizing Recommender Systems: The Case of Simulation [11.940733431087102]
In academic and industry-based research, online evaluation methods are seen as the gold standard for interactive applications like recommendation systems.
Online evaluation methods are costly for a number of reasons, and a clear need remains for reliable offline evaluation procedures.
In academic work, limited access to online systems makes offline metrics the de facto approach to validating novel methods.
arXiv Detail & Related papers (2022-09-18T20:03:32Z)
- FEBR: Expert-Based Recommendation Framework for beneficial and personalized content [77.86290991564829]
We propose FEBR (Expert-Based Recommendation Framework), an apprenticeship learning framework to assess the quality of the recommended content.
The framework exploits the demonstrated trajectories of an expert (assumed to be reliable) in a recommendation evaluation environment, to recover an unknown utility function.
We evaluate the performance of our solution through a user interest simulation environment (using RecSim).
arXiv Detail & Related papers (2021-07-17T18:21:31Z)
- Improving Long-Term Metrics in Recommendation Systems using Short-Horizon Offline RL [56.20835219296896]
We study session-based recommendation scenarios where we want to recommend items to users during sequential interactions to improve their long-term utility.
We develop a new batch RL algorithm called Short Horizon Policy Improvement (SHPI) that approximates policy-induced distribution shifts across sessions.
arXiv Detail & Related papers (2021-06-01T15:58:05Z)
- Benchmarks for Deep Off-Policy Evaluation [152.28569758144022]
We present a collection of policies that can be used for benchmarking off-policy evaluation.
The goal of our benchmark is to provide a standardized measure of progress that is motivated from a set of principles.
We provide open-source access to our data and code to foster future research in this area.
arXiv Detail & Related papers (2021-03-30T18:09:33Z)
- Do Offline Metrics Predict Online Performance in Recommender Systems? [79.48653445643865]
We investigate the extent to which offline metrics predict online performance by evaluating recommenders across six simulated environments.
We observe that offline metrics are correlated with online performance over a range of environments.
We study the impact of adding exploration strategies, and observe that their effectiveness, when compared to greedy recommendation, is highly dependent on the recommendation algorithm.
arXiv Detail & Related papers (2020-11-07T01:41:13Z)
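One way to quantify whether offline metrics predict online performance is rank correlation between offline scores and online outcomes across candidate models; a minimal sketch with hypothetical numbers:

```python
from scipy.stats import kendalltau

# Hypothetical per-model scores: an offline metric vs. online CTR.
offline = [0.21, 0.18, 0.25, 0.16, 0.23]       # e.g., nDCG@10 for five models
online  = [0.031, 0.027, 0.035, 0.029, 0.033]  # e.g., CTR from A/B tests

tau, p = kendalltau(offline, online)
print(f"Kendall tau = {tau:.2f} (p = {p:.3f})")  # rank agreement in [-1, 1]
```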
- Modeling Online Behavior in Recommender Systems: The Importance of Temporal Context [30.894950420437926]
We show how omitting temporal context when evaluating recommender system performance leads to false confidence.
We propose a training procedure to further embed the temporal context in existing models.
Results show that including our temporal objective can improve recall@20 by up to 20%.
arXiv Detail & Related papers (2020-09-19T19:36:43Z)
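The pitfall this entry highlights is easy to reproduce: computing recall@20 on a random split leaks future interactions into training. A minimal sketch of a temporal split plus recall@20 (the 90/10 cut and all names are assumptions):

```python
def temporal_split(events, train_fraction=0.9):
    """Chronological split: earliest interactions train, latest test.
    A random split would let the model 'see the future' during training."""
    events = sorted(events, key=lambda e: e[2])  # (user, item, timestamp)
    cut = int(len(events) * train_fraction)
    return events[:cut], events[cut:]

def recall_at_20(recommend, test_events, k=20):
    """Fraction of held-out interactions whose item appears in the user's top-k."""
    hits = sum(item in recommend(user, k) for user, item, _ in test_events)
    return hits / len(test_events) if test_events else 0.0
```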
- AliExpress Learning-To-Rank: Maximizing Online Model Performance without Going Online [60.887637616379926]
This paper proposes an evaluator-generator framework for learning-to-rank.
It consists of an evaluator that generalizes to evaluate recommendations in context, and a generator that maximizes the evaluator score through reinforcement learning.
Our method achieves a significant improvement in terms of Conversion Rate (CR) over the industrial-level fine-tuned model in online A/B tests.
arXiv Detail & Related papers (2020-03-25T10:27:44Z)