On Estimating Recommendation Evaluation Metrics under Sampling
- URL: http://arxiv.org/abs/2103.01474v2
- Date: Wed, 3 Mar 2021 06:04:29 GMT
- Title: On Estimating Recommendation Evaluation Metrics under Sampling
- Authors: Ruoming Jin and Dong Li and Benjamin Mudrak and Jing Gao and Zhi Liu
- Abstract summary: There is still a lack of understanding and consensus on how sampling should be used for recommendation evaluation.
In this paper, we introduce a new research problem on learning the empirical rank distribution, and a new approach based on the estimated rank distribution, to estimate the top-k metrics.
- Score: 21.74579327147525
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Since the recent study by Krichene and Rendle (2020) on the
sampling-based top-k evaluation metric for recommendation, there has
been much debate about the validity of using sampling to evaluate
recommendation algorithms. Though their work and the recent work of Li et
al. (2020) have proposed some basic approaches for mapping the sampling-based
metrics to their global counterparts which rank the entire set of items, there
is still a lack of understanding and consensus on how sampling should be used
for recommendation evaluation. The proposed approaches are either rather
uninformative in linking sampling to metric evaluation, or work only for
simple metrics such as Recall/Precision (Krichene and Rendle 2020; Li et al.
2020). In this paper, we introduce a new research problem of learning the
empirical rank distribution, and a new approach based on the estimated rank
distribution, to estimate the top-k metrics. Since this question is closely
related to the underlying mechanism of sampling for recommendation, tackling it
can help us better understand the power of sampling and resolve the questions of
whether and how we should use sampling to evaluate recommendation algorithms.
We introduce two approaches, based on MLE (Maximum Likelihood Estimation) and
its weighted variants, and on the ME (Maximum Entropy) principle, to recover the
empirical rank distribution, and then utilize it for metric estimation. The
experimental results show the advantages of using the new approaches for
evaluating recommendation algorithms based on top-k metrics.
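To make the estimation problem concrete, the sketch below (a minimal illustration, not the authors' exact estimator; the function names are hypothetical) recovers an empirical global-rank distribution from sampled ranks via an EM-style maximum-likelihood fit, assuming the standard model in which, with n uniformly sampled negatives from a catalog of N items, a target with global rank R receives sampled rank r where (r - 1) ~ Binomial(n, (R - 1)/(N - 1)).

```python
import numpy as np
from math import comb

def sampled_rank_pmf(N, n):
    """A[r-1, R-1] = P(sampled rank r | global rank R); shape (n+1, N).

    Model: with n uniform negatives, (r - 1) ~ Binomial(n, (R - 1)/(N - 1)).
    """
    q = np.arange(N) / (N - 1)                  # per-rank "beat" probability
    k = np.arange(n + 1)[:, None]               # r - 1
    binom = np.array([comb(n, i) for i in range(n + 1)], dtype=float)[:, None]
    return binom * q ** k * (1.0 - q) ** (n - k)

def em_rank_distribution(obs_ranks, N, n, iters=300):
    """Maximum-likelihood estimate of the empirical global-rank distribution,
    fitted to observed sampled ranks by an EM loop over mixture weights."""
    A = sampled_rank_pmf(N, n)
    counts = np.bincount(np.asarray(obs_ranks) - 1, minlength=n + 1).astype(float)
    w = np.full(N, 1.0 / N)                     # start from a uniform rank profile
    for _ in range(iters):
        joint = A * w                           # joint[r, R] proportional to P(r, R)
        post = joint / joint.sum(axis=1, keepdims=True)   # posterior P(R | r)
        w = counts @ post                       # expected counts per global rank
        w /= w.sum()
    return w
```

Given the recovered weights w, a global top-k metric such as Recall@k can then be estimated as w[:k].sum().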
Related papers
- Metric-aware LLM inference for regression and scoring [52.764328080398805]
Large language models (LLMs) have demonstrated strong results on a range of NLP tasks.
We show that this inference strategy can be suboptimal for a range of regression and scoring tasks, and associated evaluation metrics.
We propose metric-aware LLM inference: a decision-theoretic approach that optimizes for custom regression and scoring metrics at inference time.
arXiv Detail & Related papers (2024-03-07T03:24:34Z) - Are We Wasting Time? A Fast, Accurate Performance Evaluation Framework
for Knowledge Graph Link Predictors [4.31947784387967]
In large-scale Knowledge Graphs, the ranking process rapidly becomes computationally heavy.
Previous approaches used random sampling of entities to assess the quality of links predicted or suggested by a method.
We show that this approach has serious limitations since the ranking metrics produced do not properly reflect true outcomes.
We propose a framework that uses relational recommenders to guide the selection of candidates for evaluation.
arXiv Detail & Related papers (2024-01-25T15:44:46Z) - Towards Better Evaluation of Instruction-Following: A Case-Study in
Summarization [9.686937153317809]
We perform a meta-evaluation of a variety of metrics to quantify how accurately they measure the instruction-following abilities of large language models.
Using riSum, we analyze the agreement between evaluation methods and human judgment.
arXiv Detail & Related papers (2023-10-12T15:07:11Z) - Improved Policy Evaluation for Randomized Trials of Algorithmic Resource
Allocation [54.72195809248172]
We present a new estimator leveraging our proposed novel concept, that involves retrospective reshuffling of participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
arXiv Detail & Related papers (2023-02-06T05:17:22Z) - CEREAL: Few-Sample Clustering Evaluation [4.569028973407756]
We focus on the underexplored problem of estimating clustering quality with limited labels.
We introduce CEREAL, a comprehensive framework for few-sample clustering evaluation.
Our results show that CEREAL reduces the area under the absolute error curve by up to 57% compared to the best sampling baseline.
arXiv Detail & Related papers (2022-09-30T19:52:41Z) - Sample Efficient Model Evaluation [30.72511219329606]
Given a collection of unlabelled data points, we address how to select which subset to label to best estimate test metrics.
We consider two sampling based approaches, namely the well-known Importance Sampling and we introduce a novel application of Poisson Sampling.
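As a hedged illustration on synthetic data (the surrogate score, pool, and all names are assumptions, not this paper's method), the two estimators can be sketched for estimating a mean test loss over an unlabelled pool under a fixed labelling budget:

```python
import numpy as np

rng = np.random.default_rng(1)
N_pool, m = 10_000, 500                        # pool size, labelling budget
losses = rng.exponential(size=N_pool)          # hypothetical per-example losses
true_mean = losses.mean()                      # target quantity to estimate

# Proposal: a defensive mixture of a cheap surrogate score (assumed to be
# correlated with the loss) and the uniform distribution, which bounds weights.
surrogate = np.clip(losses + rng.normal(scale=0.1, size=N_pool), 1e-3, None)
q = 0.5 * surrogate / surrogate.sum() + 0.5 / N_pool

# Importance sampling: draw m indices i.i.d. from q, reweight by 1 / (N q_i).
idx = rng.choice(N_pool, size=m, p=q)
is_estimate = np.mean(losses[idx] / (N_pool * q[idx]))

# Poisson sampling: include each example independently with prob pi_i ~ m q_i,
# then apply the Horvitz-Thompson estimator.
pi = np.clip(m * q, 0.0, 1.0)
included = rng.random(N_pool) < pi
ht_estimate = np.sum(losses[included] / pi[included]) / N_pool
```

Both estimators are unbiased; the defensive uniform component in q keeps the importance weights bounded, which is what makes the variance manageable in this sketch.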
arXiv Detail & Related papers (2021-09-24T16:03:58Z) - A Case Study on Sampling Strategies for Evaluating Neural Sequential
Item Recommendation Models [69.32128532935403]
Two well-known strategies to sample negative items are uniform random sampling and sampling by popularity.
We re-evaluate current state-of-the-art sequential recommender models from this point of view.
We find that both sampling strategies can produce inconsistent rankings compared with the full ranking of the models.
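The inconsistency can be reproduced with two synthetic "models" (an illustrative construction, not taken from the paper): one that always ranks the target moderately well, and one that ranks it first for half the users and last for the rest. The full-ranking metric prefers the second model, while the sampled metric prefers the first.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_neg, M = 1000, 99, 20_000     # catalog size, sampled negatives, test users

# Each "model" is summarized by the global rank it assigns the target item:
# Model A always ranks it 20th; Model B ranks it 1st or last, 50/50.
ranks_a = np.full(M, 20)
ranks_b = np.where(rng.random(M) < 0.5, 1, N)

def full_recall_at(ranks, k):
    """Recall@k against the full ranking of all N items."""
    return np.mean(ranks <= k)

def sampled_recall_at(ranks, k, rng):
    """Recall@k after ranking the target against n_neg uniform negatives:
    sampled rank = 1 + number of negatives that beat the target."""
    sampled = 1 + rng.binomial(n_neg, (ranks - 1) / (N - 1))
    return np.mean(sampled <= k)

full_a, full_b = full_recall_at(ranks_a, 10), full_recall_at(ranks_b, 10)
samp_a, samp_b = sampled_recall_at(ranks_a, 3, rng), sampled_recall_at(ranks_b, 3, rng)
# Full Recall@10 prefers Model B, while sampled Recall@3 prefers Model A:
# uniform negative sampling has flipped the ordering of the two models.
```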
arXiv Detail & Related papers (2021-07-27T19:06:03Z) - Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z) - A Statistical Analysis of Summarization Evaluation Metrics using
Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z) - PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative
Dialogue Systems [48.99561874529323]
There are three kinds of automatic methods for evaluating open-domain generative dialogue systems.
Due to the lack of systematic comparison, it is not clear which kind of metrics are more effective.
We propose a novel and feasible learning-based metric that can significantly improve the correlation with human judgments.
arXiv Detail & Related papers (2020-04-06T04:36:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences.