On Estimating Recommendation Evaluation Metrics under Sampling
- URL: http://arxiv.org/abs/2103.01474v2
- Date: Wed, 3 Mar 2021 06:04:29 GMT
- Title: On Estimating Recommendation Evaluation Metrics under Sampling
- Authors: Ruoming Jin and Dong Li and Benjamin Mudrak and Jing Gao and Zhi Liu
- Abstract summary: There is still a lack of understanding and consensus on how sampling should be used for recommendation evaluation.
In this paper, we introduce a new research problem on learning the empirical rank distribution, and a new approach based on the estimated rank distribution, to estimate the top-k metrics.
- Score: 21.74579327147525
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Since the recent study by Krichene and Rendle (2020) on sampling-based
top-k evaluation metrics for recommendation, there has been much debate on the
validity of using sampling to evaluate recommendation algorithms. Though their work
and the recent work of Li et al. (2020) have proposed some basic approaches for
mapping the sampling-based metrics to their global counterparts, which rank the
entire set of items, there is still a lack of understanding and consensus on how
sampling should be used for recommendation evaluation. The proposed approaches are
either rather uninformative (in linking sampling to metric evaluation) or work only
on simple metrics, such as Recall/Precision (Krichene and Rendle 2020; Li et al.
2020). In this paper, we introduce a new research problem on learning the
empirical rank distribution, and a new approach based on the estimated rank
distribution, to estimate the top-k metrics. Since this question is closely
related to the underlying mechanism of sampling for recommendation, tackling it
can help us better understand the power of sampling and resolve the questions of
whether and how sampling should be used for evaluating recommendation.
We introduce two approaches, based on MLE (Maximum Likelihood Estimation) and its
weighted variants and on the ME (Maximum Entropy) principle, to recover the
empirical rank distribution, and then utilize them for metric estimation. The
experimental results show the advantages of using the new approaches for
evaluating recommendation algorithms based on top-k metrics.
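To make the estimation pipeline concrete, the following is a minimal sketch (not the authors' released code) of one way to realize the MLE variant: treat each observed sampled rank as a draw from a mixture over global ranks, recover the empirical rank distribution with an EM-style maximum-likelihood fit, and read top-k metrics such as Recall@k directly off the estimated distribution. The catalogue size, sample count, and EM settings below are illustrative assumptions.

```python
# Hedged sketch of MLE-based recovery of the empirical rank distribution from
# sampled ranks, followed by top-k metric estimation. Illustrative only.
import numpy as np
from scipy.stats import hypergeom

def sampled_rank_likelihoods(n_items, n_sampled):
    """L[s, R-1] = P(sampled rank = s + 1 | global rank = R).

    With one relevant target item and n_sampled negatives drawn uniformly from
    the n_items - 1 non-relevant items, the number of sampled negatives that
    outrank the target is hypergeometric.
    """
    beat_counts = np.arange(n_sampled + 1)           # sampled negatives outranking the target
    cols = [hypergeom(n_items - 1, R - 1, n_sampled).pmf(beat_counts)
            for R in range(1, n_items + 1)]
    return np.stack(cols, axis=1)                    # shape (n_sampled + 1, n_items)

def estimate_rank_distribution(sampled_ranks, n_items, n_sampled, n_iter=100):
    """Maximum-likelihood estimate of the global-rank distribution via EM."""
    L = sampled_rank_likelihoods(n_items, n_sampled)
    L_obs = L[np.asarray(sampled_ranks) - 1]         # (n_obs, n_items), fixed per run
    p = np.full(n_items, 1.0 / n_items)              # start from the uniform distribution
    for _ in range(n_iter):
        resp = L_obs * p                             # E-step: responsibilities
        resp /= resp.sum(axis=1, keepdims=True)
        p = resp.mean(axis=0)                        # M-step: new mixture weights
    return p

def recall_at_k(rank_distribution, k):
    """Under leave-one-out evaluation, Recall@k is the mass of ranks <= k."""
    return float(rank_distribution[:k].sum())

# Toy usage: 2,000 test interactions, each scored against 99 sampled negatives
# out of a catalogue of 1,000 items (all numbers made up for illustration).
rng = np.random.default_rng(0)
true_ranks = rng.integers(1, 1001, size=2000)
sampled_ranks = 1 + rng.hypergeometric(true_ranks - 1, 1000 - true_ranks, 99)
p_hat = estimate_rank_distribution(sampled_ranks, n_items=1000, n_sampled=99)
print("estimated Recall@50:", recall_at_k(p_hat, 50))
print("empirical Recall@50:", np.mean(true_ranks <= 50))
```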
Related papers
- Improved Estimation of Ranks for Learning Item Recommenders with Negative Sampling [4.316676800486521]
In recommendation systems, the number of recommendable items has grown, making it costly to score and rank every item.
To lower this cost, it has become common to sample negative items.
In this work, we demonstrate the benefits from correcting the bias introduced by sampling of negatives.
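As a rough illustration of the kind of bias such a correction targets (an assumption on our part, not the estimator proposed in the paper above): if b of the n uniformly sampled negatives outscore the target item out of N total negatives, then E[b] = n·B/N for the true count B, so rescaling by N/n roughly de-biases the estimated rank.

```python
# Hedged sketch: bias of naive sampled ranks vs. a simple rescaling correction.
# Sizes and the correction itself are illustrative, not taken from the paper.
import numpy as np

rng = np.random.default_rng(1)
N_NEGATIVES = 10_000        # total negatives per test case (assumed)
N_SAMPLED = 100             # negatives actually scored per test case (assumed)

true_ranks = rng.integers(1, N_NEGATIVES + 2, size=5000)      # 1 = best possible
beating = true_ranks - 1                                       # negatives that outscore the target
b = rng.hypergeometric(beating, N_NEGATIVES - beating, N_SAMPLED)

naive_rank = 1 + b                                             # rank within the sampled set only
corrected_rank = 1 + b * (N_NEGATIVES / N_SAMPLED)             # rescaled to the full catalogue

print("mean true rank:     ", true_ranks.mean())
print("mean naive rank:    ", naive_rank.mean())               # far too optimistic
print("mean corrected rank:", corrected_rank.mean())           # roughly unbiased
```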
arXiv Detail & Related papers (2024-10-08T21:09:55Z)
- Active Evaluation Acquisition for Efficient LLM Benchmarking [18.85604491151409]
We investigate strategies to improve evaluation efficiency by selecting a subset of examples from each benchmark using a learned policy.
Our approach models the dependencies across test examples, allowing accurate prediction of the evaluation outcomes for the remaining examples.
Empirical results demonstrate that our approach significantly reduces the number of evaluation prompts required.
arXiv Detail & Related papers (2024-10-08T12:08:46Z)
- Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm.
arXiv Detail & Related papers (2024-10-02T08:46:34Z)
- Regression-aware Inference with LLMs [52.764328080398805]
We show that standard inference strategies can be sub-optimal for common regression and scoring evaluation metrics.
We propose alternate inference strategies that estimate the Bayes-optimal solution for regression and scoring metrics in closed-form from sampled responses.
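A hedged illustration of that general principle (not necessarily the paper's exact procedure): under squared error, the Bayes-optimal point prediction given the model's predictive distribution is its mean, and under absolute error it is the median, so both can be computed in closed form from sampled responses rather than from a single decoded answer. The numeric samples below are placeholders.

```python
# Hedged sketch: closed-form Bayes-optimal predictions from sampled responses.
import numpy as np

def bayes_optimal_prediction(sampled_values, metric="squared_error"):
    values = np.asarray(sampled_values, dtype=float)
    if metric == "squared_error":
        return values.mean()       # minimizes expected (y - a)^2
    if metric == "absolute_error":
        return np.median(values)   # minimizes expected |y - a|
    raise ValueError(f"unsupported metric: {metric}")

# e.g. ten sampled (temperature > 0) answers to a numeric rating question:
samples = [7, 8, 7, 9, 7, 8, 10, 7, 8, 7]
print(bayes_optimal_prediction(samples, "squared_error"))    # 7.8
print(bayes_optimal_prediction(samples, "absolute_error"))   # 7.5
```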
arXiv Detail & Related papers (2024-03-07T03:24:34Z)
- Are We Wasting Time? A Fast, Accurate Performance Evaluation Framework
for Knowledge Graph Link Predictors [4.31947784387967]
In large-scale Knowledge Graphs, the ranking process rapidly becomes computationally heavy.
Previous approaches used random sampling of entities to assess the quality of links predicted or suggested by a method.
We show that this approach has serious limitations, since the ranking metrics it produces do not properly reflect the true outcomes.
We propose a framework that uses relational recommenders to guide the selection of candidates for evaluation.
arXiv Detail & Related papers (2024-01-25T15:44:46Z)
- Towards Better Evaluation of Instruction-Following: A Case-Study in
Summarization [9.686937153317809]
We perform a meta-evaluation of a variety of metrics to quantify how accurately they measure the instruction-following abilities of large language models.
Using riSum, we analyze the agreement between evaluation methods and human judgment.
arXiv Detail & Related papers (2023-10-12T15:07:11Z)
- Improved Policy Evaluation for Randomized Trials of Algorithmic Resource
Allocation [54.72195809248172]
We present a new estimator built on a novel concept: retrospectively reshuffling participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
arXiv Detail & Related papers (2023-02-06T05:17:22Z)
- CEREAL: Few-Sample Clustering Evaluation [4.569028973407756]
We focus on the underexplored problem of estimating clustering quality with limited labels.
We introduce CEREAL, a comprehensive framework for few-sample clustering evaluation.
Our results show that CEREAL reduces the area under the absolute error curve by up to 57% compared to the best sampling baseline.
arXiv Detail & Related papers (2022-09-30T19:52:41Z)
- Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interacting with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
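For context, here is a minimal sketch of the random-perturbation baseline mentioned above (a Gaussian-smoothing, evolution-strategies-style estimator), not of the paper's probabilistic-model-based method; the step sizes and toy objective are assumptions.

```python
# Hedged sketch: estimating a local policy gradient from random perturbations of
# the policy parameters (the baseline referenced above), on a toy objective.
import numpy as np

def perturbation_gradient(objective, theta, n_samples=32, sigma=0.1, rng=None):
    """Zeroth-order gradient estimate via Gaussian perturbations of theta."""
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal((n_samples, theta.size))
    returns = np.array([objective(theta + sigma * e) for e in eps])
    adv = returns - returns.mean()           # baseline subtraction for lower variance
    return (adv[:, None] * eps).mean(axis=0) / sigma

# Toy usage on a quadratic "return" surface (purely illustrative):
objective = lambda th: -np.sum((th - 1.0) ** 2)
rng = np.random.default_rng(0)
theta = np.zeros(3)
for _ in range(200):
    theta += 0.05 * perturbation_gradient(objective, theta, rng=rng)
print(theta)                                 # drifts toward the optimum at [1, 1, 1]
```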
arXiv Detail & Related papers (2021-06-22T16:07:02Z)
- A Statistical Analysis of Summarization Evaluation Metrics using
Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent metrics, QAEval and BERTScore, do so in some evaluation settings.
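As a generic sketch of this kind of resampling analysis (the paper's own protocol is more involved): bootstrap the test instances to put a confidence interval on how strongly an automatic metric's scores correlate with human judgments. The data below are synthetic placeholders.

```python
# Hedged sketch: bootstrap confidence interval for metric-human correlation.
# Synthetic data; real analyses use per-summary metric scores and human ratings.
import numpy as np

def bootstrap_correlation_ci(metric_scores, human_scores, n_boot=5000, alpha=0.05, seed=0):
    metric_scores = np.asarray(metric_scores, dtype=float)
    human_scores = np.asarray(human_scores, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(metric_scores)
    corrs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)     # resample test instances with replacement
        corrs[i] = np.corrcoef(metric_scores[idx], human_scores[idx])[0, 1]
    lo, hi = np.percentile(corrs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Synthetic stand-in for 200 summaries scored by a metric and by humans:
rng = np.random.default_rng(42)
human = rng.normal(size=200)
metric = 0.5 * human + rng.normal(size=200)     # a weakly correlated metric
print(bootstrap_correlation_ci(metric, human))  # a fairly wide interval around ~0.45
```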
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
- PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative
Dialogue Systems [48.99561874529323]
There are three kinds of automatic methods for evaluating open-domain generative dialogue systems.
Due to the lack of systematic comparison, it is not clear which kind of metrics are more effective.
We propose a novel and feasible learning-based metric that can significantly improve the correlation with human judgments.
arXiv Detail & Related papers (2020-04-06T04:36:33Z)