On Estimating Recommendation Evaluation Metrics under Sampling
- URL: http://arxiv.org/abs/2103.01474v2
- Date: Wed, 3 Mar 2021 06:04:29 GMT
- Title: On Estimating Recommendation Evaluation Metrics under Sampling
- Authors: Ruoming Jin and Dong Li and Benjamin Mudrak and Jing Gao and Zhi Liu
- Abstract summary: There is still a lack of understanding and consensus on how sampling should be used for recommendation evaluation.
In this paper, we introduce a new research problem on learning the empirical rank distribution, and a new approach based on the estimated rank distribution, to estimate the top-k metrics.
- Score: 21.74579327147525
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Since the recent study by Krichene and Rendle (2020) on the
sampling-based top-k evaluation metric for recommendation, there has
been much debate about the validity of using sampling to evaluate
recommendation algorithms. Though their work and the recent work of Li et
al. (2020) have proposed some basic approaches for mapping the sampling-based
metrics to their global counterparts which rank the entire set of items, there
is still a lack of understanding and consensus on how sampling should be used
for recommendation evaluation. The proposed approaches are either rather
uninformative in linking sampling to metric evaluation, or work only for
simple metrics such as Recall/Precision (Krichene and Rendle 2020; Li et al.
2020). In this paper, we introduce a new research problem of learning the
empirical rank distribution, and a new approach based on the estimated rank
distribution, to estimate the top-k metrics. Since this question is closely
related to the underlying mechanism of sampling for recommendation, tackling it
can help us better understand the power of sampling and resolve the questions of
whether and how we should use sampling to evaluate recommendation algorithms.
We introduce two approaches, based on MLE (Maximum Likelihood Estimation) and
its weighted variants, and on the ME (Maximum Entropy) principle, to recover the
empirical rank distribution, and then utilize it for metric estimation. The
experimental results show the advantages of using the new approaches for
evaluating recommendation algorithms based on top-k metrics.
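To make the estimation problem concrete, the sketch below (a minimal illustration, not the authors' exact estimator; the function names are hypothetical) recovers an empirical global-rank distribution from sampled ranks via an EM-style maximum-likelihood fit, assuming the standard model in which, with n uniformly sampled negatives from a catalog of N items, a target with global rank R receives sampled rank r where (r - 1) ~ Binomial(n, (R - 1)/(N - 1)).

```python
import numpy as np
from math import comb

def sampled_rank_pmf(N, n):
    """A[r-1, R-1] = P(sampled rank r | global rank R); shape (n+1, N).

    Model: with n uniform negatives, (r - 1) ~ Binomial(n, (R - 1)/(N - 1)).
    """
    q = np.arange(N) / (N - 1)                  # per-rank "beat" probability
    k = np.arange(n + 1)[:, None]               # r - 1
    binom = np.array([comb(n, i) for i in range(n + 1)], dtype=float)[:, None]
    return binom * q ** k * (1.0 - q) ** (n - k)

def em_rank_distribution(obs_ranks, N, n, iters=300):
    """Maximum-likelihood estimate of the empirical global-rank distribution,
    fitted to observed sampled ranks by an EM loop over mixture weights."""
    A = sampled_rank_pmf(N, n)
    counts = np.bincount(np.asarray(obs_ranks) - 1, minlength=n + 1).astype(float)
    w = np.full(N, 1.0 / N)                     # start from a uniform rank profile
    for _ in range(iters):
        joint = A * w                           # joint[r, R] proportional to P(r, R)
        post = joint / joint.sum(axis=1, keepdims=True)   # posterior P(R | r)
        w = counts @ post                       # expected counts per global rank
        w /= w.sum()
    return w
```

Given the recovered weights w, a global top-k metric such as Recall@k can then be estimated as w[:k].sum().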
Related papers
- Metric-aware LLM inference for regression and scoring [52.764328080398805]
Large language models (LLMs) have demonstrated strong results on a range of NLP tasks.
We show that this inference strategy can be suboptimal for a range of regression and scoring tasks, and associated evaluation metrics.
We propose metric-aware LLM inference: a decision-theoretic approach that optimizes for custom regression and scoring metrics at inference time.
arXiv Detail & Related papers (2024-03-07T03:24:34Z) - Are We Wasting Time? A Fast, Accurate Performance Evaluation Framework
for Knowledge Graph Link Predictors [4.31947784387967]
In large-scale Knowledge Graphs, the ranking process rapidly becomes computationally heavy.
Previous approaches used random sampling of entities to assess the quality of links predicted or suggested by a method.
We show that this approach has serious limitations since the ranking metrics produced do not properly reflect true outcomes.
We propose a framework that uses relational recommenders to guide the selection of candidates for evaluation.
arXiv Detail & Related papers (2024-01-25T15:44:46Z) - Towards Better Evaluation of Instruction-Following: A Case-Study in
Summarization [9.686937153317809]
We perform a meta-evaluation of a variety of metrics to quantify how accurately they measure the instruction-following abilities of large language models.
Using riSum, we analyze the agreement between evaluation methods and human judgment.
arXiv Detail & Related papers (2023-10-12T15:07:11Z) - Improved Policy Evaluation for Randomized Trials of Algorithmic Resource
Allocation [54.72195809248172]
We present a new estimator leveraging our proposed novel concept, that involves retrospective reshuffling of participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
arXiv Detail & Related papers (2023-02-06T05:17:22Z) - CEREAL: Few-Sample Clustering Evaluation [4.569028973407756]
We focus on the underexplored problem of estimating clustering quality with limited labels.
We introduce CEREAL, a comprehensive framework for few-sample clustering evaluation.
Our results show that CEREAL reduces the area under the absolute error curve by up to 57% compared to the best sampling baseline.
arXiv Detail & Related papers (2022-09-30T19:52:41Z) - Sample Efficient Model Evaluation [30.72511219329606]
Given a collection of unlabelled data points, we address how to select which subset to label to best estimate test metrics.
We consider two sampling based approaches, namely the well-known Importance Sampling and we introduce a novel application of Poisson Sampling.
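As a hedged illustration on synthetic data (the surrogate score, pool, and all names are assumptions, not this paper's method), the two estimators can be sketched for estimating a mean test loss over an unlabelled pool under a fixed labelling budget:

```python
import numpy as np

rng = np.random.default_rng(1)
N_pool, m = 10_000, 500                        # pool size, labelling budget
losses = rng.exponential(size=N_pool)          # hypothetical per-example losses
true_mean = losses.mean()                      # target quantity to estimate

# Proposal: a defensive mixture of a cheap surrogate score (assumed to be
# correlated with the loss) and the uniform distribution, which bounds weights.
surrogate = np.clip(losses + rng.normal(scale=0.1, size=N_pool), 1e-3, None)
q = 0.5 * surrogate / surrogate.sum() + 0.5 / N_pool

# Importance sampling: draw m indices i.i.d. from q, reweight by 1 / (N q_i).
idx = rng.choice(N_pool, size=m, p=q)
is_estimate = np.mean(losses[idx] / (N_pool * q[idx]))

# Poisson sampling: include each example independently with prob pi_i ~ m q_i,
# then apply the Horvitz-Thompson estimator.
pi = np.clip(m * q, 0.0, 1.0)
included = rng.random(N_pool) < pi
ht_estimate = np.sum(losses[included] / pi[included]) / N_pool
```

Both estimators are unbiased; the defensive uniform component in q keeps the importance weights bounded, which is what makes the variance manageable in this sketch.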
arXiv Detail & Related papers (2021-09-24T16:03:58Z) - A Case Study on Sampling Strategies for Evaluating Neural Sequential
Item Recommendation Models [69.32128532935403]
Two well-known strategies to sample negative items are uniform random sampling and sampling by popularity.
We re-evaluate current state-of-the-art sequential recommender models from this point of view.
We find that both sampling strategies can produce inconsistent rankings compared with the full ranking of the models.
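The inconsistency can be reproduced with two synthetic "models" (an illustrative construction, not taken from the paper): one that always ranks the target moderately well, and one that ranks it first for half the users and last for the rest. The full-ranking metric prefers the second model, while the sampled metric prefers the first.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_neg, M = 1000, 99, 20_000     # catalog size, sampled negatives, test users

# Each "model" is summarized by the global rank it assigns the target item:
# Model A always ranks it 20th; Model B ranks it 1st or last, 50/50.
ranks_a = np.full(M, 20)
ranks_b = np.where(rng.random(M) < 0.5, 1, N)

def full_recall_at(ranks, k):
    """Recall@k against the full ranking of all N items."""
    return np.mean(ranks <= k)

def sampled_recall_at(ranks, k, rng):
    """Recall@k after ranking the target against n_neg uniform negatives:
    sampled rank = 1 + number of negatives that beat the target."""
    sampled = 1 + rng.binomial(n_neg, (ranks - 1) / (N - 1))
    return np.mean(sampled <= k)

full_a, full_b = full_recall_at(ranks_a, 10), full_recall_at(ranks_b, 10)
samp_a, samp_b = sampled_recall_at(ranks_a, 3, rng), sampled_recall_at(ranks_b, 3, rng)
# Full Recall@10 prefers Model B, while sampled Recall@3 prefers Model A:
# uniform negative sampling has flipped the ordering of the two models.
```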
arXiv Detail & Related papers (2021-07-27T19:06:03Z) - Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z) - A Statistical Analysis of Summarization Evaluation Metrics using
Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z) - PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative
Dialogue Systems [48.99561874529323]
There are three kinds of automatic methods for evaluating open-domain generative dialogue systems.
Due to the lack of systematic comparison, it is not clear which kind of metrics are more effective.
We propose a novel and feasible learning-based metric that can significantly improve the correlation with human judgments.
arXiv Detail & Related papers (2020-04-06T04:36:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences.