Related papers: Reliable Evaluation Protocol for Low-Precision Retrieval

URL: http://arxiv.org/abs/2508.03306v2
Date: Wed, 06 Aug 2025 02:48:59 GMT
Title: Reliable Evaluation Protocol for Low-Precision Retrieval
Authors: Kisu Yang, Yoonna Jang, Hwanseok Jang, Kenneth Choi, Isabelle Augenstein, Heuiseok Lim
Abstract summary: We propose a more robust retrieval evaluation protocol designed to reduce score variation. It consists of: (1) High-Precision Scoring (HPS), which upcasts the final scoring step to higher precision to resolve tied candidates with minimal computational cost; and (2) Tie-aware Retrieval Metrics (TRM), which report expected scores, range, and bias to quantify order uncertainty of tied candidates.
Score: 34.65522226937288
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Lowering the numerical precision of model parameters and computations is widely adopted to improve the efficiency of retrieval systems. However, when computing relevance scores between the query and documents in low-precision, we observe spurious ties due to the reduced granularity. This introduces high variability in the results based on tie resolution, making the evaluation less reliable. To address this, we propose a more robust retrieval evaluation protocol designed to reduce score variation. It consists of: (1) High-Precision Scoring (HPS), which upcasts the final scoring step to higher precision to resolve tied candidates with minimal computational cost; and (2) Tie-aware Retrieval Metrics (TRM), which report expected scores, range, and bias to quantify order uncertainty of tied candidates. Our experiments test multiple models with three scoring functions on two retrieval datasets to demonstrate that HPS dramatically reduces tie-induced instability, and TRM accurately recovers expected metric values. This combination enables a more consistent and reliable evaluation system for lower-precision retrievals.
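
Illustrative sketch (not the authors' code): the short Python example below shows how a spurious tie can arise when a dot product is rounded to half precision, how scoring the same inputs at higher precision resolves it, and how an expected value and range can be reported for a tied candidate. All names here (to_half, score_low, score_high, tie_aware_rr) and the toy vectors are hypothetical. Reciprocal rank with uniformly random tie-breaking is used only as a simple stand-in metric and may differ from the paper's exact TRM definitions.

```python
import struct

def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def score_low(q, d):
    """Dot product with every intermediate result rounded to half precision."""
    acc = 0.0
    for a, b in zip(q, d):
        acc = to_half(acc + to_half(a * b))
    return acc

def score_high(q, d):
    """Same half-precision inputs, but multiplied and summed in float64."""
    return sum(a * b for a, b in zip(q, d))

def tie_aware_rr(scores, relevant_id):
    """Return (expected, min, max) reciprocal rank of one relevant document,
    assuming ties are broken uniformly at random (an assumption of this sketch)."""
    s = scores[relevant_id]
    higher = sum(1 for v in scores.values() if v > s)
    tied = sum(1 for v in scores.values() if v == s)  # includes the relevant doc
    ranks = range(higher + 1, higher + tied + 1)
    expected = sum(1.0 / r for r in ranks) / tied
    return expected, 1.0 / (higher + tied), 1.0 / (higher + 1)

# Toy data: every component is exactly representable in half precision.
h = 2.0 ** -10  # spacing of binary16 values in [1, 2)
query = [1 + h, 1.0]
docs = {
    "d0": [1 + h, 0.0],      # relevant; exact score is 1 + 2h + h^2
    "d1": [0.0, 1 + 2 * h],  # exact score is 1 + 2h
    "d2": [1.0, 0.0],        # exact score is 1 + h
    "d3": [0.0, 1 + 3 * h],  # exact score is 1 + 3h
}

low = {k: score_low(query, v) for k, v in docs.items()}
high = {k: score_high(query, v) for k, v in docs.items()}

print("tie in low precision:", low["d0"] == low["d1"])
print("tie after upcast:", high["d0"] == high["d1"])
print("low-precision RR (expected, min, max):", tie_aware_rr(low, "d0"))
print("high-precision RR (expected, min, max):", tie_aware_rr(high, "d0"))
```

In this toy case d0 and d1 differ only by h^2 = 2^-20, which is below half-precision resolution near 1, so the low-precision ranking of d0 depends on how the tie is broken, while the higher-precision score places it unambiguously at rank 2.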

Related papers

AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking [25.459771464139855]
Listwise reranking with large language models (LLMs) enhances top-ranked results in retrieval-based applications. We propose AcuRank, an adaptive reranking framework that dynamically adjusts both the amount and target of computation based on uncertainty estimates over document relevance. Results on the TREC-DL and BEIR benchmarks show that our method consistently achieves a superior accuracy-efficiency trade-off and scales better with compute than fixed-computation baselines.
arXiv Detail & Related papers (2025-05-24T05:15:49Z)
Breaking the Lens of the Telescope: Online Relevance Estimation over Large Retrieval Sets [14.494301139974455]
We propose a novel paradigm for re-ranking called online relevance estimation. Online relevance estimation continuously updates relevance estimates for a query throughout the ranking process. We validate our approach on TREC benchmarks under two scenarios: hybrid retrieval and adaptive retrieval.
arXiv Detail & Related papers (2025-04-12T22:05:50Z)
Semiparametric conformal prediction [79.6147286161434]
We construct a conformal prediction set accounting for the joint correlation structure of the vector-valued non-conformity scores. We flexibly estimate the joint cumulative distribution function (CDF) of the scores. Our method yields desired coverage and competitive efficiency on a range of real-world regression problems.
arXiv Detail & Related papers (2024-11-04T14:29:02Z)
Bayesian Prediction-Powered Inference [62.2436697657307]
Prediction-powered inference (PPI) is a method that improves statistical estimates based on limited human-labeled data. We propose a framework for PPI based on Bayesian inference that allows researchers to develop new task-appropriate PPI methods easily.
arXiv Detail & Related papers (2024-05-09T18:08:58Z)
Query Performance Prediction using Relevance Judgments Generated by Large Language Models [53.97064615557883]
We propose a new query performance prediction (QPP) framework using automatically generated relevance judgments (QPP-GenRE). QPP-GenRE decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to a given query. We predict an item's relevance by using open-source large language models (LLMs) to ensure scientific reproducibility.
arXiv Detail & Related papers (2024-04-01T09:33:05Z)
Optimal Cross-Validation for Sparse Linear Regression [5.156484100374059]
We use k-fold cross-validation to select sparsity and robustness of linear regressors. Cross-validation substantially increases the computational cost of sparse regression. We improve upon this state of affairs by solving 50-80% fewer mixed-integer optimization problems.
arXiv Detail & Related papers (2023-06-26T17:02:45Z)
Mutual Wasserstein Discrepancy Minimization for Sequential Recommendation [82.0801585843835]
We propose a novel self-supervised learning framework based on Mutual WasserStein discrepancy minimization (MStein) for sequential recommendation. We also propose a novel contrastive learning loss based on Wasserstein Discrepancy Measurement.
arXiv Detail & Related papers (2023-01-28T13:38:48Z)
Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
How reliably an automatic summarization evaluation metric replicates human judgments of summary quality is quantified by system-level correlations. We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z)
Newer is not always better: Rethinking transferability metrics, their peculiarities, stability and performance [5.650647159993238]
Fine-tuning of large pre-trained image and language models on small customized datasets has become increasingly popular. We show that the statistical problems with covariance estimation drive the poor performance of H-score. We propose a correction and recommend measuring correlation performance against relative accuracy in such settings.
arXiv Detail & Related papers (2021-10-13T17:24:12Z)
Enhanced Doubly Robust Learning for Debiasing Post-click Conversion Rate Estimation [29.27760413892272]
Post-click conversion, as a strong signal indicating the user preference, is salutary for building recommender systems. Currently, most existing methods utilize counterfactual learning to debias recommender systems. We propose a novel double learning approach for the MRDR estimator, which can convert the error imputation into the general CVR estimation.
arXiv Detail & Related papers (2021-05-28T06:59:49Z)
A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are. Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z)

This list is automatically generated from the titles and abstracts of the papers on this site.