Measuring Hypothesis Testing Errors in the Evaluation of Retrieval Systems
- URL: http://arxiv.org/abs/2507.07924v1
- Date: Thu, 10 Jul 2025 17:06:24 GMT
- Title: Measuring Hypothesis Testing Errors in the Evaluation of Retrieval Systems
- Authors: Jack McKechnie, Graham McDonald, Craig Macdonald
- Abstract summary: Evaluation of Information Retrieval systems typically uses query-document pairs with corresponding human-labelled relevance assessments (qrels). Discriminative power, i.e. the ability to correctly identify significant differences between systems, is important for drawing accurate conclusions on the robustness of qrels. We quantify Type II errors and propose that balanced classification metrics, such as balanced accuracy, can be used to portray the discriminative power of qrels.
- Score: 12.434785821674055
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The evaluation of Information Retrieval (IR) systems typically uses query-document pairs with corresponding human-labelled relevance assessments (qrels). These qrels are used to determine if one system is better than another based on average retrieval performance. Acquiring large volumes of human relevance assessments is expensive. Therefore, more efficient relevance assessment approaches have been proposed, necessitating comparisons between qrels to ascertain their efficacy. Discriminative power, i.e. the ability to correctly identify significant differences between systems, is important for drawing accurate conclusions on the robustness of qrels. Previous work has measured the proportion of pairs of systems that are identified as significantly different and has quantified Type I statistical errors. Type I errors lead to incorrect conclusions due to false positive significance tests. We argue that also identifying Type II errors (false negatives) is important as they lead science in the wrong direction. We quantify Type II errors and propose that balanced classification metrics, such as balanced accuracy, can be used to portray the discriminative power of qrels. We perform experiments using qrels generated using alternative relevance assessment methods to investigate measuring hypothesis testing errors in IR evaluation. We find that additional insights into the discriminative power of qrels can be gained by quantifying Type II errors, and that balanced classification metrics can be used to give an overall summary of discriminative power in one, easily comparable, number.
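The abstract describes comparing the significance decisions produced by an alternative set of qrels against those produced by the original qrels, counting Type I and Type II errors, and summarising the result with balanced accuracy. The sketch below is a minimal illustration of that idea, not the authors' code; the function names and the toy decision lists are hypothetical.

```python
# A minimal sketch (not the authors' code): treat significance decisions made
# with the original qrels as ground truth, count hypothesis-testing errors made
# by an alternative set of qrels, and summarise them with balanced accuracy.

def count_errors(truth_decisions, alt_decisions):
    """truth_decisions / alt_decisions: lists of booleans, one per system pair,
    True = 'significantly different' under that qrel set."""
    tp = fp = tn = fn = 0
    for truth, alt in zip(truth_decisions, alt_decisions):
        if truth and alt:
            tp += 1          # both qrel sets detect a significant difference
        elif not truth and alt:
            fp += 1          # Type I error: false positive significance test
        elif not truth and not alt:
            tn += 1          # both agree there is no significant difference
        else:
            fn += 1          # Type II error: a real difference is missed
    return tp, fp, tn, fn

def balanced_accuracy(tp, fp, tn, fn):
    # Mean of sensitivity (recall on 'significant') and specificity (recall on
    # 'not significant'); robust to the class imbalance that arises because
    # most system pairs are not significantly different.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return 0.5 * (sensitivity + specificity)

# Toy example: six system pairs with hypothetical significance decisions.
truth = [True, True, False, False, False, True]
alt   = [True, False, False, True, False, True]
tp, fp, tn, fn = count_errors(truth, alt)
print(f"Type I errors: {fp}, Type II errors: {fn}, "
      f"balanced accuracy: {balanced_accuracy(tp, fp, tn, fn):.2f}")
```

Because balanced accuracy averages the per-class recalls, it condenses both error types into one easily comparable number, which is the summary role the abstract proposes for it.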
Related papers
- Practical Improvements of A/B Testing with Off-Policy Estimation [51.25970890274447]
We introduce a family of unbiased off-policy estimators that achieves lower variance than the standard approach. Our theoretical analysis and experimental results validate the effectiveness and practicality of the proposed method.
arXiv Detail & Related papers (2025-06-12T13:11:01Z)
- Algorithmic Accountability in Small Data: Sample-Size-Induced Bias Within Classification Metrics [0.0]
We show the significance of sample-size bias in classification metrics. This revelation challenges the efficacy of these metrics in assessing bias with high resolution. We propose a model-agnostic assessment and correction technique.
arXiv Detail & Related papers (2025-05-06T22:02:53Z)
- Variations in Relevance Judgments and the Shelf Life of Test Collections [50.060833338921945]
We reproduce prior work in the neural retrieval setting, showing that assessor disagreement does not affect system rankings. We observe that some models substantially degrade with our new relevance judgments, and some have already reached the effectiveness of humans as rankers.
arXiv Detail & Related papers (2025-02-28T10:46:56Z)
- Towards Reliable Testing for Multiple Information Retrieval System Comparisons [2.9180406633632523]
We use a new approach to assess the reliability of multiple comparison procedures using simulated and real TREC data. Experiments show that Wilcoxon plus the Benjamini-Hochberg correction yields Type I error rates in line with the significance level for typical sample sizes (see the sketch after this list).
arXiv Detail & Related papers (2025-01-07T16:48:21Z)
- Comprehensive Equity Index (CEI): Definition and Application to Bias Evaluation in Biometrics [47.762333925222926]
We present a novel metric to quantify biased behaviors of machine learning models.
We focus on and apply it to the operational evaluation of face recognition systems.
arXiv Detail & Related papers (2024-09-03T14:19:38Z)
- $F_β$-plot -- a visual tool for evaluating imbalanced data classifiers [0.0]
The paper proposes a simple approach to analyzing the popular parametric metric $F_\beta$.
For a given pool of analyzed classifiers, the plot indicates when a given model should be preferred, depending on user requirements.
arXiv Detail & Related papers (2024-04-11T18:07:57Z)
- Query Performance Prediction using Relevance Judgments Generated by Large Language Models [53.97064615557883]
We propose a new query performance prediction (QPP) framework using automatically generated relevance judgments (QPP-GenRE). QPP-GenRE decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to a given query. We predict an item's relevance using open-source large language models (LLMs) to ensure scientific reproducibility.
arXiv Detail & Related papers (2024-04-01T09:33:05Z)
- Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
How reliably an automatic summarization evaluation metric replicates human judgments of summary quality is quantified by system-level correlations.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z)
- Information-Theoretic Bias Assessment Of Learned Representations Of Pretrained Face Recognition [18.07966649678408]
We propose an information-theoretic, independent bias assessment metric to identify the degree of bias against protected demographic attributes.
Our metric differs from other methods that rely on classification accuracy or examine the differences between ground truth and predicted labels of protected attributes predicted using a shallow network.
arXiv Detail & Related papers (2021-11-08T17:41:17Z)
- Identifying Incorrect Classifications with Balanced Uncertainty [21.130311978327196]
Uncertainty estimation is critical for cost-sensitive deep-learning applications.
We propose the notion of distributional imbalance to model the imbalance in uncertainty estimation as two kinds of distribution biases.
We then propose the Balanced True Class Probability framework, which learns an uncertainty estimator with a novel Distributional Focal Loss objective.
arXiv Detail & Related papers (2021-10-15T11:52:31Z)
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
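The related entry on reliable testing for multiple IR system comparisons mentions pairing the Wilcoxon signed-rank test with the Benjamini-Hochberg correction. The sketch below is not that paper's code: it assumes SciPy and statsmodels are available, and the per-topic nDCG scores and system names are hypothetical.

```python
# A minimal sketch, assuming per-topic effectiveness scores are available as a
# dict {system_name: list_of_per_topic_scores}. This is not the cited paper's
# implementation, just the standard Wilcoxon + Benjamini-Hochberg recipe.
from itertools import combinations

from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

def pairwise_wilcoxon_bh(per_topic_scores, alpha=0.05):
    """Run a paired Wilcoxon signed-rank test for every pair of systems and
    control the false discovery rate with the Benjamini-Hochberg procedure."""
    pairs = list(combinations(per_topic_scores, 2))
    p_values = [wilcoxon(per_topic_scores[a], per_topic_scores[b]).pvalue
                for a, b in pairs]
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=alpha,
                                             method="fdr_bh")
    return {pair: (p_adj, sig)
            for pair, p_adj, sig in zip(pairs, p_adjusted, reject)}

# Hypothetical per-topic nDCG scores for three systems over five topics.
scores = {
    "bm25":   [0.31, 0.42, 0.28, 0.55, 0.40],
    "dense":  [0.38, 0.47, 0.35, 0.58, 0.44],
    "hybrid": [0.40, 0.49, 0.33, 0.60, 0.46],
}
for (a, b), (p_adj, sig) in pairwise_wilcoxon_bh(scores).items():
    print(f"{a} vs {b}: adjusted p = {p_adj:.3f}, significant = {sig}")
```

The significance decisions produced by such a procedure under two different qrel sets are exactly the inputs that the Type I/Type II error counting and balanced-accuracy summary sketched earlier would consume.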