Quantifying Variance in Evaluation Benchmarks

URL: http://arxiv.org/abs/2406.10229v1
Date: Fri, 14 Jun 2024 17:59:54 GMT
Title: Quantifying Variance in Evaluation Benchmarks
Authors: Lovish Madaan, Aaditya K. Singh, Rylan Schaeffer, Andrew Poulton, Sanmi Koyejo, Pontus Stenetorp, Sharan Narang, Dieuwke Hupkes
Abstract summary: We measure variance in evaluation benchmarks, including seed variance across initialisations and monotonicity during training. We find that simple changes, such as framing choice tasks as completion tasks, can often reduce variance for smaller-scale models. More involved methods inspired by the human testing literature (such as item analysis and item response theory) struggle to meaningfully reduce variance.
Score: 34.12254884944099
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluation benchmarks are the cornerstone of measuring capabilities of large language models (LLMs), as well as driving progress in said capabilities. Originally designed to make claims about capabilities (or lack thereof) in fully pretrained models, evaluation benchmarks are now also extensively used to decide between various training choices. Despite this widespread usage, we rarely quantify the variance in our evaluation benchmarks, which dictates whether differences in performance are meaningful. Here, we define and measure a range of metrics geared towards measuring variance in evaluation benchmarks, including seed variance across initialisations, and monotonicity during training. By studying a large number of models -- both openly available and pretrained from scratch -- we provide empirical estimates for a variety of variance metrics, with considerations and recommendations for practitioners. We also evaluate the utility and tradeoffs of continuous versus discrete performance measures and explore options for better understanding and reducing this variance. We find that simple changes, such as framing choice tasks (like MMLU) as completion tasks, can often reduce variance for smaller scale ($\sim$7B) models, while more involved methods inspired from human testing literature (such as item analysis and item response theory) struggle to meaningfully reduce variance. Overall, our work provides insights into variance in evaluation benchmarks, suggests LM-specific techniques to reduce variance, and more generally encourages practitioners to carefully factor in variance when comparing models.
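
The two quantities named in the abstract, seed variance and monotonicity during training, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of the sample standard deviation as the spread measure, the use of Kendall's tau as the monotonicity measure, and all numbers below are assumptions made for illustration.

```python
import statistics
from itertools import combinations


def seed_std(final_scores):
    """Sample standard deviation of one benchmark score across training seeds."""
    return statistics.stdev(final_scores)


def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length sequences.

    Used here as an assumed stand-in for "monotonicity": a value of 1.0
    means the score never decreases as training proceeds.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical final accuracies of five runs that differ only in the seed.
final_scores = [0.412, 0.398, 0.425, 0.405, 0.418]

# Hypothetical accuracy of a single run at successive checkpoints.
steps = [1000, 2000, 3000, 4000, 5000]
curve = [0.26, 0.31, 0.30, 0.36, 0.40]

spread = seed_std(final_scores)
monotonicity = kendall_tau(steps, curve)
print(f"seed std: {spread:.4f}, monotonicity (tau): {monotonicity:.2f}")
```

Under these assumptions, a gap between two models that is smaller than the seed standard deviation would be hard to distinguish from noise, which is the kind of comparison the abstract asks practitioners to factor in.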

Related papers

From tests to effect sizes: Quantifying uncertainty and statistical variability in multilingual and multitask NLP evaluation benchmarks [11.85366307281236]
We show how experimental variation in performance scores arises from both model- and data-related sources. We also demonstrate how resampling methods are useful for computing sampling distributions for various quantities used in leaderboards.
arXiv Detail & Related papers (2025-09-26T17:37:55Z)
Fluid Language Model Benchmarking [126.92394365620525]
We introduce Fluid Benchmarking, a new evaluation approach that advances LM benchmarking across multiple dimensions. Inspired by psychometrics, Fluid Benchmarking is based on the insight that the relative value of benchmark items depends on an LM's capability level. We examine four dimensions -- efficiency, validity, variance, and saturation -- and find that Fluid Benchmarking achieves superior performance in all of them.
arXiv Detail & Related papers (2025-09-14T05:49:42Z)
Absolute Evaluation Measures for Machine Learning: A Survey [0.0]
This survey provides an overview of absolute evaluation metrics in Machine Learning. It is organized by the type of learning problem and covers clustering, regression, and ranking metrics. It aims to equip practitioners with the tools necessary to select appropriate metrics for their models.
arXiv Detail & Related papers (2025-07-04T08:53:08Z)
Practical Improvements of A/B Testing with Off-Policy Estimation [51.25970890274447]
We introduce a family of unbiased off-policy estimators that achieves lower variance than the standard approach. Our theoretical analysis and experimental results validate the effectiveness and practicality of the proposed method.
arXiv Detail & Related papers (2025-06-12T13:11:01Z)
Quantifying Uncertainty and Variability in Machine Learning: Confidence Intervals for Quantiles in Performance Metric Distributions [0.17265013728931003]
Machine learning models are widely used in applications where reliability and robustness are critical. Model evaluation often relies on single-point estimates of performance metrics that fail to capture the inherent variability in model performance. This contribution explores the use of quantiles and confidence intervals to analyze such distributions, providing a more complete understanding of model performance and its uncertainty.
arXiv Detail & Related papers (2025-01-28T13:21:34Z)
Area under the ROC Curve has the Most Consistent Evaluation for Binary Classification [3.1850615666574806]
This study investigates how consistent different metrics are at evaluating models across data of different prevalence. I find that evaluation metrics that are less influenced by prevalence offer more consistent evaluation of individual models and more consistent ranking of a set of models.
arXiv Detail & Related papers (2024-08-19T17:52:38Z)
OLMES: A Standard for Language Model Evaluations [64.85905119836818]
We propose OLMES, a practical, open standard for reproducible language model evaluations. We identify and review the varying factors in evaluation practices adopted by the community. OLMES supports meaningful comparisons between smaller base models that require the unnatural "cloze" formulation of multiple-choice questions.
arXiv Detail & Related papers (2024-06-12T17:37:09Z)
Beyond ELBOs: A Large-Scale Evaluation of Variational Methods for Sampling [14.668634411361307]
We introduce a benchmark that evaluates sampling methods using a standardized task suite and a broad range of performance criteria. We study existing metrics for quantifying mode collapse and introduce novel metrics for this purpose.
arXiv Detail & Related papers (2024-06-11T16:23:33Z)
Rethinking Classifier Re-Training in Long-Tailed Recognition: A Simple Logits Retargeting Approach [102.0769560460338]
We develop a simple logits retargeting approach (LORT) that does not require prior knowledge of the number of samples per class. Our method achieves state-of-the-art performance on various imbalanced datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist 2018.
arXiv Detail & Related papers (2024-03-01T03:27:08Z)
Calibration-then-Calculation: A Variance Reduced Metric Framework in Deep Click-Through Rate Prediction Models [16.308958212406583]
There is a lack of focus on evaluating the performance of deep learning pipelines. With the increased use of large datasets and complex models, the training process is run only once and the result is compared to previous benchmarks. Traditional solutions, such as running the training process multiple times, are often infeasible due to computational constraints. We introduce a novel metric framework, the Calibrated Loss Metric, designed to address this issue by reducing the variance present in its conventional counterpart.
arXiv Detail & Related papers (2024-01-30T02:38:23Z)
F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [102.98899881389211]
We propose F-Eval, a bilingual evaluation benchmark for evaluating fundamental abilities, including expression, commonsense, and logic. For reference-free subjective tasks, we devise new evaluation methods that serve as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z)
Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity. To assess model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs. We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
Accounting for multiplicity in machine learning benchmark performance [0.0]
The highest-ranked performance is a biased estimator of state-of-the-art (SOTA) performance, giving overly optimistic results. In this article, we provide a probability distribution for the case of multiple classifiers so that known analysis methods can be applied and a better SOTA estimate can be provided.
arXiv Detail & Related papers (2023-03-10T10:32:18Z)
A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are. Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task. The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them. By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.