The Flaw of Averages: Quantifying Uniformity of Performance on Benchmarks
- URL: http://arxiv.org/abs/2509.25671v1
- Date: Tue, 30 Sep 2025 02:14:30 GMT
- Title: The Flaw of Averages: Quantifying Uniformity of Performance on Benchmarks
- Authors: Arda Uzunoglu, Tianjian Li, Daniel Khashabi
- Abstract summary: We study benchmark reliability from a distributional perspective and introduce benchmark harmony.
High harmony is a desirable benchmark property, indicating that the aggregate metric reflects uniform competence across subdomains.
By recommending that harmony should be reported alongside accuracy, we reframe evaluation from simple performance averages to a more robust, distributionally reliable measurement of performance.
- Score: 32.00464870277127
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Benchmarks shape scientific conclusions about model capabilities and steer model development. This creates a feedback loop: stronger benchmarks drive better models, and better models demand more discriminative benchmarks. Ensuring benchmark reliability is therefore essential for trustworthy evaluation and meaningful progress. In this work, we study benchmark reliability from a distributional perspective and introduce benchmark harmony, which measures how uniformly a model's performance is distributed across the subdomains of a benchmark. We posit that high harmony is a desirable benchmark property, indicating that the aggregate metric reflects uniform competence across subdomains. Across 19 multiple-choice benchmarks and five model families, we map each benchmark onto a mean-variance plane of harmony computed across models, where high mean and low variance signal more reliable evaluation. Our analysis shows that less harmonious benchmarks can give misleading results, since overall accuracy may be disproportionately influenced by specific subdomains. For instance, ARC-Easy is overwhelmed by questions on Biological Concepts, overshadowing other critical subdomains such as Geography, Physics, Chemistry, and Environmental Science. By recommending that harmony should be reported alongside accuracy, we reframe evaluation from simple performance averages to a more robust, distributionally reliable measurement of performance.
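Note: the abstract does not spell out how harmony is computed. The sketch below is a minimal illustration, assuming an entropy-style uniformity score over a model's per-subdomain accuracies; the function names and the normalized-entropy formula are this sketch's assumptions, not the paper's actual metric.

```python
import numpy as np

def harmony(subdomain_accs):
    """Uniformity of per-subdomain accuracies, in [0, 1].

    Illustrative stand-in for the paper's harmony metric: the normalized
    entropy of accuracies rescaled to a probability distribution. A model
    that is equally competent on every subdomain scores 1; a model whose
    accuracy is concentrated in a few subdomains scores lower.
    """
    accs = np.asarray(subdomain_accs, dtype=float)
    if accs.sum() == 0:
        return 0.0
    p = accs / accs.sum()
    p = p[p > 0]                        # treat 0 * log(0) as 0
    entropy = -(p * np.log(p)).sum()
    return entropy / np.log(len(accs))  # normalize by max possible entropy

def mean_variance_point(per_model_accs):
    """Map a benchmark to the mean-variance plane of harmony across models."""
    scores = [harmony(a) for a in per_model_accs]
    return np.mean(scores), np.var(scores)

# Per-subdomain accuracies for three hypothetical models on one benchmark.
models = [
    [0.90, 0.88, 0.91, 0.89],  # uniform competence -> high harmony
    [0.95, 0.40, 0.35, 0.30],  # one dominant subdomain -> low harmony
    [0.70, 0.72, 0.68, 0.71],
]
mean_h, var_h = mean_variance_point(models)
print(f"harmony mean={mean_h:.3f}, variance={var_h:.4f}")
```

Under this reading, a benchmark whose models cluster at high mean harmony with low variance is the one whose aggregate accuracy can be trusted as a summary of uniform competence.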
Related papers
- IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation [85.56193980646981]
We propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following.
For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses.
Experiments on IF-RewardBench reveal significant deficiencies in current judge models.
arXiv Detail & Related papers (2026-03-05T02:21:17Z)
- When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation [80.66788281323414]
We analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers.
Our analysis reveals that nearly half of the benchmarks exhibit saturation, with rates increasing as benchmarks age.
Expert-curated benchmarks resist saturation better than crowdsourced ones.
arXiv Detail & Related papers (2026-02-18T16:51:37Z)
- Benchmark^2: Systematic Evaluation of LLM Benchmarks [66.2731798872668]
We propose Benchmark^2, a comprehensive framework comprising three complementary metrics.
We conduct experiments across 15 benchmarks spanning mathematics, reasoning, and knowledge domains.
Our analysis reveals significant quality variations among existing benchmarks and demonstrates that selective benchmark construction can achieve comparable evaluation performance.
arXiv Detail & Related papers (2026-01-07T14:59:03Z)
- How NOT to benchmark your SITE metric: Beyond Static Leaderboards and Towards Realistic Evaluation [11.33816414982401]
Transferability estimation metrics are used to find a high-performing pre-trained model for a given target task.
Despite the growing interest in developing such metrics, the benchmarks used to measure their progress have gone largely unexamined.
We argue that the benchmarks on which these metrics are evaluated are fundamentally flawed.
arXiv Detail & Related papers (2025-10-07T20:38:12Z)
- The Lie of the Average: How Class Incremental Learning Evaluation Deceives You? [48.83567710215299]
Class Incremental Learning (CIL) requires models to continuously learn new classes without forgetting previously learned ones.
We argue that a robust CIL evaluation protocol should accurately characterize and estimate the entire performance distribution.
We propose EDGE, an evaluation protocol that adaptively identifies and samples extreme class sequences using inter-task similarity.
arXiv Detail & Related papers (2025-09-26T17:00:15Z)
- Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation [103.66549325018741]
We introduce two key metrics that expose differences among current benchmarks.
We demonstrate that benchmarks with a better signal-to-noise ratio are more reliable when making decisions at small scale.
We conclude by recommending that those creating new benchmarks, or selecting which existing benchmarks to use, aim for high signal and low noise.
arXiv Detail & Related papers (2025-08-18T17:56:04Z)
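Note: the summary above does not define the two metrics. Below is a minimal sketch, assuming signal is the spread of per-model mean scores and noise is the within-model variability across repeated evaluation runs; the paper's exact definitions may differ.

```python
import numpy as np

def signal_to_noise(scores):
    """scores: array of shape (n_models, n_runs) of benchmark scores.

    Assumed reading: signal = dispersion of per-model mean scores (how
    well the benchmark separates models); noise = typical run-to-run
    variability within a model. A higher ratio suggests model rankings
    on the benchmark are more trustworthy.
    """
    scores = np.asarray(scores, dtype=float)
    signal = np.std(scores.mean(axis=1))      # spread across models
    noise = np.mean(np.std(scores, axis=1))   # average within-model spread
    return signal / noise

# Three hypothetical models, four evaluation runs each.
scores = [[0.61, 0.63, 0.60, 0.62],
          [0.70, 0.71, 0.69, 0.72],
          [0.80, 0.79, 0.81, 0.80]]
print(f"SNR = {signal_to_noise(scores):.2f}")
```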
- RewardBench 2: Advancing Reward Model Evaluation [71.65938693914153]
Reward models are used throughout the post-training of language models to capture nuanced signals from preference data.
The community has begun establishing best practices for evaluating reward models.
This paper introduces RewardBench 2, a new multi-skill reward modeling benchmark.
arXiv Detail & Related papers (2025-06-02T17:54:04Z)
- Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory [44.886213907135435]
We propose the Pseudo-Siamese Network for Item Response Theory (PSN-IRT), which yields accurate and reliable estimations of item characteristics and model abilities.
arXiv Detail & Related papers (2025-05-21T03:24:11Z)
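Note: for readers unfamiliar with Item Response Theory, the sketch below shows the standard two-parameter logistic (2PL) response model that IRT-based benchmark analyses build on. PSN-IRT estimates these quantities with a pseudo-siamese network; the closed form here is the textbook model, not the paper's estimator.

```python
import numpy as np

def p_correct(theta, a, b):
    """Two-parameter logistic (2PL) IRT model.

    theta: model ability; a: item discrimination (how sharply the item
    separates weak from strong models); b: item difficulty. Returns the
    probability that a model of ability theta answers the item correctly.
    """
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# A discriminative item (a=2.0) vs. a noisy one (a=0.3), both of difficulty 0.
for ability in (-1.0, 0.0, 1.0):
    print(f"ability={ability:+.1f}  "
          f"discriminative={p_correct(ability, a=2.0, b=0.0):.3f}  "
          f"noisy={p_correct(ability, a=0.3, b=0.0):.3f}")
```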
- BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices [28.70453947993952]
We develop an assessment framework considering 46 best practices across an AI benchmark's lifecycle and evaluate 24 AI benchmarks against it.
We find large quality differences across benchmarks, and that commonly used benchmarks suffer from significant issues.
arXiv Detail & Related papers (2024-11-20T02:38:24Z)
- A Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attributions [60.06461883533697]
We first identify a set of fidelity criteria that reliable benchmarks for attribution methods are expected to fulfill.
We then introduce a Backdoor-based eXplainable AI benchmark (BackX) that adheres to the desired fidelity criteria.
Our analysis also offers insights into defending against neural Trojans by utilizing the attributions.
arXiv Detail & Related papers (2024-05-02T13:48:37Z)
- GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show that GREAT Score correlates highly with, and costs significantly less than, attack-based model ranking on RobustBench.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z)
- How not to Lie with a Benchmark: Rearranging NLP Leaderboards [0.0]
We examine popular NLP benchmarks' overall scoring methods and rearrange the models by geometric and harmonic mean.
We analyze several popular benchmarks including GLUE, SuperGLUE, XGLUE, and XTREME.
arXiv Detail & Related papers (2021-12-02T15:40:52Z)
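Note: as a concrete illustration of the re-ranking idea above (an editorial example with hypothetical scores, not the paper's code), aggregating the same per-task scores with arithmetic, geometric, and harmonic means can order models differently, because the latter two penalize weak outlier tasks more heavily.

```python
from statistics import fmean, geometric_mean, harmonic_mean

# Per-task scores for two hypothetical models on a four-task benchmark.
model_a = [0.90, 0.90, 0.90, 0.30]  # strong on average, one weak task
model_b = [0.75, 0.75, 0.75, 0.75]  # uniformly decent

for name, scores in (("A", model_a), ("B", model_b)):
    print(name,
          f"arithmetic={fmean(scores):.3f}",
          f"geometric={geometric_mean(scores):.3f}",
          f"harmonic={harmonic_mean(scores):.3f}")
```

Here the arithmetic mean ties the two models at 0.75, while the geometric and harmonic means rank the uniformly competent model B above model A.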
This list is automatically generated from the titles and abstracts of the papers in this site.