Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation
- URL: http://arxiv.org/abs/2508.13144v1
- Date: Mon, 18 Aug 2025 17:56:04 GMT
- Title: Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation
- Authors: David Heineman, Valentin Hofmann, Ian Magnusson, Yuling Gu, Noah A. Smith, Hannaneh Hajishirzi, Kyle Lo, Jesse Dodge
- Abstract summary: We introduce two key metrics that show differences in current benchmarks. We demonstrate that benchmarks with a better signal-to-noise ratio are more reliable when making decisions at small scale. We conclude by recommending that those creating new benchmarks, or selecting which existing benchmarks to use, aim for high signal and low noise.
- Score: 103.66549325018741
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Developing large language models is expensive and involves making decisions with small experiments, typically by evaluating on large, multi-task evaluation suites. In this work, we analyze specific properties which make a benchmark more reliable for such decisions, and interventions to design higher-quality evaluation benchmarks. We introduce two key metrics that show differences in current benchmarks: signal, a benchmark's ability to separate better models from worse models, and noise, a benchmark's sensitivity to random variability between training steps. We demonstrate that benchmarks with a better signal-to-noise ratio are more reliable when making decisions at small scale, and those with less noise have lower scaling law prediction error. These results suggest that improving signal or noise will lead to more useful benchmarks, so we introduce three interventions designed to directly affect signal or noise. For example, we propose that switching to a metric that has better signal and noise (e.g., perplexity rather than accuracy) leads to better reliability and improved scaling law error. We also find that filtering noisy subtasks, to improve an aggregate signal-to-noise ratio, leads to more reliable multi-task evaluations. We also find that averaging the output of a model's intermediate checkpoints to reduce noise leads to consistent improvements. We conclude by recommending that those creating new benchmarks, or selecting which existing benchmarks to use, aim for high signal and low noise. We use 30 benchmarks for these experiments, and 375 open-weight language models from 60M to 32B parameters, resulting in a new, publicly available dataset of 900K evaluation benchmark results, totaling 200M instances.
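As a rough illustration of the signal and noise ideas described in the abstract, here is a minimal Python sketch. The formulas are assumptions chosen for illustration, not the paper's definitions: signal is taken as the spread of final scores across models relative to their mean, and noise as the standard deviation of one model's scores over its last few checkpoints relative to their mean. The function names, the subtask names, the scores, and the threshold are all hypothetical.

```python
import statistics


def signal(final_scores):
    """Assumed signal proxy: range of final scores across models, divided by their mean."""
    mean = statistics.fmean(final_scores)
    return (max(final_scores) - min(final_scores)) / mean


def noise(checkpoint_scores):
    """Assumed noise proxy: population std of one model's scores over
    consecutive checkpoints, divided by their mean."""
    mean = statistics.fmean(checkpoint_scores)
    return statistics.pstdev(checkpoint_scores) / mean


def snr(final_scores, checkpoint_scores):
    """Signal-to-noise ratio under the two assumed proxies above.
    Raises ZeroDivisionError if the checkpoint scores are all identical."""
    return signal(final_scores) / noise(checkpoint_scores)


def keep_high_snr(subtasks, threshold):
    """Return the names of subtasks whose SNR meets a (hypothetical) threshold."""
    return [
        name
        for name, (finals, checkpoints) in subtasks.items()
        if snr(finals, checkpoints) >= threshold
    ]


# Toy, made-up scores: (final scores across models, one model's checkpoint scores).
subtasks = {
    "task_a": ([0.40, 0.50, 0.60], [0.49, 0.50, 0.51]),  # models spread out, checkpoints stable
    "task_b": ([0.49, 0.50, 0.51], [0.40, 0.50, 0.60]),  # models bunched, checkpoints unstable
}

kept = keep_high_snr(subtasks, threshold=1.0)
```

With these toy numbers, `task_a` separates models far more than it fluctuates between checkpoints, so it is kept, while `task_b` is filtered out. This mirrors the subtask-filtering intervention only loosely; the threshold of 1.0 is arbitrary.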
Related papers
- AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition [72.24180896265192]
We introduce AgentNoiseBench, a framework for evaluating robustness of agentic models under noisy environments.
We first conduct an in-depth analysis of biases and uncertainties in real-world scenarios.
We then categorize environmental noise into two primary types: user-noise and tool-noise.
Building on this analysis, we develop an automated pipeline that injects controllable noise into existing agent-centric benchmarks.
arXiv Detail & Related papers (2026-02-11T20:33:10Z)
- Evaluating LLMs When They Do Not Know the Answer: Statistical Evaluation of Mathematical Reasoning via Comparative Signals [18.612081365101464]
We develop a framework that combines standard labeled outcomes with pairwise comparison signals obtained by having models judge auxiliary reasoning chains.
Across simulations, our one-step estimator substantially improves ranking accuracy with gains increasing as model output noise grows.
Experiments on GPQA Diamond, AIME 2025 and GSM8K further demonstrate more precise performance estimation and more reliable model rankings.
arXiv Detail & Related papers (2026-02-03T03:40:01Z)
- Learning More from Less: Unlocking Internal Representations for Benchmark Compression [37.69575776639016]
We introduce REPCORE, which aligns heterogeneous hidden states into a unified latent space to construct representative coresets.
Experiments on five benchmarks and over 200 models show consistent gains over output-based baselines in ranking correlation and estimation accuracy.
arXiv Detail & Related papers (2026-01-31T13:11:39Z)
- The Flaw of Averages: Quantifying Uniformity of Performance on Benchmarks [32.00464870277127]
We study benchmark reliability from a distributional perspective and introduce benchmark harmony.
High harmony is a desirable benchmark property, indicating that the aggregate metric reflects uniform competence across models.
By recommending that harmony should be reported alongside accuracy, we reframe evaluation from simple performance averages to a more robust, distributionally reliable measurement of performance.
arXiv Detail & Related papers (2025-09-30T02:14:30Z)
- Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination [67.67725938962798]
Pre-training on massive web-scale corpora leaves Qwen2.5 susceptible to data contamination in widely used benchmarks.
We introduce a generator that creates fully clean arithmetic problems of arbitrary length and difficulty, dubbed RandomCalculation.
We show that only accurate reward signals yield steady improvements that surpass the base model's performance boundary.
arXiv Detail & Related papers (2025-07-14T17:55:15Z)
- Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis [10.133537818749291]
Large language models (LLMs) have demonstrated significant utilities in real-world applications.
Benchmark evaluations are crucial for assessing the capabilities of LLMs.
arXiv Detail & Related papers (2025-02-13T03:43:33Z)
- Negative Pre-aware for Noisy Cross-modal Matching [46.5591267410225]
Cross-modal noise-robust learning is a challenging task since noisy correspondence is hard to recognize and rectify.
We present a novel Negative Pre-aware Cross-modal matching solution for large visual-language model fine-tuning on noisy downstream tasks.
arXiv Detail & Related papers (2023-12-10T05:52:36Z)
- Efficient Benchmarking of Language Models [22.696230279151166]
We present the problem of Efficient Benchmarking, namely, intelligently reducing the costs of LM evaluation without compromising reliability.
Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the computation-reliability trade-off.
We propose an evaluation algorithm that, when applied to the HELM benchmark, leads to dramatic cost savings with minimal loss of benchmark reliability.
arXiv Detail & Related papers (2023-08-22T17:59:30Z)
- Reducing Variance in Temporal-Difference Value Estimation via Ensemble of Deep Networks [109.59988683444986]
MeanQ is a simple ensemble method that estimates target values as ensemble means.
We show that MeanQ achieves remarkable sample efficiency in experiments on the Atari Learning Environment benchmark.
arXiv Detail & Related papers (2022-09-16T01:47:36Z)
- Analyzing the Impact of Undersampling on the Benchmarking and Configuration of Evolutionary Algorithms [3.967483941966979]
We show that care should be taken when making decisions based on limited data.
We show examples of performance losses of more than 20%, even when using statistical races to dynamically adjust the number of runs.
arXiv Detail & Related papers (2022-04-20T09:53:59Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Learning based signal detection for MIMO systems with unknown noise statistics [84.02122699723536]
This paper aims to devise a generalized maximum likelihood (ML) estimator to robustly detect signals with unknown noise statistics.
In practice, there is little or even no statistical knowledge on the system noise, which in many cases is non-Gaussian, impulsive and not analyzable.
Our framework is driven by an unsupervised learning approach, where only the noise samples are required.
arXiv Detail & Related papers (2021-01-21T04:48:15Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.