Related papers: Measuring all the noises of LLM Evals

URL: http://arxiv.org/abs/2512.21326v1
Date: Wed, 24 Dec 2025 18:54:37 GMT
Title: Measuring all the noises of LLM Evals
Authors: Sida Wang
Abstract summary: We define and measure three types of noise: prediction noise from generating different answers on a given question, data noise from sampling questions, and their combined total noise following the law of total variance. We propose the all-pairs paired method, which applies the paired analysis to all pairs of LLMs and measures all the noise components based on millions of question-level predictions.
Score: 3.2452410034214303
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Separating signal from noise is central to experimental science. Applying well-established statistical methods effectively to LLM evals requires consideration of their unique noise characteristics. We clearly define and measure three types of noise: prediction noise from generating different answers on a given question, data noise from sampling questions, and their combined total noise following the law of total variance. To emphasize relative comparisons and gain statistical power, we propose the all-pairs paired method, which applies the paired analysis to all pairs of LLMs and measures all the noise components based on millions of question-level predictions across many evals and settings. These measurements revealed clear patterns. First, each eval exhibits a characteristic and highly predictable total noise level across all model pairs. Second, paired prediction noise typically exceeds paired data noise, which means reducing prediction noise by averaging can significantly increase statistical power. These findings enable practitioners to assess significance without custom testing and to detect much smaller effects in controlled experiments.
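
Illustrative sketch (not from the paper): the decomposition the abstract describes can be checked numerically with the law of total variance, Var(D) = E[Var(D | q)] + Var(E[D | q]), where D is the score difference between two models on question q. The function names, the synthetic data, and the plain plug-in estimates below are all assumptions made for illustration. This is not the paper's all-pairs paired method, and the plug-in data term still contains some residual prediction noise when the number of samples per question is small.

```python
import random
from statistics import fmean, pvariance


def decompose_paired_noise(diffs):
    """Split the variance of paired score differences into two parts.

    diffs[q][k] is score_A - score_B on question q, sample k.
    Every question must have the same number of samples so that the
    identity total == prediction + data holds exactly.
    """
    per_question_means = [fmean(row) for row in diffs]
    # Prediction noise: average within-question variance, E[Var(D | q)].
    prediction_var = fmean([pvariance(row) for row in diffs])
    # Data noise: variance of per-question means, Var(E[D | q]).
    data_var = pvariance(per_question_means)
    # Total noise: variance over all (question, sample) pairs, Var(D).
    total_var = pvariance([d for row in diffs for d in row])
    return prediction_var, data_var, total_var


def simulate_diffs(n_questions=200, n_samples=8, seed=0):
    """Synthetic paired differences for two hypothetical models."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_questions):
        # Each model gets its own per-question probability of being correct.
        p_a, p_b = rng.random(), rng.random()
        row = [
            int(rng.random() < p_a) - int(rng.random() < p_b)
            for _ in range(n_samples)
        ]
        diffs.append(row)
    return diffs


if __name__ == "__main__":
    pred, data, total = decompose_paired_noise(simulate_diffs())
    print(f"prediction={pred:.4f} data={data:.4f} total={total:.4f}")
```

Averaging the samples within each question before comparing models shrinks the prediction term, which is the mechanism the abstract points to for gaining statistical power.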

Related papers

Bayesian inference of general noise-model parameters from the syndrome statistics of surface codes [0.0]
Noise model estimation based on syndrome measurement statistics is well-established for Pauli noise. We propose Bayesian inference methods for general noise models, integrating a tensor network simulator of surface code. We present numerical results of applying our proposed methods to various noise models, such as static, time-varying, and nonuniform cases.
arXiv Detail & Related papers (2024-06-13T10:26:04Z)
Impact of Noisy Supervision in Foundation Model Learning [91.56591923244943]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets. We propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
Understanding the Effect of Noise in LLM Training Data with Algorithmic Chains of Thought [0.0]
We study how noise in chain of thought impacts task performance in a highly controlled setting. We define two types of noise: static noise, a local form of noise which is applied after the CoT trace is computed, and dynamic noise, a global form of noise which propagates errors in the trace as it is computed. We find fine-tuned models are extremely robust to high levels of static noise but struggle significantly more with lower levels of dynamic noise.
arXiv Detail & Related papers (2024-02-06T13:59:56Z)
Optimizing the Noise in Self-Supervised Learning: from Importance Sampling to Noise-Contrastive Estimation [80.07065346699005]
It is widely assumed that the optimal noise distribution should be made equal to the data distribution, as in Generative Adversarial Networks (GANs). We turn to Noise-Contrastive Estimation, which grounds this self-supervised task as an estimation problem of an energy-based model of the data. We soberly conclude that the optimal noise may be hard to sample from, and the gain in efficiency can be modest compared to choosing the noise distribution equal to the data's.
arXiv Detail & Related papers (2023-01-23T19:57:58Z)
The Optimal Noise in Noise-Contrastive Learning Is Not What You Think [80.07065346699005]
We show that deviating from the common assumption that the noise distribution should equal the data distribution can actually lead to better statistical estimators. In particular, the optimal noise distribution is different from the data's and even from a different family.
arXiv Detail & Related papers (2022-03-02T13:59:20Z)
Label noise detection under the Noise at Random model with ensemble filters [5.994719700262245]
This work investigates the performance of ensemble noise detection under two different noise models. We investigate the effect of class distribution on noise detection performance since it changes the total noise level observed in a dataset.
arXiv Detail & Related papers (2021-12-02T21:49:41Z)
Rethinking Noise Synthesis and Modeling in Raw Denoising [75.55136662685341]
We introduce a new perspective to synthesize noise by directly sampling from the sensor's real noise. It inherently generates accurate raw image noise for different camera sensors.
arXiv Detail & Related papers (2021-10-10T10:45:24Z)
Adaptive Multi-View ICA: Estimation of noise levels for optimal inference [65.94843987207445]
Adaptive multiView ICA (AVICA) is a noisy ICA model where each view is a linear mixture of shared independent sources with additive noise on the sources. On synthetic data, AVICA yields better source estimates than other group ICA methods thanks to its explicit MMSE estimator. On real magnetoencephalography (MEG) data, we provide evidence that the decomposition is less sensitive to sampling noise and that the noise variance estimates are biologically plausible.
arXiv Detail & Related papers (2021-02-22T13:10:12Z)
Learning based signal detection for MIMO systems with unknown noise statistics [84.02122699723536]
This paper aims to devise a generalized maximum likelihood (ML) estimator to robustly detect signals with unknown noise statistics. In practice, there is little or even no statistical knowledge on the system noise, which in many cases is non-Gaussian, impulsive and not analyzable. Our framework is driven by an unsupervised learning approach, where only the noise samples are required.
arXiv Detail & Related papers (2021-01-21T04:48:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.