When LLMs get significantly worse: A statistical approach to detect model degradations
- URL: http://arxiv.org/abs/2602.10144v1
- Date: Mon, 09 Feb 2026 10:45:13 GMT
- Title: When LLMs get significantly worse: A statistical approach to detect model degradations
- Authors: Jonas Kübler, Kailash Budhathoki, Matthäus Kleindessner, Xiong Zhou, Junming Yin, Ashish Khetan, George Karypis
- Abstract summary: Minimizing the inference cost and latency of foundation models has become a crucial area of research. We propose a statistically sound hypothesis testing framework based on McNemar's test that efficiently detects model degradations. We find that with our tests even empirical accuracy degradations of 0.3% can be confidently attributed to actual degradations rather than noise.
- Score: 33.63321816712603
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Minimizing the inference cost and latency of foundation models has become a crucial area of research. Optimization approaches include theoretically lossless methods as well as methods without accuracy guarantees, such as quantization. In all of these cases it is crucial to ensure that the model quality has not degraded. However, even at temperature zero, model generations are not necessarily robust even to theoretically lossless model optimizations, due to numerical errors. We thus require statistical tools to decide whether a finite-sample accuracy deviation is evidence of a model's degradation or whether it can be attributed to (harmless) noise in the evaluation. We propose a statistically sound hypothesis testing framework based on McNemar's test that efficiently detects model degradations while guaranteeing a controlled rate of false positives. The crucial insight is that we have to compare the model scores on each individual sample, rather than aggregated at the task level. Furthermore, we propose three approaches to aggregate accuracy estimates across multiple benchmarks into a single decision. We provide an implementation on top of the widely adopted open-source LM Evaluation Harness and present a case study illustrating that the method correctly flags degraded models, while not flagging model optimizations that are provably lossless. We find that with our tests even empirical accuracy degradations of 0.3% can be confidently attributed to actual degradations rather than noise.
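As a concrete illustration of the paired-sample idea, the following is a minimal sketch, not the paper's reference implementation: score the baseline and the optimized model on the same benchmark items, count the discordant pairs, and run a one-sided exact McNemar (binomial) test for degradation. The function names, the significance level, and the Bonferroni-style aggregation across benchmarks are illustrative assumptions made here; the paper proposes its own three aggregation approaches and ships an implementation on top of LM Evaluation Harness.

```python
import math
from typing import Sequence


def mcnemar_degradation_pvalue(baseline_correct: Sequence[bool],
                               optimized_correct: Sequence[bool]) -> float:
    """One-sided exact McNemar test: is the optimized model worse than the baseline?

    Operates on paired per-sample correctness (same benchmark items for both
    models) and looks only at the discordant pairs.
    """
    assert len(baseline_correct) == len(optimized_correct)
    # b: baseline correct, optimized wrong  -> evidence of degradation
    # c: baseline wrong, optimized correct  -> evidence of improvement
    b = sum(x and not y for x, y in zip(baseline_correct, optimized_correct))
    c = sum(y and not x for x, y in zip(baseline_correct, optimized_correct))
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of any change
    # Under H0 (no degradation) each discordant pair flips either way with
    # probability 1/2; exact binomial tail of observing >= b "got worse" flips.
    return sum(math.comb(n, k) for k in range(b, n + 1)) / 2 ** n


def flag_degradation(per_benchmark_pvalues: Sequence[float],
                     alpha: float = 0.05) -> bool:
    """Aggregate per-benchmark tests into one decision.

    Bonferroni correction is used here purely as a simple illustration of
    multiple-testing control; it is not necessarily one of the paper's
    three aggregation approaches.
    """
    m = len(per_benchmark_pvalues)
    return any(p < alpha / m for p in per_benchmark_pvalues)
```

Note that the test consumes the per-sample (instance-level) outcomes of both models, which is exactly why task-level aggregate accuracies alone do not suffice.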
Related papers
- Towards Anytime-Valid Statistical Watermarking [63.02116925616554]
We develop the first e-value-based watermarking framework, Anchored E-Watermarking, that unifies optimal sampling with anytime-valid inference. Our framework can significantly enhance sample efficiency, reducing the average token budget required for detection by 13-15% relative to state-of-the-art baselines.
arXiv Detail & Related papers (2026-02-19T18:32:26Z) - Uncertainty Weighted Gradients for Model Calibration [22.39558434131574]
Deep networks often produce over-confident or under-confident predictions, leading to miscalibration. We propose a unified loss framework for focal loss and its variants, where we mainly attribute their superiority in model calibration to the loss weighting factor. Our method achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2025-03-26T04:16:05Z) - Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers [13.823743787003787]
Recent research has generated hope that inference scaling could allow weaker language models to match or exceed the accuracy of stronger models. We show that no amount of inference scaling of weaker models can enable them to match the single-sample accuracy of a sufficiently strong model. We also show that beyond accuracy, false positives may have other undesirable qualities, such as poor adherence to coding style conventions.
arXiv Detail & Related papers (2024-11-26T15:13:06Z) - A Probabilistic Perspective on Unlearning and Alignment for Large Language Models [48.96686419141881]
We introduce the first formal probabilistic evaluation framework for Large Language Models (LLMs). Namely, we propose novel metrics with high-probability guarantees concerning the output distribution of a model. Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment.
arXiv Detail & Related papers (2024-10-04T15:44:23Z) - Source-Free Domain-Invariant Performance Prediction [68.39031800809553]
We propose a source-free approach centred on uncertainty-based estimation, using a generative model for calibration in the absence of source data.
Our experiments on benchmark object recognition datasets reveal that existing source-based methods fall short with limited source sample availability.
Our approach significantly outperforms the current state-of-the-art source-free and source-based methods, affirming its effectiveness in domain-invariant performance estimation.
arXiv Detail & Related papers (2024-08-05T03:18:58Z) - Uncertainty-Calibrated Test-Time Model Adaptation without Forgetting [65.21599711087538]
Test-time adaptation (TTA) seeks to tackle potential distribution shifts between training and test data by adapting a given model w.r.t. any test sample. Prior methods perform backpropagation for each test sample, resulting in unbearable optimization costs for many applications. We propose an Efficient Anti-Forgetting Test-Time Adaptation (EATA) method which develops an active sample selection criterion to identify reliable and non-redundant samples.
arXiv Detail & Related papers (2024-03-18T05:49:45Z) - ALUM: Adversarial Data Uncertainty Modeling from Latent Model Uncertainty Compensation [25.67258563807856]
We propose a novel method called ALUM to handle the model uncertainty and data uncertainty in a unified scheme.
Our proposed ALUM is model-agnostic which can be easily implemented into any existing deep model with little extra overhead.
arXiv Detail & Related papers (2023-03-29T17:24:12Z) - Reliability-Aware Prediction via Uncertainty Learning for Person Image Retrieval [51.83967175585896]
UAL aims at providing reliability-aware predictions by considering data uncertainty and model uncertainty simultaneously.
Data uncertainty captures the "noise" inherent in the sample, while model uncertainty depicts the model's confidence in the sample's prediction.
arXiv Detail & Related papers (2022-10-24T17:53:20Z) - Monitoring Model Deterioration with Explainable Uncertainty Estimation via Non-parametric Bootstrap [0.0]
Monitoring machine learning models once they are deployed is challenging.
It is even more challenging to decide when to retrain models in real-case scenarios when labeled data is beyond reach.
In this work, we use non-parametric bootstrapped uncertainty estimates and SHAP values to provide explainable uncertainty estimation.
arXiv Detail & Related papers (2022-01-27T17:23:04Z) - Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold (a rough sketch follows this list).
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
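For the ATC entry above, a rough illustrative sketch of the idea, not the authors' implementation (the function and variable names are assumptions made here): choose the confidence threshold on labeled source data so that the fraction of source examples above it matches the source accuracy, then report the fraction of unlabeled target examples clearing the same threshold as the predicted target accuracy.

```python
import numpy as np


def atc_predict_accuracy(source_conf, source_correct, target_conf):
    """Average Thresholded Confidence (ATC), roughly sketched.

    source_conf:    per-example confidences on labeled source data
    source_correct: per-example 0/1 correctness on labeled source data
    target_conf:    per-example confidences on unlabeled target data
    """
    source_conf = np.asarray(source_conf, dtype=float)
    target_conf = np.asarray(target_conf, dtype=float)
    source_acc = float(np.mean(source_correct))
    # Threshold at the source error-rate quantile, so that the fraction of
    # source confidences above it approximately equals the source accuracy.
    threshold = np.quantile(source_conf, 1.0 - source_acc)
    return float(np.mean(target_conf > threshold))
```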