Related papers: Don't stop me now: Rethinking Validation Criteria for Model Parameter Selection

URL: http://arxiv.org/abs/2602.22107v1
Date: Wed, 25 Feb 2026 16:56:14 GMT
Title: Don't stop me now: Rethinking Validation Criteria for Model Parameter Selection
Authors: Andrea Apicella, Francesco Isgrò, Andrea Pollastro, Roberto Prevete
Abstract summary: We study how the validation criterion used for model selection affects test performance in neural classifiers. Early stopping based on validation accuracy performs worst, consistently selecting checkpoints with lower test accuracy. Loss-based validation criteria yield comparable and more stable test accuracy.
Score: 3.219880761967806
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite the extensive literature on training loss functions, the evaluation of generalization on the validation set remains underexplored. In this work, we conduct a systematic empirical and statistical study of how the validation criterion used for model selection affects test performance in neural classifiers, with attention to early stopping. Using fully connected networks on standard benchmarks under $k$-fold evaluation, we compare: (i) early stopping with patience and (ii) post-hoc selection over all epochs (i.e. no early stopping). Models are trained with cross-entropy, C-Loss, or PolyLoss; the model parameter selection on the validation set is made using accuracy or one of the three loss functions, each considered independently. Three main findings emerge. (1) Early stopping based on validation accuracy performs worst, consistently selecting checkpoints with lower test accuracy than both loss-based early stopping and post-hoc selection. (2) Loss-based validation criteria yield comparable and more stable test accuracy. (3) Across datasets and folds, any single validation rule often underperforms the test-optimal checkpoint. Overall, the selected model typically achieves test-set performance statistically lower than the best performance across all epochs, regardless of the validation criterion. Our results suggest avoiding validation accuracy (in particular with early stopping) for parameter selection, favoring loss-based validation criteria.
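
The two selection protocols compared in the abstract can be sketched as follows. This is only an illustrative sketch, not the authors' code: the function names, the exact patience rule (stop once `patience` consecutive epochs bring no strict improvement), and the per-epoch validation values are all assumptions made up for the example.

```python
from typing import Sequence


def early_stopping_epoch(values: Sequence[float], patience: int,
                         higher_is_better: bool) -> int:
    """Index of the checkpoint kept by early stopping with patience.

    Assumed rule: training halts once `patience` consecutive epochs
    pass without a strict improvement of the validation criterion.
    """
    sign = 1.0 if higher_is_better else -1.0
    best_epoch = 0
    best = sign * values[0]
    for epoch in range(1, len(values)):
        score = sign * values[epoch]
        if score > best:
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break  # later epochs are never seen
    return best_epoch


def post_hoc_epoch(values: Sequence[float], higher_is_better: bool) -> int:
    """Index of the best checkpoint over all epochs (no early stopping)."""
    sign = 1.0 if higher_is_better else -1.0
    return max(range(len(values)), key=lambda e: sign * values[e])


# Hypothetical per-epoch validation curves (made-up numbers).
val_acc = [0.70, 0.74, 0.74, 0.73, 0.74, 0.78]
val_loss = [0.90, 0.70, 0.65, 0.66, 0.64, 0.60]

print("early stopping, accuracy:", early_stopping_epoch(val_acc, 2, True))
print("early stopping, loss:    ", early_stopping_epoch(val_loss, 2, False))
print("post-hoc, accuracy:      ", post_hoc_epoch(val_acc, True))
print("post-hoc, loss:          ", post_hoc_epoch(val_loss, False))
```

In this toy curve the accuracy plateaus for a few epochs, so accuracy-based early stopping halts at an early checkpoint, while the loss keeps decreasing and the other three rules reach the final epoch. The numbers only illustrate the mechanics and are not evidence for the paper's findings.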

Related papers

PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering [71.15346406323827]
We introduce PRIME, a benchmark for evaluating verifiers on Process-Outcome Alignment verification. We find that current verifiers frequently fail to detect derivation flaws. We propose a process-aware RLVR training paradigm utilizing verifiers selected via PRIME.
arXiv Detail & Related papers (2026-02-12T04:45:01Z)
Uncertainty-Guided Checkpoint Selection for Reinforcement Finetuning of Large Language Models [27.97382399449914]
Reinforcement learning (RL) finetuning is crucial to aligning large language models (LLMs), but the process is notoriously unstable. In practice, selecting the best checkpoint is challenging: evaluating checkpoints on the validation set during training is computationally expensive and requires a good validation set. We introduce an uncertainty-guided approach for checkpoint selection (UGCS) that avoids these pitfalls.
arXiv Detail & Related papers (2025-11-13T01:46:58Z)
Suitability Filter: A Statistical Framework for Classifier Evaluation in Real-World Deployment Settings [33.080398349395686]
We propose a novel framework designed to detect performance deterioration by utilizing suitability signals. We aggregate suitability signals for both test and user data and compare these empirical distributions. This enables proactive mitigation of potential failures in high-stakes applications.
arXiv Detail & Related papers (2025-05-28T13:37:04Z)
Don't Waste Your Time: Early Stopping Cross-Validation [41.092016771160566]
Cross-validation drastically increases the computational cost of validating a single configuration. Our study shows that even a simple-to-understand and easy-to-implement method consistently allows model selection to converge faster.
arXiv Detail & Related papers (2024-05-06T11:51:09Z)
Uncertainty-Calibrated Test-Time Model Adaptation without Forgetting [65.21599711087538]
Test-time adaptation (TTA) seeks to tackle potential distribution shifts between training and test data by adapting a given model w.r.t. any test sample. Prior methods perform backpropagation for each test sample, resulting in unbearable optimization costs to many applications. We propose an Efficient Anti-Forgetting Test-Time Adaptation (EATA) method which develops an active sample selection criterion to identify reliable and non-redundant samples.
arXiv Detail & Related papers (2024-03-18T05:49:45Z)
On Pitfalls of Test-Time Adaptation [82.8392232222119]
Test-Time Adaptation (TTA) has emerged as a promising approach for tackling the robustness challenge under distribution shifts. We present TTAB, a test-time adaptation benchmark that encompasses ten state-of-the-art algorithms, a diverse array of distribution shifts, and two evaluation protocols.
arXiv Detail & Related papers (2023-06-06T09:35:29Z)
DELTA: degradation-free fully test-time adaptation [59.74287982885375]
We find that two unfavorable defects are concealed in the prevalent adaptation methodologies like test-time batch normalization (BN) and self-learning. First, we reveal that the normalization statistics in test-time BN are completely affected by the currently received test samples, resulting in inaccurate estimates. Second, we show that during test-time adaptation, the parameter update is biased towards some dominant classes.
arXiv Detail & Related papers (2023-01-30T15:54:00Z)
Sequential Kernelized Independence Testing [77.237958592189]
We design sequential kernelized independence tests inspired by kernelized dependence measures. We demonstrate the power of our approaches on both simulated and real data.
arXiv Detail & Related papers (2022-12-14T18:08:42Z)
Three New Validators and a Large-Scale Benchmark Ranking for Unsupervised Domain Adaptation [37.03614011735927]
We propose three new validators for unsupervised domain adaptation (UDA). We compare and rank them against five other existing validators, on a large dataset of 1,000,000 checkpoints. We find that two of our proposed validators achieve state-of-the-art performance in various settings.
arXiv Detail & Related papers (2022-08-15T17:55:26Z)
Cross-validation Confidence Intervals for Test Error [83.67415139421448]
This work develops central limit theorems for cross-validation and consistent estimators of its variance under weak stability conditions on the learning algorithm. Results are the first of their kind for the popular choice of leave-one-out cross-validation.
arXiv Detail & Related papers (2020-07-24T17:40:06Z)

This list is automatically generated from the titles and abstracts of the papers on this site.