Instability in Downstream Task Performance During LLM Pretraining
- URL: http://arxiv.org/abs/2510.04848v1
- Date: Mon, 06 Oct 2025 14:33:38 GMT
- Title: Instability in Downstream Task Performance During LLM Pretraining
- Authors: Yuto Nishida, Masaru Isonuma, Yusuke Oda
- Abstract summary: We investigate the stability of downstream task performance in a large language model (LLM) trained on diverse web-scale corpora. We find that task scores frequently fluctuate throughout training, both at the aggregate and example levels. To address this instability, we investigate two post-hoc checkpoint integration methods: checkpoint averaging and ensemble.
- Score: 12.840216854750565
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When training large language models (LLMs), it is common practice to track downstream task performance throughout the training process and select the checkpoint with the highest validation score. However, downstream metrics often exhibit substantial fluctuations, making it difficult to identify the checkpoint that truly represents the best-performing model. In this study, we empirically analyze the stability of downstream task performance in an LLM trained on diverse web-scale corpora. We find that task scores frequently fluctuate throughout training, both at the aggregate and example levels. To address this instability, we investigate two post-hoc checkpoint integration methods: checkpoint averaging and ensemble, motivated by the hypothesis that aggregating neighboring checkpoints can reduce performance volatility. We demonstrate both empirically and theoretically that these methods improve downstream performance stability without requiring any changes to the training procedure.
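The two integration methods named in the abstract can be illustrated with a minimal sketch. It assumes uniform weighting over a window of neighboring checkpoints, which the abstract does not specify. The `Checkpoint` type and the function names `average_checkpoints` and `ensemble_probabilities` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence

# Hypothetical representation: parameter name -> flat list of weights.
Checkpoint = Dict[str, List[float]]


def average_checkpoints(checkpoints: Sequence[Checkpoint]) -> Checkpoint:
    """Checkpoint averaging: take the element-wise mean of the parameters
    (uniform weights are an assumption made for this sketch)."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged: Checkpoint = {}
    for name in checkpoints[0]:
        # Group the i-th weight of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / n for col in columns]
    return averaged


def ensemble_probabilities(prob_rows: Sequence[Sequence[float]]) -> List[float]:
    """Ensemble: average the output probabilities that each checkpoint
    assigns to the same example, leaving the parameters untouched."""
    if not prob_rows:
        raise ValueError("need at least one prediction")
    n = len(prob_rows)
    return [sum(col) / n for col in zip(*prob_rows)]


# Three neighboring checkpoints with a single toy parameter "w".
neighbors = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}, {"w": [5.0, 6.0]}]
merged = average_checkpoints(neighbors)

# Two checkpoints' class probabilities for one example.
combined = ensemble_probabilities([[0.5, 0.5], [0.25, 0.75]])
```

Averaging produces a single set of weights, so inference costs the same as for one model. The ensemble needs a forward pass per checkpoint but combines predictions rather than parameters.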
Related papers
- AnaCP: Toward Upper-Bound Continual Learning via Analytic Contrastive Projection [11.750791465488438]
This paper studies the problem of class-incremental learning (CIL). Traditional CIL methods, which do not leverage pre-trained models (PTMs), suffer from catastrophic forgetting (CF). We propose AnaCP, a novel method that preserves the efficiency of analytic classifiers while enabling incremental feature adaptation without gradient-based training.
arXiv Detail & Related papers (2025-11-17T19:56:15Z)
- Uncertainty-Guided Checkpoint Selection for Reinforcement Finetuning of Large Language Models [27.97382399449914]
Reinforcement learning (RL) finetuning is crucial to aligning large language models (LLMs), but the process is notoriously unstable. In practice, selecting the best checkpoint is challenging: evaluating checkpoints on the validation set during training is computationally expensive and requires a good validation set. We introduce an uncertainty-guided approach for checkpoint selection (UGCS) that avoids these pitfalls.
arXiv Detail & Related papers (2025-11-13T01:46:58Z)
- BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning [82.925106913459]
Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning. We introduce BOTS, a unified framework for Bayesian Online Task Selection in LLM reinforcement finetuning.
arXiv Detail & Related papers (2025-10-30T11:15:23Z)
- MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics [72.00014675808228]
Instability in the evaluation process of Large Language Models obscures true learning dynamics. We introduce MaP, a framework that integrates Merging and the Pass@k metric. Experiments show that MaP yields significantly smoother performance curves, reduces inter-run variance, and ensures more consistent rankings.
arXiv Detail & Related papers (2025-10-10T11:40:27Z)
- Parameter-Efficient Checkpoint Merging via Metrics-Weighted Averaging [2.9761595094633435]
Checkpoint merging is a technique for combining multiple model snapshots into a single superior model. This paper explores checkpoint merging in the context of parameter-efficient fine-tuning. We propose Metrics-Weighted Averaging (MWA) to merge model checkpoints by weighting their parameters according to performance metrics.
arXiv Detail & Related papers (2025-04-23T05:11:21Z)
- Fast Adaptation with Behavioral Foundation Models [82.34700481726951]
Unsupervised zero-shot reinforcement learning has emerged as a powerful paradigm for pretraining behavioral foundation models. Despite promising results, zero-shot policies are often suboptimal due to errors induced by the unsupervised training process. We propose fast adaptation strategies that search in the low-dimensional task-embedding space of the pre-trained BFM to rapidly improve the performance of its zero-shot policies.
arXiv Detail & Related papers (2025-04-10T16:14:17Z)
- SeWA: Selective Weight Average via Probabilistic Masking [51.015724517293236]
We show that only a few points are needed to achieve better and faster convergence. We transform the discrete selection problem into a continuous subset optimization framework. We derive the SeWA's stability bounds, which are sharper than that under both convex image checkpoints.
arXiv Detail & Related papers (2025-02-14T12:35:21Z)
- Early-Stage Anomaly Detection: A Study of Model Performance on Complete vs. Partial Flows [0.0]
This study investigates the efficacy of machine learning models in network security threat detection through the critical lens of partial versus complete flow information. We evaluate how a standard benchmark model, Random Forest, performs under varying training and testing conditions.
arXiv Detail & Related papers (2024-07-03T07:14:25Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis [27.310894780313618]
This paper undertakes a comprehensive comparison of model capabilities at various pretraining intermediate checkpoints.
We confirm that specific downstream metrics exhibit similar training dynamics across models of different sizes.
In addition to our core findings, we've reproduced Amber and OpenLLaMA, releasing their intermediate checkpoints.
arXiv Detail & Related papers (2024-04-01T16:00:01Z)
- Large Language Models are Miscalibrated In-Context Learners [22.30783674111999]
In this work, we deliver an in-depth analysis of the behavior across different choices of learning methods. We observe that the miscalibration problem exists across all learning methods in low-resource setups. We find that self-ensembling with max probability produces robust and calibrated predictions.
arXiv Detail & Related papers (2023-12-21T11:55:10Z)
- Test-Time Adaptation with Perturbation Consistency Learning [32.58879780726279]
We propose a simple test-time adaptation method to promote the model to make stable predictions for samples with distribution shifts.
Our method can achieve higher or comparable performance with less inference time over strong PLM backbones.
arXiv Detail & Related papers (2023-04-25T12:29:22Z)
- Average of Pruning: Improving Performance and Stability of Out-of-Distribution Detection [37.43981354073841]
We find the performance of OOD detection suffers from overfitting and instability during training.
We propose Average of Pruning (AoP), consisting of model averaging and pruning, to mitigate the unstable behaviors.
arXiv Detail & Related papers (2023-03-02T12:34:38Z)
- DELTA: degradation-free fully test-time adaptation [59.74287982885375]
We find that two unfavorable defects are concealed in the prevalent adaptation methodologies like test-time batch normalization (BN) and self-learning.
First, we reveal that the normalization statistics in test-time BN are completely affected by the currently received test samples, resulting in inaccurate estimates.
Second, we show that during test-time adaptation, the parameter update is biased towards some dominant classes.
arXiv Detail & Related papers (2023-01-30T15:54:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.