Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging
- URL: http://arxiv.org/abs/2602.03702v1
- Date: Tue, 03 Feb 2026 16:24:05 GMT
- Title: Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging
- Authors: Alexandru Meterez, Pranav Ajit Nair, Depen Morwani, Cengiz Pehlevan, Sham Kakade,
- Abstract summary: We show that anytime learning-rate schedules polynomially decay with time, with the decay rate determined by the source and capacity conditions of the problem. Our results suggest that weight averaging combined with simple, horizon-free step sizes offers a practical and effective anytime alternative to cosine learning rate schedules for large language model pretraining.
- Score: 70.05077723711618
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models are increasingly trained in continual or open-ended settings, where the total training horizon is not known in advance. Despite this, most existing pretraining recipes are not anytime: they rely on horizon-dependent learning rate schedules and extensive tuning under a fixed compute budget. In this work, we provide a theoretical analysis demonstrating the existence of anytime learning schedules for overparameterized linear regression, and we highlight the central role of weight averaging - also known as model merging - in achieving the minimax convergence rates of stochastic gradient descent. We show that these anytime schedules polynomially decay with time, with the decay rate determined by the source and capacity conditions of the problem. Empirically, we evaluate 150M and 300M parameter language models trained at 1-32x Chinchilla scale, comparing constant learning rates with weight averaging and $1/\sqrt{t}$ schedules with weight averaging against a well-tuned cosine schedule. Across the full training range, the anytime schedules achieve comparable final loss to cosine decay. Taken together, our results suggest that weight averaging combined with simple, horizon-free step sizes offers a practical and effective anytime alternative to cosine learning rate schedules for large language model pretraining.
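As a concrete illustration of the recipe described in the abstract, here is a minimal sketch (not the authors' code) pairing a horizon-free $1/\sqrt{t}$ step size with a uniform running weight average; the function names, base rate, and warmup length are assumptions:

```python
# Minimal sketch of an anytime recipe: a horizon-free 1/sqrt(t) learning-rate
# schedule plus a uniform running weight average. All names and the base_lr /
# warmup choices below are illustrative assumptions, not the paper's settings.
import copy
import torch

def horizon_free_lr(step: int, base_lr: float = 3e-4, warmup: int = 1000) -> float:
    """Anytime schedule: linear warmup, then base_lr / sqrt(t); no total horizon T appears."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    return base_lr / ((step - warmup + 1) ** 0.5)

class RunningWeightAverage:
    """Uniform running average of weights: avg_t = avg_{t-1} + (w_t - avg_{t-1}) / t."""
    def __init__(self, model: torch.nn.Module):
        self.avg = copy.deepcopy(model).eval()  # holds the averaged weights
        self.count = 0

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        self.count += 1
        for p_avg, p in zip(self.avg.parameters(), model.parameters()):
            p_avg.add_(p - p_avg, alpha=1.0 / self.count)
```

Because neither the step size nor the average references a total horizon, training can be stopped at any step and the averaged model evaluated, which is what makes the recipe anytime.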
Related papers
- Seesaw: Accelerating Training by Balancing Learning Rate and Batch Size Scheduling [75.36692892951018]
Increasing the batch size during training is a promising strategy to accelerate large language model pretraining. This work develops a principled framework for batch-size scheduling and introduces Seesaw: whenever a standard scheduler would halve the learning rate, Seesaw instead multiplies it by $1/\sqrt{2}$ and doubles the batch size.
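A toy rendering of the Seesaw rule as stated above (illustrative only; the starting rate and batch size are placeholders):

```python
# Illustrative sketch of the Seesaw rule from the abstract (not the paper's
# implementation): wherever a baseline schedule would set lr <- lr / 2, apply
# lr <- lr / sqrt(2) and batch_size <- 2 * batch_size instead.
import math

def seesaw_step(lr: float, batch_size: int) -> tuple[float, int]:
    """One Seesaw event, replacing what would have been a learning-rate halving."""
    return lr / math.sqrt(2), batch_size * 2

lr, bs = 3e-4, 256
for _ in range(3):        # three events where a baseline schedule would halve lr
    lr, bs = seesaw_step(lr, bs)
print(lr, bs)             # lr scaled by 2**-1.5 overall; batch size x8
```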
arXiv Detail & Related papers (2025-10-16T14:17:38Z)
- A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules [67.87680482844884]
We present an empirical law that describes how the pretraining loss of large language models evolves under different learning rate schedules. Our proposed law takes a multi-power form, combining a power law based on the sum of learning rates with additional power laws that account for the loss reduction induced by learning rate decay.
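A schematic of what a multi-power law of this shape can look like, written in illustrative notation rather than the paper's exact parameterization; $\eta_t$ is the step-$t$ learning rate and $L_0, A, B, \alpha, \beta$ are fitted constants:

```latex
% Illustrative multi-power shape (not the paper's exact law): a power law in the
% cumulative learning rate, minus decay-induced loss-reduction terms.
\[
  \hat{L}(T) = L_0
  + A \Big( \sum_{t=1}^{T} \eta_t \Big)^{-\alpha}
  - B \sum_{t=1}^{T} (\eta_{t-1} - \eta_t) \Big( \sum_{s=t}^{T} \eta_s \Big)^{-\beta}
\]
```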
arXiv Detail & Related papers (2025-03-17T04:36:45Z)
- Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training [28.73282119260246]
Unlabeled data presents both opportunities and challenges for training artificial intelligence systems. While self-supervised learning has emerged as a powerful paradigm for extracting meaningful representations from vast amounts of unlabeled data, existing methods still struggle to adapt to the non-stationary, non-IID nature of real-world data streams without forgetting previously learned knowledge. In this work, we systematically compare the widely used cosine schedule with the recently proposed infinite learning rate schedule and empirically find the latter to be a more effective alternative.
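A hedged sketch of an infinite (horizon-free) schedule as commonly described for continual pretraining: warmup, cooldown to a plateau, then a constant rate held indefinitely; all phase lengths and rates below are illustrative assumptions:

```python
# Hedged sketch of an "infinite" learning-rate schedule: warmup, a cooldown to a
# constant plateau, then a constant rate held for as long as data keeps coming
# (a final anneal would be applied only when stopping). Values are illustrative.
def infinite_lr(step: int,
                peak: float = 3e-4,
                floor: float = 3e-5,
                warmup: int = 1000,
                cooldown: int = 10_000) -> float:
    if step < warmup:                      # linear warmup to the peak rate
        return peak * (step + 1) / warmup
    if step < warmup + cooldown:           # linear cooldown from peak to the plateau
        frac = (step - warmup) / cooldown
        return peak + frac * (floor - peak)
    return floor                           # plateau: no total horizon needed
```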
arXiv Detail & Related papers (2025-03-04T18:15:57Z)
- The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws [51.608402959163925]
We present the first systematic exploration of optimal sparse pre-training configurations for large language models. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. We propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training.
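The proposed modification can be sketched as the Chinchilla form with $N$ replaced by its average over the run; the notation below is illustrative, not the paper's exact fit:

```latex
% Chinchilla-style loss law with N replaced by the average parameter count over
% training (illustrative notation): N_t is the live parameter count at step t,
% D the number of training tokens, and E, A, B, alpha, beta fitted constants.
\[
  L(\bar{N}, D) = E + \frac{A}{\bar{N}^{\alpha}} + \frac{B}{D^{\beta}},
  \qquad
  \bar{N} = \frac{1}{T} \sum_{t=1}^{T} N_t
\]
```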
arXiv Detail & Related papers (2025-01-21T20:23:22Z)
- Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations [62.132347451049455]
Scale has become a main ingredient in obtaining strong machine learning models.
In this work, we argue that scale and training research has been needlessly complex due to reliance on the cosine schedule.
We show that weight averaging yields improved performance along the training trajectory, without additional training costs, across different scales.
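In the same spirit, averaging checkpoints saved along one trajectory costs no extra training; a minimal illustration follows (the checkpoint paths are hypothetical, and this is not the paper's code):

```python
# Minimal illustration of averaging checkpoints saved along a single training
# trajectory. Paths are hypothetical placeholders.
import torch

def average_checkpoints(paths: list[str]) -> dict[str, torch.Tensor]:
    """Uniformly average the parameter tensors of several saved state dicts."""
    avg: dict[str, torch.Tensor] = {}
    for i, path in enumerate(paths, start=1):
        state = torch.load(path, map_location="cpu")
        for k, v in state.items():
            if i == 1:
                avg[k] = v.clone().float()
            else:
                avg[k] += (v.float() - avg[k]) / i   # running uniform average
    return avg

merged = average_checkpoints(["ckpt_1000.pt", "ckpt_2000.pt", "ckpt_3000.pt"])
```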
arXiv Detail & Related papers (2024-05-28T17:33:54Z)
- Optimal Linear Decay Learning Rate Schedules and Further Refinements [46.79573408189601]
Learning rate schedules used in practice bear little resemblance to those recommended by theory.
We close much of this theory/practice gap, and as a consequence are able to derive new problem-adaptive learning rate schedules.
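For reference, the linear decay schedule named in the title has a one-line form; unlike the horizon-free schedules above, it requires the total step count up front (values below are illustrative):

```python
# Linear decay from a peak rate to zero; note it needs total_steps in advance,
# in contrast to the anytime schedules sketched earlier. Values are placeholders.
def linear_decay_lr(step: int, total_steps: int, peak: float = 3e-4) -> float:
    return peak * (1.0 - step / total_steps)
```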
arXiv Detail & Related papers (2023-10-11T19:16:35Z)