Scaling and Transferability of Annealing Strategies in Large Language Model Training
- URL: http://arxiv.org/abs/2512.13705v1
- Date: Fri, 05 Dec 2025 16:38:33 GMT
- Title: Scaling and Transferability of Annealing Strategies in Large Language Model Training
- Authors: Siqi Wang, Zhengyu Chen, Teng Xiao, Zheqi Lv, Jinluan Yang, Xunliang Cai, Jingang Wang, Xiaomeng Li
- Abstract summary: We refine a predictive framework for optimizing annealing strategies under the Warmup-Steady-Decay (WSD) scheduler. Our improved framework incorporates training steps, maximum learning rate, and annealing behavior, enabling more efficient optimization of learning rate schedules. We validate our findings through extensive experiments using both Dense and Mixture-of-Experts (MoE) models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning rate scheduling is crucial for training large language models, yet understanding the optimal annealing strategies across different model configurations remains challenging. In this work, we investigate the transferability of annealing dynamics in large language model training and refine a generalized predictive framework for optimizing annealing strategies under the Warmup-Steady-Decay (WSD) scheduler. Our improved framework incorporates training steps, maximum learning rate, and annealing behavior, enabling more efficient optimization of learning rate schedules. Our work provides practical guidance for selecting optimal annealing strategies without exhaustive hyperparameter searches, demonstrating that smaller models can serve as reliable proxies for optimizing the training dynamics of larger models. We validate our findings through extensive experiments using both Dense and Mixture-of-Experts (MoE) models, demonstrating that optimal annealing ratios follow consistent patterns and can be transferred across different training configurations.
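For concreteness, below is a minimal sketch of a WSD learning-rate schedule with a tunable annealing ratio. The function name, the linear decay shape, and the default ratios are illustrative assumptions, not details taken from the paper.

```python
def wsd_lr(step, total_steps, max_lr, warmup_ratio=0.01, anneal_ratio=0.1):
    """Piecewise Warmup-Steady-Decay schedule: linear warmup to max_lr,
    a constant plateau, then a decay (annealing) phase covering the
    final anneal_ratio fraction of training."""
    warmup_steps = int(total_steps * warmup_ratio)
    anneal_steps = int(total_steps * anneal_ratio)
    steady_end = total_steps - anneal_steps

    if step < warmup_steps:                        # warmup: 0 -> max_lr
        return max_lr * step / max(1, warmup_steps)
    if step < steady_end:                          # steady: hold max_lr
        return max_lr
    progress = (step - steady_end) / max(1, anneal_steps)
    return max_lr * (1.0 - progress)               # decay: max_lr -> 0
```

Under the paper's transferability finding, a value of `anneal_ratio` tuned on a small proxy model would be reused directly when training a larger one, avoiding an exhaustive search at the target scale.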
Related papers
- Implicit Modeling for Transferability Estimation of Vision Foundation Models [33.73062179456684]
Implicit Transferability Modeling (ITM) is a novel framework that implicitly models each model's intrinsic transferability. ITM consistently outperforms existing methods in terms of stability, effectiveness, and efficiency.
arXiv Detail & Related papers (2025-10-27T09:21:19Z)
- Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts [113.0656076371565]
We propose a novel router-aware approach to optimize importance sampling weights in off-policy reinforcement learning (RL). Specifically, we design a rescaling strategy guided by router logits, which effectively reduces gradient variance and mitigates training divergence. Experimental results demonstrate that our method significantly improves both the convergence stability and the final performance of MoE models.
arXiv Detail & Related papers (2025-10-27T05:47:48Z)
- AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining [12.630306478872043]
We propose AdaLRS, a plug-and-play adaptive learning rate search algorithm that conducts online optimal learning rate search. Experiments show that AdaLRS adjusts suboptimal learning rates to the neighborhood of the optimum with marked efficiency and effectiveness (an illustrative sketch of this idea appears after this list).
arXiv Detail & Related papers (2025-06-16T09:14:01Z)
- Optimizing ML Training with Metagradient Descent [69.89631748402377]
We introduce an algorithm for efficiently calculating metagradients -- gradients through model training -- at scale. We then introduce a "smooth model training" framework that enables effective optimization using metagradients.
arXiv Detail & Related papers (2025-03-17T22:18:24Z)
- The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training [55.233765889424035]
We show that learning-rate schedules for large model training behave surprisingly similarly to a convex bound from non-smooth optimization theory. We achieve noticeable improvements for training 124M and 210M Llama-type models by (i) extending the schedule for continued training with an optimal learning rate, and (ii) transferring the optimal learning rate across schedules.
arXiv Detail & Related papers (2025-01-31T08:55:56Z)
- Learning Dynamic Representations via An Optimally-Weighted Maximum Mean Discrepancy Optimization Framework for Continual Learning [16.10753846850319]
Continual learning allows models to persistently acquire and retain information, but catastrophic forgetting can severely impair model performance. We introduce a novel framework termed Optimally-Weighted Maximum Mean Discrepancy (OWMMD), which imposes penalties on representation alterations.
arXiv Detail & Related papers (2025-01-21T13:33:45Z)
- Optimization Hyper-parameter Laws for Large Language Models [52.49860340549727]
We present Opt-Laws, a framework that captures the relationship between hyper-parameters and training outcomes. Our validation across diverse model sizes and data scales demonstrates Opt-Laws' ability to accurately predict training loss. This approach significantly reduces computational costs while enhancing overall model performance.
arXiv Detail & Related papers (2024-09-07T09:37:19Z)
- Planning with Diffusion for Flexible Behavior Synthesis [125.24438991142573]
We consider what it would look like to fold as much of the trajectory optimization pipeline as possible into the modeling problem.
The core of our technical approach lies in a diffusion probabilistic model that plans by iteratively denoising trajectories.
arXiv Detail & Related papers (2022-05-20T07:02:03Z)
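As flagged in the AdaLRS entry above, here is a hedged sketch of loss-guided online learning-rate adjustment. It only illustrates the general idea of rescaling the learning rate based on the observed loss descent; the windowing scheme, thresholds, and scaling factors are hypothetical and not taken from the AdaLRS paper.

```python
import numpy as np

def adjust_lr(lr, loss_history, window=100, grow=2.0, shrink=0.5):
    """Hypothetical loss-guided LR adjustment: compare the average loss
    over two consecutive windows and rescale the learning rate when
    descent stalls or reverses. All constants are illustrative."""
    if len(loss_history) < 2 * window:
        return lr                      # not enough history yet
    prev = np.mean(loss_history[-2 * window:-window])
    curr = np.mean(loss_history[-window:])
    slope = prev - curr                # positive means the loss is falling
    if slope <= 0:                     # loss flat or rising: LR likely too large
        return lr * shrink
    if slope < 1e-3 * prev:            # descent nearly stalled: probe a larger LR
        return lr * grow
    return lr                          # descent is healthy: keep the current LR
```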