Weight Decay may matter more than muP for Learning Rate Transfer in Practice
- URL: http://arxiv.org/abs/2510.19093v1
- Date: Tue, 21 Oct 2025 21:36:14 GMT
- Title: Weight Decay may matter more than muP for Learning Rate Transfer in Practice
- Authors: Atli Kosson, Jeremy Welborn, Yang Liu, Martin Jaggi, Xi Chen,
- Abstract summary: We show that the scaling rules of muP rely on strong assumptions about the geometric alignment of a layer's inputs with both its weights and gradient updates. These assumptions hold only briefly at the start of training; for the remainder of training it is weight decay rather than muP that correctly stabilizes the update dynamics of internal representations across widths. This suggests muP's scaling primarily acts as a form of implicit learning rate warmup, allowing us to largely replace it with modified warmup schedules.
- Score: 43.243484751818066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transferring the optimal learning rate from small to large neural networks can enable efficient training at scales where hyperparameter tuning is otherwise prohibitively expensive. To this end, the Maximal Update Parameterization (muP) proposes a learning rate scaling designed to keep the update dynamics of internal representations stable across different model widths. However, the scaling rules of muP rely on strong assumptions, particularly about the geometric alignment of a layer's inputs with both its weights and gradient updates. In this large-scale empirical investigation, we show that these assumptions hold only briefly at the start of training in the practical setups where learning rate transfer is most valuable, such as LLM training. For the remainder of training it is weight decay rather than muP that correctly stabilizes the update dynamics of internal representations across widths, facilitating learning rate transfer. This suggests muP's scaling primarily acts as a form of implicit learning rate warmup, allowing us to largely replace it with modified warmup schedules. Together these findings fundamentally challenge prevailing beliefs about learning rate transfer and can explain empirical practice such as why muP requires the independent weight decay variant for successful transfer.
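The abstract's two key ingredients, per-width learning rate scaling and the independent weight decay variant, can be made concrete in a few lines. Below is a minimal sketch (not the paper's code) for a toy MLP: base_width, width, base_lr, and wd are illustrative values, and the sketch omits other muP details such as initialization and output multipliers.

```python
# Hedged sketch: muP-style per-layer learning rates for a toy MLP, combined with
# the "independent" weight decay variant mentioned in the abstract.
import torch
import torch.nn as nn

base_width, width = 256, 1024      # width of the tuned proxy model vs. the target model
base_lr, wd = 1e-3, 0.1            # hyperparameters tuned on the small proxy

model = nn.Sequential(
    nn.Linear(512, width),         # input-like layer: LR kept at base_lr under muP (Adam)
    nn.ReLU(),
    nn.Linear(width, width),       # hidden layer: muP scales its LR as 1/width
    nn.ReLU(),
    nn.Linear(width, 10),          # readout layer: also fan-in ~ width here
)

hidden_lr = base_lr * base_width / width   # muP transfer rule for fan-in ~ width weights

param_groups = [
    {"params": model[0].parameters(), "lr": base_lr},
    {"params": model[2].parameters(), "lr": hidden_lr},
    {"params": model[4].parameters(), "lr": hidden_lr},
]

# Independent weight decay: AdamW's decay step is lr * weight_decay * w, so keeping
# that product fixed across widths decouples the decay strength from the scaled LR.
for g in param_groups:
    g["weight_decay"] = wd * base_lr / g["lr"]

opt = torch.optim.AdamW(param_groups, betas=(0.9, 0.95))
```

With these groups, widening the model lowers the hidden-layer learning rate but leaves the effective per-step decay unchanged, which is the combination the abstract argues actually stabilizes update dynamics later in training.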
Related papers
- How to Set the Learning Rate for Large-Scale Pre-training? [73.03133634525635]
We formalize this investigation into two distinct research paradigms: Fitting and Transfer. Within the Fitting Paradigm, we introduce a Scaling Law for the search factor, effectively reducing the search complexity from O(n^3) to O(n*C_D*C_) via predictive modeling. We extend the principles of muTransfer to the Mixture of Experts (MoE) architecture, broadening its applicability to encompass model depth, weight decay, and token horizons.
arXiv Detail & Related papers (2026-01-08T15:55:13Z) - Unveiling the Role of Learning Rate Schedules via Functional Scaling Laws [9.332823269318842]
Scaling laws have played a cornerstone role in guiding the training of large language models (LLMs). We introduce the Functional Scaling Law, which characterizes the evolution of population risk during the training process for general learning rate schedules (LRSs). We analyze three widely used LRSs -- constant, exponential decay, and warmup-stable-decay (WSD) -- under both data-limited and compute-limited regimes.
arXiv Detail & Related papers (2025-09-23T16:05:16Z) - Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training [35.81422928960327]
We show that Schedule-Free (SF) effectively navigates the "river" structure of the loss landscape without decay phases or auxiliary averaging. We propose a refined variant of SF that improves robustness to momentum and performs better under large batch sizes.
arXiv Detail & Related papers (2025-07-14T00:54:48Z) - Pay Attention to Small Weights [26.613296190219103]
NanoADAM dynamically updates only the small-magnitude weights during finetuning. It preserves large-magnitude weights, which are likely to encode critical features learned during pretraining.
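As a rough illustration of the mechanism described above, the sketch below takes an ordinary optimizer step and then undoes it for large-magnitude weights, so only the smallest fraction of each tensor actually moves. This is an assumption-laden reading of the summary, not the NanoADAM implementation; the bottom-k threshold rule and the small_fraction parameter are invented for illustration.

```python
# Hedged sketch: restrict an optimizer step to small-magnitude weights only.
import torch

def masked_step(optimizer, model, small_fraction=0.5):
    """Step the optimizer, then restore large-magnitude weights so that only the
    smallest `small_fraction` of each parameter tensor is actually updated."""
    old = {p: p.detach().clone() for p in model.parameters()}
    optimizer.step()
    with torch.no_grad():
        for p in model.parameters():
            k = max(1, int(small_fraction * p.numel()))
            thresh = old[p].abs().flatten().kthvalue(k).values
            keep_old = old[p].abs() > thresh          # large pretrained weights stay frozen
            p.copy_(torch.where(keep_old, old[p], p))
```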
arXiv Detail & Related papers (2025-06-26T15:22:55Z) - CLASSP: a Biologically-Inspired Approach to Continual Learning through Adjustment Suppression and Sparsity Promotion [0.0]
This paper introduces a new training method named Continual Learning through Adjustment Suppression and Sparsity Promotion (CLASSP).
CLASSP is based on two main principles observed in neuroscience, particularly in the context of synaptic transmission and Long-Term Potentiation.
When compared with Elastic Weight Consolidation (EWC) on the same datasets, CLASSP demonstrates superior performance in terms of accuracy and memory footprint.
arXiv Detail & Related papers (2024-04-29T13:31:00Z) - Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks [33.88586668321127]
This study investigates how weight decay affects the update behavior of individual neurons in deep neural networks.
We show that explicitly controlling the rotation provides the benefits of weight decay while substantially reducing the need for learning rate warmup.
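A quick way to see what "rotation" means here is to log the angle between consecutive weight matrices during training. The toy run below is only an illustration under assumed settings (the layer size, lr, and weight_decay are arbitrary), not the paper's experiment; the paper's claim is that under weight decay this per-step angle approaches an equilibrium.

```python
# Hedged sketch: track the per-step rotation (angle between consecutive weight
# vectors) of one layer trained with AdamW and weight decay.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
layer = nn.Linear(128, 128, bias=False)
opt = torch.optim.AdamW(layer.parameters(), lr=1e-3, weight_decay=0.1)

for step in range(5001):
    x = torch.randn(64, 128)
    loss = F.mse_loss(layer(x), torch.randn(64, 128))
    prev = layer.weight.detach().clone()
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        cos = F.cosine_similarity(prev.flatten(), layer.weight.flatten(), dim=0)
        angle = torch.arccos(cos.clamp(-1.0, 1.0))
    if step % 1000 == 0:
        print(f"step {step:5d}  per-step angle ~ {angle.item():.5f} rad")
```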
arXiv Detail & Related papers (2023-05-26T19:14:01Z) - Meta-Learning Fast Weight Language Models [105.66999854213724]
We present Fast Weight Layers (FWLs), a neural component that provides the benefits of dynamic evaluation much more efficiently.
FWLs can be applied at training time so the model learns to make good use of gradient updates.
arXiv Detail & Related papers (2022-12-05T18:37:09Z) - Flatter, faster: scaling momentum for optimal speedup of SGD [0.0]
We study the training dynamics arising from the interplay between stochastic gradient descent (SGD), label noise, and momentum in the training of neural networks.
We find that scaling the momentum hyperparameter $1-\beta$ with the learning rate to the power of $2/3$ maximally accelerates training, without sacrificing generalization.
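Stated as a formula (my reading of the sentence above, with $\beta$ the momentum coefficient, $\eta$ the learning rate, and an assumed tuned reference pair $(\eta_0, \beta_0)$), the rule can be applied as:

```latex
% Assumed usage of the quoted 2/3-power scaling rule, not taken verbatim from the paper:
1 - \beta \;\propto\; \eta^{2/3}
\qquad\Longrightarrow\qquad
\beta(\eta) \;=\; 1 - \bigl(1 - \beta_0\bigr)\left(\tfrac{\eta}{\eta_0}\right)^{2/3}
```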
arXiv Detail & Related papers (2022-10-28T20:41:48Z) - Rethinking Importance Weighting for Transfer Learning [71.81262398144946]
A key assumption in supervised learning is that training and test data follow the same probability distribution.
As real-world machine learning tasks become increasingly complex, novel approaches are being explored to cope with such challenges.
arXiv Detail & Related papers (2021-12-19T14:35:25Z) - RIFLE: Backpropagation in Depth for Deep Transfer Learning through Re-Initializing the Fully-connected LayEr [60.07531696857743]
Fine-tuning a deep convolutional neural network (CNN) using a pre-trained model helps transfer knowledge learned from larger datasets to the target task.
We propose RIFLE - a strategy that deepens backpropagation in transfer learning settings.
RIFLE brings meaningful updates to the weights of deep CNN layers and improves low-level feature learning.
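Read together with the title, the strategy amounts to periodically re-initializing the fully-connected head during fine-tuning so that larger gradients reach the lower convolutional layers. The sketch below is one plausible rendering of that idea, not the authors' code; the reset period, initialization scale, and choice of resnet18 are illustrative assumptions.

```python
# Hedged sketch: periodically re-initialize the fully-connected head while
# fine-tuning a pre-trained CNN.
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(weights="IMAGENET1K_V1")            # pre-trained backbone
model.fc = nn.Linear(model.fc.in_features, 10)       # new head for the target task

def maybe_reset_head(model, epoch, period=10):
    """Every `period` epochs, re-initialize the head so fresh, larger gradients
    flow back into (and further train) the convolutional layers."""
    if epoch > 0 and epoch % period == 0:
        nn.init.normal_(model.fc.weight, std=0.01)
        nn.init.zeros_(model.fc.bias)
```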
arXiv Detail & Related papers (2020-07-07T11:27:43Z) - AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS).
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.