How to set AdamW's weight decay as you scale model and dataset size
- URL: http://arxiv.org/abs/2405.13698v2
- Date: Sun, 02 Feb 2025 21:46:50 GMT
- Title: How to set AdamW's weight decay as you scale model and dataset size
- Authors: Xi Wang, Laurence Aitchison
- Abstract summary: We show that weights learned by AdamW can be understood as an exponential moving average (EMA) of recent updates.
This gives critical insights for how to set the weight decay in AdamW, and how the weight decay should scale with model and dataset size.
- Abstract: The scaling of the optimal AdamW weight decay hyperparameter with model and dataset size is critical as we seek to build larger models, but is poorly understood. We show that weights learned by AdamW can be understood as an exponential moving average (EMA) of recent updates. This gives critical insights for how to set the weight decay in AdamW, and how the weight decay should scale with model and dataset size. In particular, the key hyperparameter for an exponential moving average is the EMA timescale. Intuitively, the EMA timescale can be understood as the number of recent iterations the EMA averages over. We find that the optimal timescale, measured in epochs, is roughly constant as we change model and dataset size. Moreover, given a learning rate, there is a one-to-one mapping from the EMA timescale to the weight decay hyperparameter. Thus, if the optimal EMA timescale is constant, that implies that as the dataset size increases, the optimal weight decay should fall and as the model size increases, the optimal weight decay should increase (if we follow the muP recommendation for scaling the learning rate). We validate these scaling rules on ResNet-18 and Vision Transformers trained on CIFAR-10 and ImageNet, and on NanoGPT pre-training on OpenWebText. Finally, we found that as training progresses, muP's learning rate scaling breaks down for AdamW unless weight decay is scaled appropriately.
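The mapping from EMA timescale to weight decay described in the abstract lends itself to a small worked example. The sketch below is illustrative only: it assumes the usual AdamW EMA relationship tau_iters ≈ 1 / (learning rate × weight decay) and a timescale held fixed when measured in epochs, as the abstract describes; the function name, parameter names, and numbers are hypothetical and not taken from the paper.
```python
# Minimal sketch: choose AdamW's weight decay from a target EMA timescale
# measured in epochs, assuming tau_iters ~= 1 / (lr * weight_decay).
# All names and default values here are illustrative, not from the paper.

def weight_decay_from_timescale(tau_epochs, lr, dataset_size, batch_size):
    """Return the AdamW weight decay implied by a target EMA timescale.

    tau_epochs:   desired EMA timescale in passes over the data
                  (the abstract reports this is roughly constant across scales)
    lr:           AdamW learning rate (under muP, lr shrinks as width grows)
    dataset_size: number of training examples
    batch_size:   examples per optimization step
    """
    iters_per_epoch = dataset_size / batch_size
    tau_iters = tau_epochs * iters_per_epoch   # timescale in iterations
    return 1.0 / (lr * tau_iters)              # weight decay from tau = 1/(lr*wd)

# Doubling the dataset halves the implied weight decay (same tau in epochs);
# halving the learning rate (e.g. muP width scaling) doubles it.
wd_small = weight_decay_from_timescale(tau_epochs=5, lr=1e-3,
                                       dataset_size=50_000, batch_size=128)
wd_large = weight_decay_from_timescale(tau_epochs=5, lr=1e-3,
                                       dataset_size=100_000, batch_size=128)
print(wd_small, wd_large)  # wd_large == wd_small / 2
```
Under these assumptions, holding the timescale fixed in epochs reproduces the scaling rules stated in the abstract: the optimal weight decay falls as the dataset grows and rises as the learning rate is reduced for wider models under muP.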
Related papers
- Efficient Training with Denoised Neural Weights [65.14892033932895]
This work takes a novel step towards building a weight generator to synthesize the neural weights for initialization.
We use the image-to-image translation task with generative adversarial networks (GANs) as an example due to the ease of collecting model weights.
By initializing the image translation model with the denoised weights predicted by our diffusion model, the training requires only 43.3 seconds.
arXiv Detail & Related papers (2024-07-16T17:59:42Z) - Weights Augmentation: it has never ever ever ever let her model down [1.5020330976600735]
This article proposes the concept of weight augmentation, focusing on weight exploration.
The Weight Augmentation Strategy (WAS) trains networks with randomly transformed weight coefficients, termed Shadow Weights (SW), which are used to compute the loss function.
Our experimental results show that convolutional neural networks such as VGG-16, ResNet-18, ResNet-34, GoogLeNet, MobileNetV2, and EfficientNet-Lite can benefit substantially at little or no cost.
arXiv Detail & Related papers (2024-05-30T00:57:06Z) - A Tale of Tails: Model Collapse as a Change of Scaling Laws [11.6055501181235]
We ask: How will the scaling laws change in the inevitable regime where synthetic data makes its way into the training corpus?
We develop a theoretical framework of model collapse through the lens of scaling laws.
We discover a wide range of decay phenomena, analyzing loss of scaling, shifted scaling with number of generations, the "un-learning" of skills, and grokking when mixing human and synthesized data.
arXiv Detail & Related papers (2024-02-10T21:06:34Z) - A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z) - Why Do We Need Weight Decay in Modern Deep Learning? [24.81634291051533]
Weight decay is a widely used technique for training state-of-the-art deep networks, from image classification to large language models.
In this work, we highlight that the role of weight decay in modern deep learning is different from its regularization effect studied in classical learning theory.
For deep networks on vision tasks trained with multipass SGD, we show how weight decay modifies the optimization dynamics, enhancing the implicit regularization of SGD.
arXiv Detail & Related papers (2023-10-06T17:58:21Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale [16.97880876259831]
Amos is a gradient-based optimizer for training deep neural networks.
It can be viewed as an Adam with theoretically supported, adaptive learning-rate decay and weight decay.
arXiv Detail & Related papers (2022-10-21T02:37:58Z) - Robust Training of Neural Networks using Scale Invariant Architectures [70.67803417918854]
In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks.
We show that this general approach is robust to rescaling of parameters and loss.
We design a scale invariant version of BERT, called SIBERT, which when trained simply by vanilla SGD achieves performance comparable to BERT trained by adaptive methods like Adam.
arXiv Detail & Related papers (2022-02-02T11:58:56Z) - AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance.
arXiv Detail & Related papers (2020-06-15T08:35:15Z) - Anatomy-aware 3D Human Pose Estimation with Bone-based Pose Decomposition [92.99291528676021]
Instead of directly regressing the 3D joint locations, we decompose the task into bone direction prediction and bone length prediction.
Our motivation is the fact that the bone lengths of a human skeleton remain consistent across time.
Our full model outperforms the previous best results on Human3.6M and MPI-INF-3DHP datasets.
arXiv Detail & Related papers (2020-02-24T15:49:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.