Why Do We Need Weight Decay in Modern Deep Learning?
- URL: http://arxiv.org/abs/2310.04415v1
- Date: Fri, 6 Oct 2023 17:58:21 GMT
- Title: Why Do We Need Weight Decay in Modern Deep Learning?
- Authors: Maksym Andriushchenko and Francesco D'Angelo and Aditya Varre and
Nicolas Flammarion
- Abstract summary: Weight decay is a technique for training state-of-the-art deep networks, including large language models.
In this work, we highlight that the role of weight decay in modern deep learning is different from its regularization effect studied in classical learning theory.
We show how weight decay modifies the optimization dynamics, enhancing the ever-present implicit regularization of SGD.
- Score: 27.110071835818808
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weight decay is a broadly used technique for training state-of-the-art deep
networks, including large language models. Despite its widespread usage, its
role remains poorly understood. In this work, we highlight that the role of
weight decay in modern deep learning is different from its regularization
effect studied in classical learning theory. For overparameterized deep
networks, we show how weight decay modifies the optimization dynamics,
enhancing the ever-present implicit regularization of SGD via the loss
stabilization mechanism. In contrast, for underparameterized large language
models trained with nearly online SGD, we describe how weight decay balances
the bias-variance tradeoff in stochastic optimization, leading to lower
training loss. Moreover, we show that weight decay also prevents sudden loss
divergences in bfloat16 mixed-precision training, which is a crucial tool for
LLM training. Overall, we
present a unifying perspective from ResNets on vision tasks to LLMs: weight
decay is never useful as an explicit regularizer but instead changes the
training dynamics in a desirable way. Our code is available at
https://github.com/tml-epfl/why-weight-decay.
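The abstract's framing, that weight decay changes the training dynamics rather than acting as an explicit regularizer, can be sketched as a modification of the SGD update rule. The toy linear model, synthetic data, and hyperparameters below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Minimal sketch: weight decay entering SGD as a direct multiplicative
# shrinkage of the weights at each step, alongside the gradient update.
# Model, data, and hyperparameters here are illustrative choices.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # synthetic inputs
y = X @ rng.normal(size=5)             # synthetic targets from a linear model
w = np.zeros(5)                        # parameters to train

lr, wd = 0.1, 0.01                     # learning rate, weight-decay coefficient
loss_init = np.mean((X @ w - y) ** 2)

for step in range(200):
    idx = rng.integers(0, 100, size=10)                     # SGD mini-batch
    grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # squared-error grad
    # Decoupled weight decay: shrink the weights directly rather than adding
    # an L2 penalty to the loss (the two coincide only for plain SGD).
    w = (1 - lr * wd) * w - lr * grad

loss_final = np.mean((X @ w - y) ** 2)
```

For plain SGD this shrinkage is equivalent to an L2 penalty on the loss, but for adaptive or normalized optimizers the two differ, which is one reason the decay term is best read as a change to the dynamics rather than to the objective.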
Related papers
- Why do Learning Rates Transfer? Reconciling Optimization and Scaling
Limits for Deep Learning [77.82908213345864]
We find empirical evidence that learning rate transfer can be attributed to the fact that under $\mu$P and its depth extension, the largest eigenvalue of the training loss Hessian is largely independent of the width and depth of the network.
We show that under the neural tangent kernel (NTK) regime, the sharpness exhibits very different dynamics at different scales, thus preventing learning rate transfer.
arXiv Detail & Related papers (2024-02-27T12:28:01Z)
- FedNAR: Federated Optimization with Normalized Annealing Regularization [54.42032094044368]
We explore the choices of weight decay and identify that weight decay value appreciably influences the convergence of existing FL algorithms.
We develop Federated optimization with Normalized Annealing Regularization (FedNAR), a plug-in that can be seamlessly integrated into any existing FL algorithms.
arXiv Detail & Related papers (2023-10-04T21:11:40Z)
- Weight Compander: A Simple Weight Reparameterization for Regularization [5.744133015573047]
We introduce weight compander, a novel effective method to improve generalization of deep neural networks.
We show experimentally that using weight compander in addition to standard regularization methods improves the performance of neural networks.
arXiv Detail & Related papers (2023-06-29T14:52:04Z)
- Long-Tailed Recognition via Weight Balancing [66.03068252811993]
Naive training produces models that are biased toward common classes, attaining higher accuracy on them than on rare ones.
We investigate three techniques to balance weights: L2-normalization, weight decay, and MaxNorm.
Our approach achieves the state-of-the-art accuracy on five standard benchmarks.
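Of the three weight-balancing techniques this entry lists, MaxNorm is the easiest to sketch: after each update, per-class classifier vectors are projected back inside a norm ball. The radius, shapes, and values below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def maxnorm_project(W, max_norm=1.0):
    """Clip the L2 norm of each row (one classifier vector per class).

    Rows already inside the ball are left untouched; rows outside are
    rescaled onto its surface. Illustrative sketch, not the paper's code.
    """
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[3.0, 4.0],    # norm 5.0 -> projected down to norm 1.0
              [0.3, 0.4]])   # norm 0.5 -> already inside, unchanged
W_proj = maxnorm_project(W)
```

Capping per-class norms this way keeps classifiers for common classes from dominating rare ones in norm, which is the imbalance the summary's "weight balancing" refers to.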
arXiv Detail & Related papers (2022-03-27T03:26:31Z)
- FixNorm: Dissecting Weight Decay for Training Deep Neural Networks [7.820667552233989]
We propose a new training method called FixNorm, which discards weight decay and directly controls the two mechanisms.
On the ImageNet classification task, training EfficientNet-B0 with FixNorm achieves 77.7% top-1 accuracy, which outperforms the original baseline by a clear margin.
arXiv Detail & Related papers (2021-03-29T05:41:56Z)
- The Implicit Biases of Stochastic Gradient Descent on Deep Neural Networks with Batch Normalization [44.30960913470372]
Deep neural networks with batch normalization (BN-DNNs) are invariant to weight rescaling due to their normalization operations.
We investigate the implicit biases of stochastic gradient descent (SGD) on BN-DNNs to provide a theoretical explanation for the efficacy of weight decay.
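The rescaling invariance this entry refers to is easy to demonstrate concretely: batch normalization removes the scale of the incoming weights from the layer output, so shrinking the weights cannot change the function, only the effective learning rate. The shapes and the affine-free normalization below are illustrative assumptions.

```python
import numpy as np

def batchnorm(z, eps=1e-5):
    """Normalize each feature over the batch (no learned affine parameters)."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(1)
X = rng.normal(size=(32, 8))   # a batch of 32 inputs with 8 features
W = rng.normal(size=(8, 4))    # pre-normalization layer weights

out = batchnorm(X @ W)
out_scaled = batchnorm(X @ (10.0 * W))  # rescaled weights, same output
```

Since `out` and `out_scaled` coincide (up to the `eps` stabilizer), any effect weight decay has on such a layer must come through the optimization dynamics rather than through the function it represents.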
arXiv Detail & Related papers (2021-02-06T03:40:20Z)
- On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective [96.97587309301719]
We present the first practical scheduler for weight decay, called the Scheduled Weight Decay (SWD) method.
Our experiments also support that SWD indeed mitigates large gradient norms and often significantly outperforms the conventional constant weight decay strategy for Adaptive Moment Estimation (Adam).
arXiv Detail & Related papers (2020-11-23T00:39:49Z)
- Neural networks with late-phase weights [66.72777753269658]
We show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning.
At the end of learning, we obtain back a single model by taking a spatial average in weight space.
arXiv Detail & Related papers (2020-07-25T13:23:37Z)
- Overfitting in adversarially robust deep learning [86.11788847990783]
We show that, in adversarially robust training, overfitting to the training set in fact harms robust test performance to a very large degree.
We also show that effects such as the double descent curve do still occur in adversarially trained models, yet fail to explain the observed overfitting.
arXiv Detail & Related papers (2020-02-26T15:40:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.