On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A
Gradient-Norm Perspective
- URL: http://arxiv.org/abs/2011.11152v5
- Date: Fri, 20 Oct 2023 03:03:39 GMT
- Title: On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A
Gradient-Norm Perspective
- Authors: Zeke Xie, Zhiqiang Xu, Jingzhao Zhang, Issei Sato, Masashi Sugiyama
- Abstract summary: We present the first practical scheduler for weight decay, called the Scheduled Weight Decay (SWD) method.
Our experiments also support that SWD indeed mitigates large gradient norms and often significantly outperforms the conventional constant weight decay strategy for Adaptive Moment Estimation (Adam)
- Score: 96.97587309301719
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weight decay is a simple yet powerful regularization technique that has been
widely used in training deep neural networks (DNNs). While weight decay has
attracted much attention, previous studies failed to notice an overlooked
pitfall: the large gradient norms caused by weight decay. In this paper, we
discover that weight decay can unfortunately lead to large gradient norms at
the final phase (or the terminated solution) of training, which often indicates
bad convergence and poor generalization. To mitigate these gradient-norm-centered
pitfalls, we present the first practical scheduler for weight decay, called the
Scheduled Weight Decay (SWD) method, which dynamically adjusts the weight
decay strength according to the gradient norm and significantly penalizes large
gradient norms during training. Our experiments also support that SWD indeed
mitigates large gradient norms and often significantly outperforms the
conventional constant weight decay strategy for Adaptive Moment Estimation
(Adam).
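The scheduling idea described in the abstract can be sketched in a few lines. The sketch below modulates the weight-decay coefficient by the current gradient norm relative to a reference norm, so large gradient norms are penalized more strongly; the function name, the `ref_norm` parameter, and the exact scaling rule are illustrative assumptions, not the paper's precise SWD schedule.

```python
import numpy as np

def swd_step(w, grad, lr=0.1, base_wd=5e-4, ref_norm=1.0):
    """One plain-SGD step with gradient-norm-scheduled weight decay.

    The decay coefficient is scaled by ||grad|| / ref_norm, so decay
    strengthens when the gradient norm is large. This sketches the
    scheduling idea only, not the paper's exact SWD rule for Adam.
    """
    gnorm = np.linalg.norm(grad)
    wd_t = base_wd * gnorm / max(ref_norm, 1e-12)  # scheduled coefficient
    return w - lr * (grad + wd_t * w)
```

In this toy form the decay term shrinks the weights in addition to the gradient step, and does so more aggressively exactly when gradients are large.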
Related papers
- Why Do We Need Weight Decay in Modern Deep Learning? [27.110071835818808]
Weight decay is a broadly used technique for training state-of-the-art deep networks, including large language models.
In this work, we highlight that the role of weight decay in modern deep learning is different from its regularization effect studied in classical learning theory.
We show how weight decay modifies the optimization dynamics, enhancing the ever-present implicit regularization of SGD.
arXiv Detail & Related papers (2023-10-06T17:58:21Z)
- FedNAR: Federated Optimization with Normalized Annealing Regularization [54.42032094044368]
We explore the choices of weight decay and identify that the weight decay value appreciably influences the convergence of existing FL algorithms.
We develop Federated optimization with Normalized Annealing Regularization (FedNAR), a plug-in that can be seamlessly integrated into any existing FL algorithms.
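Reading "normalized annealing regularization" as joint control of the gradient and the weight-decay term, a hypothetical local update might look as follows; the function name, parameters, and clipping rule are assumptions based only on the summary above, not FedNAR's published algorithm.

```python
import numpy as np

def fednar_update(w, grad, lr=0.1, wd=1e-3, max_norm=1.0):
    """One local step sketching the normalized-annealing idea:
    the gradient and the weight-decay term are clipped *jointly*,
    so the effective weight decay shrinks along with large updates.
    `max_norm` would be annealed over communication rounds.
    Hypothetical sketch, not FedNAR's exact algorithm.
    """
    d = grad + wd * w                            # combined update direction
    n = np.linalg.norm(d)
    d = d * min(1.0, max_norm / max(n, 1e-12))   # joint norm clipping
    return w - lr * d
```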
arXiv Detail & Related papers (2023-10-04T21:11:40Z)
- PathProx: A Proximal Gradient Algorithm for Weight Decay Regularized Deep Neural Networks [25.114642281756495]
Weight decay is one of the most widely used forms of regularization in deep learning.
This paper argues that gradient descent may be an inefficient algorithm for this objective.
For neural networks with ReLU activations, solutions to the weight decay objective are equivalent to those of a different objective.
arXiv Detail & Related papers (2022-10-06T17:22:40Z)
- Characterizing the Implicit Bias of Regularized SGD in Rank Minimization [9.607159748020601]
We show that training neural networks with mini-batch SGD causes a bias towards rank minimization over the weight matrices.
Specifically, we show that this bias is more pronounced when using smaller batch sizes, higher learning rates, or increased weight decay.
We empirically investigate the connection between this bias and generalization, finding that it has a marginal effect on generalization.
arXiv Detail & Related papers (2022-06-12T17:06:35Z)
- Long-Tailed Recognition via Weight Balancing [66.03068252811993]
Naive training produces models that are biased toward common classes, attaining higher accuracy on them.
We investigate three techniques to balance weights: L2-normalization, weight decay, and MaxNorm.
Our approach achieves state-of-the-art accuracy on five standard benchmarks.
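Of the three techniques listed, the MaxNorm constraint is straightforward to sketch: after each update, any per-class weight vector whose norm exceeds a radius is projected back onto the ball. The sketch below assumes the rows of `W` are per-class classifier weights; the radius and layout are illustrative choices, not the paper's settings.

```python
import numpy as np

def maxnorm_project(W, radius=1.0):
    """Project each row (per-class weight vector) of W onto the
    L2 ball of the given radius (a MaxNorm constraint).
    Rows with norm above `radius` are rescaled; others are untouched.
    """
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, radius / np.maximum(norms, 1e-12))
    return W * scale
```

Applying this projection after every optimizer step keeps common-class weight vectors from growing much larger than rare-class ones.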
arXiv Detail & Related papers (2022-03-27T03:26:31Z)
- The Implicit Biases of Stochastic Gradient Descent on Deep Neural Networks with Batch Normalization [44.30960913470372]
Deep neural networks with batch normalization (BN-DNNs) are invariant to weight rescaling due to their normalization operations.
We investigate the implicit biases of stochastic gradient descent (SGD) on BN-DNNs to provide a theoretical explanation for the efficacy of weight decay.
arXiv Detail & Related papers (2021-02-06T03:40:20Z)
- Explicit regularization and implicit bias in deep network classifiers trained with the square loss [2.8935588665357077]
Deep ReLU networks trained with the square loss have been observed to perform well in classification tasks.
We show that convergence to a solution with the absolute minimum norm is expected when normalization techniques are used together with Weight Decay.
arXiv Detail & Related papers (2020-12-31T21:07:56Z)
- Improve Generalization and Robustness of Neural Networks via Weight Scale Shifting Invariant Regularizations [52.493315075385325]
We show that a family of regularizers, including weight decay, is ineffective at penalizing the intrinsic norms of weights for networks with homogeneous activation functions.
We propose an improved regularizer that is invariant to weight scale shifting and thus effectively constrains the intrinsic norm of a neural network.
arXiv Detail & Related papers (2020-08-07T02:55:28Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
- On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.