PathProx: A Proximal Gradient Algorithm for Weight Decay Regularized
Deep Neural Networks
- URL: http://arxiv.org/abs/2210.03069v4
- Date: Wed, 5 Jul 2023 19:15:34 GMT
- Title: PathProx: A Proximal Gradient Algorithm for Weight Decay Regularized
Deep Neural Networks
- Authors: Liu Yang, Jifan Zhang, Joseph Shenouda, Dimitris Papailiopoulos,
Kangwook Lee, Robert D. Nowak
- Abstract summary: Weight decay is one of the most widely used forms of regularization in deep learning.
This paper argues that stochastic gradient descent (SGD) may be an inefficient algorithm for this objective.
For neural networks with ReLU activations, solutions to the weight decay objective are equivalent to those of an objective whose regularizer is a sum of products of the $\ell_2$ norms of each neuron's input and output weights.
- Score: 25.114642281756495
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weight decay is one of the most widely used forms of regularization in deep
learning, and has been shown to improve generalization and robustness. The
optimization objective driving weight decay is a sum of losses plus a term
proportional to the sum of squared weights. This paper argues that stochastic
gradient descent (SGD) may be an inefficient algorithm for this objective. For
neural networks with ReLU activations, solutions to the weight decay objective
are equivalent to those of a different objective in which the regularization
term is instead a sum of products of $\ell_2$ (not squared) norms of the input
and output weights associated with each ReLU neuron. This alternative (and
effectively equivalent) regularization suggests a novel proximal gradient
algorithm for network training. Theory and experiments support the new training
approach, showing that it can converge much faster to the sparse solutions it
shares with standard weight decay training.
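The equivalence rests on the rescaling invariance of ReLU: $\|u\|\|v\| \le \tfrac{1}{2}(\|u\|^2 + \|v\|^2)$ by AM-GM, with equality after rescaling $(u, v) \mapsto (cu, v/c)$, which leaves the neuron's function unchanged. The product-of-norms form then suggests alternating a gradient step on the data-fit loss with the per-neuron proximal operator of $\lambda\|u\|_2\|v\|_2$. The NumPy sketch below illustrates one such step for a one-hidden-layer ReLU network with squared loss; it is a reading of the abstract rather than the authors' released code, the helper names are made up for illustration, and the per-neuron prox is obtained by a simple two-dimensional reduction that assumes the step size times $\lambda$ is less than one.

```python
import numpy as np

def prox_product_of_norms(u, v, tau):
    """Proximal operator of tau * ||u||_2 * ||v||_2 for a single neuron.

    Solves min 0.5*||u'-u||^2 + 0.5*||v'-v||^2 + tau*||u'||*||v'||
    by reducing to a 2-D problem in the norms (assumes tau < 1).
    """
    a, b = np.linalg.norm(u), np.linalg.norm(v)
    if a <= tau * b:                        # input weights shrink to zero: neuron pruned
        return np.zeros_like(u), v
    if b <= tau * a:                        # output weights shrink to zero: neuron pruned
        return u, np.zeros_like(v)
    s = (a - tau * b) / (1.0 - tau ** 2)    # shrunken norm for u
    t = (b - tau * a) / (1.0 - tau ** 2)    # shrunken norm for v
    return (s / a) * u, (t / b) * v

def prox_grad_step(W1, w2, X, y, lr, lam):
    """One proximal gradient step for a one-hidden-layer ReLU network with
    squared loss and the regularizer lam * sum_k ||W1[k]||_2 * |w2[k]|."""
    H = np.maximum(X @ W1.T, 0.0)                                 # hidden activations
    r = H @ w2 - y                                                # residuals
    G1 = ((r[:, None] * w2[None, :]) * (H > 0)).T @ X / len(y)    # grad of loss w.r.t. W1
    g2 = H.T @ r / len(y)                                         # grad of loss w.r.t. w2
    W1, w2 = W1 - lr * G1, w2 - lr * g2                           # gradient step on the loss
    for k in range(W1.shape[0]):                                  # neuron-wise proximal step
        W1[k], vk = prox_product_of_norms(W1[k], w2[k:k + 1], lr * lam)
        w2[k] = vk[0]
    return W1, w2
```

The thresholding branch of the prox is what can zero out a neuron's weights in a single step, which is one way to see how such an update reaches the sparse solutions mentioned in the abstract faster than plain SGD on the squared-norm penalty.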
Related papers
- Optimization and Generalization Guarantees for Weight Normalization [19.965963460750206]
We provide the first theoretical characterizations of both optimization and generalization of deep WeightNorm models.
We present experimental results which illustrate how the normalization terms and other quantities of theoretical interest relate to the training of WeightNorm networks.
arXiv Detail & Related papers (2024-09-13T15:55:05Z) - Decoupled Weight Decay for Any $p$ Norm [1.1510009152620668]
We consider a simple yet effective approach to sparsification, based on Bridge ($L_p$) regularization during training.
We introduce a novel weight decay scheme, which generalizes the standard $L_2$ weight decay to any $p$ norm (see the sketch after this list).
We empirically demonstrate that it leads to highly sparse networks, while maintaining performance comparable to standard $L_2$ regularization.
arXiv Detail & Related papers (2024-04-16T18:02:15Z) - FedNAR: Federated Optimization with Normalized Annealing Regularization [54.42032094044368]
We explore the choices of weight decay and identify that the weight decay value appreciably influences the convergence of existing FL algorithms.
We develop Federated optimization with Normalized Annealing Regularization (FedNAR), a plug-in that can be seamlessly integrated into any existing FL algorithm.
arXiv Detail & Related papers (2023-10-04T21:11:40Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Understanding the Generalization of Adam in Learning Neural Networks
with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z) - The Implicit Biases of Stochastic Gradient Descent on Deep Neural
Networks with Batch Normalization [44.30960913470372]
Deep neural networks with batch normalization (BN-DNNs) are invariant to weight rescaling due to their normalization operations.
We investigate the implicit biases of stochastic gradient descent (SGD) on BN-DNNs to provide a theoretical explanation for the efficacy of weight decay.
arXiv Detail & Related papers (2021-02-06T03:40:20Z) - On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective [90.39123717733334]
We present the first practical scheduler for weight decay, called the Scheduled Weight Decay (SWD) method.
Our experiments also support that SWD indeed mitigates large gradient norms and often significantly outperforms the conventional constant weight decay strategy for Adaptive Moment Estimation (Adam).
arXiv Detail & Related papers (2020-11-23T00:39:49Z) - Neural networks with late-phase weights [66.72777753269658]
We show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning.
At the end of learning, we obtain a single model by taking a spatial average in weight space.
arXiv Detail & Related papers (2020-07-25T13:23:37Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
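The "Decoupled Weight Decay for Any $p$ Norm" entry above generalizes the decay term beyond $p = 2$ and applies it separately from the loss gradient, in the spirit of AdamW. The sketch referenced in that entry follows; the update rule, the plain-SGD base optimizer, and the numerical guard near zero are assumptions made for illustration, not the paper's exact scheme.

```python
import numpy as np

def sgd_step_decoupled_lp(w, grad, lr, lam, p, eps=1e-12):
    """One SGD step with decoupled L_p weight decay (illustrative sketch).

    The loss gradient and the decay are applied separately; the decay
    direction is the (sub)gradient of sum_i |w_i|**p.
    """
    w = w - lr * grad                        # step on the loss gradient only
    aw = np.maximum(np.abs(w), eps)          # guard: |w|**(p-1) blows up at 0 when p < 1
    decay = p * np.sign(w) * aw ** (p - 1)   # derivative of |w_i|**p
    return w - lr * lam * decay
```

For $p = 2$ this reduces (up to a constant factor) to standard decoupled weight decay, while $p < 1$ pushes small weights toward zero more aggressively, which is consistent with the sparsity that entry reports.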