SGD and Weight Decay Secretly Minimize the Rank of Your Neural Network
- URL: http://arxiv.org/abs/2206.05794v7
- Date: Fri, 18 Oct 2024 21:32:39 GMT
- Title: SGD and Weight Decay Secretly Minimize the Rank of Your Neural Network
- Authors: Tomer Galanti, Zachary S. Siegel, Aparna Gupte, Tomaso Poggio,
- Abstract summary: Training with mini-batch SGD and weight decay induces a bias toward rank minimization in weight matrices.
We show that this bias becomes more pronounced with smaller batch sizes, higher learning rates, or stronger weight decay.
We empirically explore the connection between this bias and generalization, finding that it has a marginal effect on the test performance.
- Score: 8.79431718760617
- License:
- Abstract: We investigate the inherent bias of Stochastic Gradient Descent (SGD) toward learning low-rank weight matrices during the training of deep neural networks. Our results demonstrate that training with mini-batch SGD and weight decay induces a bias toward rank minimization in the weight matrices. Specifically, we show both theoretically and empirically that this bias becomes more pronounced with smaller batch sizes, higher learning rates, or stronger weight decay. Additionally, we predict and empirically confirm that weight decay is essential for this bias to occur. Unlike previous literature, our analysis does not rely on assumptions about the data, convergence, or optimality of the weight matrices, making it applicable to a wide range of neural network architectures of any width or depth. Finally, we empirically explore the connection between this bias and generalization, finding that it has a marginal effect on the test performance.
Related papers
- Towards Better Generalization: Weight Decay Induces Low-rank Bias for Neural Networks [9.948870430491738]
We study the implicit bias towards low-rank weight matrices when training neural networks with Weight Decay (WD)
Our work offers both theoretical and empirical insights into the strong generalization performance of SGD when combined with WD.
arXiv Detail & Related papers (2024-10-03T03:36:18Z) - Neural Rank Collapse: Weight Decay and Small Within-Class Variability
Yield Low-Rank Bias [4.829265670567825]
We show the presence of an intriguing neural rank collapse phenomenon, connecting the low-rank bias of trained networks with networks' neural collapse properties.
As the weight decay parameter grows, the rank of each layer in the network decreases proportionally to the within-class variability of the hidden-space embeddings of the previous layers.
arXiv Detail & Related papers (2024-02-06T13:44:39Z) - Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks [33.88586668321127]
This study investigates how weight decay affects the update behavior of individual neurons in deep neural networks.
We show that explicitly controlling the rotation provides the benefits of weight decay while substantially reducing the need for learning rate warmup.
arXiv Detail & Related papers (2023-05-26T19:14:01Z) - Long-Tailed Recognition via Weight Balancing [66.03068252811993]
Naive training produces models that are biased toward common classes in terms of higher accuracy.
We investigate three techniques to balance weights, L2-normalization, weight decay, and MaxNorm.
Our approach achieves the state-of-the-art accuracy on five standard benchmarks.
arXiv Detail & Related papers (2022-03-27T03:26:31Z) - Distribution of Classification Margins: Are All Data Equal? [61.16681488656473]
We motivate theoretically and show empirically that the area under the curve of the margin distribution on the training set is in fact a good measure of generalization.
The resulting subset of "high capacity" features is not consistent across different training runs.
arXiv Detail & Related papers (2021-07-21T16:41:57Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - FixNorm: Dissecting Weight Decay for Training Deep Neural Networks [7.820667552233989]
We propose a new training method called FixNorm, which discards weight decay and directly controls the two mechanisms.
On ImageNet classification task, training EfficientNet-B0 with FixNorm achieves 77.7%, which outperforms the original baseline by a clear margin.
arXiv Detail & Related papers (2021-03-29T05:41:56Z) - Neural networks with late-phase weights [66.72777753269658]
We show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning.
At the end of learning, we obtain back a single model by taking a spatial average in weight space.
arXiv Detail & Related papers (2020-07-25T13:23:37Z) - Spherical Motion Dynamics: Learning Dynamics of Neural Network with
Normalization, Weight Decay, and SGD [105.99301967452334]
We show the learning dynamics of neural network with normalization, weight decay (WD), and SGD (with momentum) named as Spherical Motion Dynamics (SMD)
We verify our assumptions and theoretical results on various computer vision tasks including ImageNet and MSCOCO with standard settings.
arXiv Detail & Related papers (2020-06-15T14:16:33Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.