Characterizing the Implicit Bias of Regularized SGD in Rank Minimization
- URL: http://arxiv.org/abs/2206.05794v6
- Date: Thu, 26 Oct 2023 03:20:30 GMT
- Title: Characterizing the Implicit Bias of Regularized SGD in Rank Minimization
- Authors: Tomer Galanti, Zachary S. Siegel, Aparna Gupte, Tomaso Poggio
- Abstract summary: We show that training neural networks with mini-batch SGD causes a bias towards rank minimization over the weight matrices.
Specifically, we show that this bias is more pronounced when using smaller batch sizes, higher learning rates, or increased weight decay.
We empirically investigate the connection between this bias and generalization, finding that it has a marginal effect on generalization.
- Score: 9.607159748020601
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the bias of Stochastic Gradient Descent (SGD) to learn low-rank
weight matrices when training deep neural networks. Our results show that
training neural networks with mini-batch SGD and weight decay causes a bias
towards rank minimization over the weight matrices. Specifically, we show, both
theoretically and empirically, that this bias is more pronounced when using
smaller batch sizes, higher learning rates, or increased weight decay.
Additionally, we predict and observe empirically that weight decay is necessary
to achieve this bias. Unlike previous literature, our analysis does not rely on
assumptions about the data, convergence, or optimality of the weight matrices
and applies to a wide range of neural network architectures of any width or
depth. Finally, we empirically investigate the connection between this bias and
generalization, finding that it has a marginal effect on generalization.
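To make the quantity at stake concrete, the hypothetical sketch below trains a small multilayer perceptron with mini-batch SGD and weight decay and tracks an effective rank of each weight matrix, counted as the number of singular values above a small fraction of the largest one. The architecture, synthetic data, threshold, and hyperparameters are illustrative assumptions, not the authors' experimental setup.

```python
# Hypothetical sketch (not the paper's code): track an "effective rank" of each
# weight matrix while training with mini-batch SGD + weight decay.
import torch
import torch.nn as nn

def effective_rank(W: torch.Tensor, tol: float = 1e-2) -> int:
    """Number of singular values larger than tol * (largest singular value)."""
    s = torch.linalg.svdvals(W.detach())
    return int((s > tol * s[0]).sum().item())

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(),
                      nn.Linear(256, 256), nn.ReLU(),
                      nn.Linear(256, 10))

# Synthetic data stands in for a real dataset.
X = torch.randn(2048, 64)
y = torch.randint(0, 10, (2048,))

# The abstract predicts a stronger low-rank bias with smaller batches, larger
# learning rates, and larger weight decay; these particular values are arbitrary.
batch_size, lr, weight_decay = 16, 0.1, 5e-4
opt = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(20):
    perm = torch.randperm(X.size(0))
    for i in range(0, X.size(0), batch_size):
        idx = perm[i:i + batch_size]
        opt.zero_grad()
        loss_fn(model(X[idx]), y[idx]).backward()
        opt.step()
    ranks = [effective_rank(m.weight) for m in model if isinstance(m, nn.Linear)]
    print(f"epoch {epoch:2d}  effective ranks per layer: {ranks}")
```

Re-running this sketch with weight_decay=0, a larger batch size, or a smaller learning rate, and comparing the resulting ranks, mirrors the comparisons suggested by the abstract's predictions.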
Related papers
- Towards Better Generalization: Weight Decay Induces Low-rank Bias for Neural Networks [9.948870430491738]
We study the implicit bias towards low-rank weight matrices when training neural networks with Weight Decay (WD).
Our work offers both theoretical and empirical insights into the strong generalization performance of SGD when combined with WD.
arXiv Detail & Related papers (2024-10-03T03:36:18Z) - From Low Rank Gradient Subspace Stabilization to Low-Rank Weights: Observations, Theories, and Applications [85.17672240603011]
We study the non-uniform low-rank properties of weight matrices in Large Language Models.
We present Weight Low-Rank Projection (WeLore), which unifies weight compression and memory-efficient fine-tuning into one approach.
arXiv Detail & Related papers (2024-07-15T21:05:20Z) - Neural Rank Collapse: Weight Decay and Small Within-Class Variability Yield Low-Rank Bias [4.829265670567825]
We show the presence of an intriguing neural rank collapse phenomenon, connecting the low-rank bias of trained networks with networks' neural collapse properties.
As the weight decay parameter grows, the rank of each layer in the network decreases proportionally to the within-class variability of the hidden-space embeddings of the previous layers.
arXiv Detail & Related papers (2024-02-06T13:44:39Z) - Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks [33.88586668321127]
This study investigates how weight decay affects the update behavior of individual neurons in deep neural networks.
We show that explicitly controlling the rotation provides the benefits of weight decay while substantially reducing the need for learning rate warmup.
arXiv Detail & Related papers (2023-05-26T19:14:01Z) - Long-Tailed Recognition via Weight Balancing [66.03068252811993]
Naive training produces models that are biased toward common classes, achieving higher accuracy on them than on rare classes.
We investigate three techniques for balancing weights: L2-normalization, weight decay, and MaxNorm (a minimal sketch of these mechanisms appears after this list).
Our approach achieves state-of-the-art accuracy on five standard benchmarks.
arXiv Detail & Related papers (2022-03-27T03:26:31Z) - Distribution of Classification Margins: Are All Data Equal? [61.16681488656473]
We motivate theoretically and show empirically that the area under the curve of the margin distribution on the training set is in fact a good measure of generalization.
The resulting subset of "high capacity" features is not consistent across different training runs.
arXiv Detail & Related papers (2021-07-21T16:41:57Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - FixNorm: Dissecting Weight Decay for Training Deep Neural Networks [7.820667552233989]
We propose a new training method called FixNorm, which discards weight decay and directly controls the two underlying mechanisms of weight decay.
On the ImageNet classification task, training EfficientNet-B0 with FixNorm achieves 77.7% top-1 accuracy, which outperforms the original baseline by a clear margin.
arXiv Detail & Related papers (2021-03-29T05:41:56Z) - Neural networks with late-phase weights [66.72777753269658]
We show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning.
At the end of learning, we recover a single model by taking a spatial average in weight space.
arXiv Detail & Related papers (2020-07-25T13:23:37Z) - Spherical Motion Dynamics: Learning Dynamics of Neural Network with Normalization, Weight Decay, and SGD [105.99301967452334]
We characterize the learning dynamics of neural networks trained with normalization, weight decay (WD), and SGD (with momentum), which we name Spherical Motion Dynamics (SMD).
We verify our assumptions and theoretical results on various computer vision tasks including ImageNet and MSCOCO with standard settings.
arXiv Detail & Related papers (2020-06-15T14:16:33Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
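As a companion to the long-tailed recognition entry above, the short sketch below illustrates the three weight-balancing mechanisms it names: weight decay via the optimizer, a MaxNorm projection applied after each optimizer step, and L2-normalization of the per-class classifier weights at scoring time. The layer sizes, projection radius, and helper names are hypothetical and are not taken from that paper.

```python
# Hedged illustration (assumed details, not the paper's implementation) of the
# three weight-balancing mechanisms: weight decay, MaxNorm, and L2-normalization.
import torch
import torch.nn as nn

classifier = nn.Linear(512, 100, bias=False)  # 100-class linear classifier head
opt = torch.optim.SGD(classifier.parameters(), lr=0.01, weight_decay=5e-4)  # weight decay

def maxnorm_(weight: torch.Tensor, radius: float = 1.0) -> None:
    """Project each per-class weight vector (a row) onto the ball of the given radius."""
    with torch.no_grad():
        norms = weight.norm(dim=1, keepdim=True).clamp(min=1e-12)
        weight.mul_(norms.clamp(max=radius) / norms)

# Inside a training loop one would call, after opt.step():
#   maxnorm_(classifier.weight, radius=1.0)

def l2_normalized_logits(features: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Score with unit-norm class vectors so no class dominates via a larger weight norm."""
    w = weight / weight.norm(dim=1, keepdim=True).clamp(min=1e-12)
    return features @ w.t()

# Example: balanced logits for a batch of features.
feats = torch.randn(8, 512)
logits = l2_normalized_logits(feats, classifier.weight)
print(logits.shape)  # torch.Size([8, 100])
```

Note the design difference: the MaxNorm projection only shrinks class vectors whose norm exceeds the radius, whereas L2-normalization equalizes all class-vector norms, which is why the two are treated as separate balancing mechanisms.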