Robust Implicit Regularization via Weight Normalization
- URL: http://arxiv.org/abs/2305.05448v3
- Date: Fri, 23 Feb 2024 07:20:33 GMT
- Title: Robust Implicit Regularization via Weight Normalization
- Authors: Hung-Hsu Chou, Holger Rauhut, Rachel Ward
- Abstract summary: We show that weight normalization enables a robust bias that persists even if the weights are initialized at practically large scale.
Experiments suggest that the gains in both convergence speed and robustness of the implicit bias are improved dramatically by using weight normalization.
- Score: 6.042206709451915
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Overparameterized models may have many interpolating solutions; implicit
regularization refers to the hidden preference of a particular optimization
method towards a certain interpolating solution among the many. By now, an
established line of work has shown that (stochastic) gradient descent tends to
have an implicit bias towards low rank and/or sparse solutions when used to
train deep linear networks, explaining to some extent why overparameterized
neural network models trained by gradient descent tend to have good
generalization performance in practice. However, existing theory for square-loss
objectives often requires very small initialization of the trainable weights,
which is at odds with the larger scale at which weights are initialized in
practice for faster convergence and better generalization performance. In this
paper, we aim to close this gap by incorporating and analyzing gradient flow
(continuous-time version of gradient descent) with weight normalization, where
the weight vector is reparameterized in terms of polar coordinates, and
gradient flow is applied to the polar coordinates. By analyzing key invariants
of the gradient flow and using the Łojasiewicz theorem, we show that weight
normalization also has an implicit bias towards sparse solutions in the
diagonal linear model, but that in contrast to plain gradient flow, weight
normalization enables a robust bias that persists even if the weights are
initialized at practically large scale. Experiments suggest that the gains in
both convergence speed and robustness of the implicit bias are improved
dramatically by using weight normalization in overparameterized diagonal linear
network models.
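As a minimal numerical sketch of the reparameterization described above (standard Salimans-Kingma-style weight normalization on a synthetic least-squares objective, not necessarily the paper's exact polar-coordinate setup; the matrix `A` and vector `b` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares objective L(w) = 0.5 * ||A w - b||^2.
n, d = 10, 30
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Weight normalization: w = g * v / ||v||, with gradient descent
# applied to the scale g and the direction v instead of w directly.
g = 2.0                         # deliberately large initialization scale
v = rng.standard_normal(d)

norm_v = np.linalg.norm(v)
w = g * v / norm_v
grad_w = A.T @ (A @ w - b)                                       # dL/dw
grad_g = grad_w @ v / norm_v                                     # dL/dg
grad_v = (g / norm_v) * (grad_w - (grad_w @ v / norm_v**2) * v)  # dL/dv

# A key invariant of this flow: the v-gradient is orthogonal to v,
# so ||v|| is non-decreasing under gradient descent on (g, v).
print(abs(grad_v @ v))          # ~0 up to floating-point error
```

The orthogonality of `grad_v` to `v` is one of the invariants that analyses of weight normalization exploit: the direction updates live on the sphere, while the scale is carried entirely by `g`.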
Related papers
- Distributed Momentum Methods Under Biased Gradient Estimations [6.046591474843391]
Distributed gradient methods are gaining prominence in solving large-scale machine learning problems that involve data distributed across multiple nodes.
However, obtaining unbiased gradient estimates is challenging in many distributed machine learning applications.
In this paper, we establish non-asymptotic convergence bounds on distributed momentum methods under biased gradient estimation.
arXiv Detail & Related papers (2024-02-29T18:03:03Z) - Implicit regularization in AI meets generalized hardness of
approximation in optimization -- Sharp results for diagonal linear networks [0.0]
We show sharp results for the implicit regularization imposed by the gradient flow of Diagonal Linear Networks.
We link this to the phenomenon of phase transitions in generalized hardness of approximation.
Non-sharpness of our results would imply that the GHA phenomenon would not occur for the basis pursuit optimization problem.
arXiv Detail & Related papers (2023-07-13T13:27:51Z) - The Inductive Bias of Flatness Regularization for Deep Matrix
Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of the minimum trace of the Hessian solutions in deep linear networks.
We show that for all depths greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
arXiv Detail & Related papers (2023-06-22T23:14:57Z) - The Implicit Bias of Batch Normalization in Linear Models and Two-layer
Linear Convolutional Neural Networks [117.93273337740442]
We show that gradient descent converges to a uniform margin classifier on the training data with an $\exp(-\Omega(\log^2 t))$ convergence rate.
We also show that batch normalization has an implicit bias towards a patch-wise uniform margin.
arXiv Detail & Related papers (2023-06-20T16:58:00Z) - Sharper analysis of sparsely activated wide neural networks with
trainable biases [103.85569570164404]
This work studies training one-hidden-layer overparameterized ReLU networks via gradient descent in the neural tangent kernel (NTK) regime.
Surprisingly, it is shown that the network after sparsification can achieve as fast convergence as the original network.
Since the generalization bound has dependence on the smallest eigenvalue of the limiting NTK, this work further studies the least eigenvalue of the limiting NTK.
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the variance of the random initialization is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central to preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
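The setting of that paper can be sketched in a few lines (a hedged toy version: one pass of constant-stepsize SGD with iterate averaging on an overparameterized synthetic regression problem, compared against the minimum-norm least-squares solution; dimensions and stepsize are illustrative choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)

# Overparameterized linear regression (more features than samples).
n, d = 100, 200
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d) / np.sqrt(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

# One pass of constant-stepsize SGD with iterate averaging.
lr = 0.5 / d                    # constant stepsize, stable since ||x_i||^2 ~ d
w = np.zeros(d)
avg = np.zeros(d)
for x_i, y_i in zip(X, y):
    w -= lr * (x_i @ w - y_i) * x_i   # stochastic gradient step on one sample
    avg += w
avg /= n

# Ordinary least squares: here the minimum-norm interpolator, since d > n.
w_ols = np.linalg.pinv(X) @ y

print(np.linalg.norm(avg - w_true), np.linalg.norm(w_ols - w_true))
```

The averaged SGD iterate and the least-squares solution generally disagree, which is the algorithmic-regularization gap the paper studies.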
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - Border Basis Computation with Gradient-Weighted Norm [5.863264019032882]
We propose gradient-weighted normalization for the approximate border basis of vanishing ideals.
With a slight modification, the analysis of algorithms with coefficient normalization still works with gradient-weighted normalization.
arXiv Detail & Related papers (2021-01-02T08:29:51Z) - Inductive Bias of Gradient Descent for Exponentially Weight Normalized
Smooth Homogeneous Neural Nets [1.7259824817932292]
We analyze the inductive bias of gradient descent for weight normalized smooth homogeneous neural nets, when trained on exponential or cross-entropy loss.
This paper shows that the gradient flow path with EWN is equivalent to gradient flow on standard networks with an adaptive learning rate.
arXiv Detail & Related papers (2020-10-24T14:34:56Z) - Implicit Gradient Regularization [18.391141066502644]
Gradient descent can be surprisingly good at optimizing deep neural networks without overfitting and without explicit regularization.
We call this Implicit Gradient Regularization (IGR) and we use backward error analysis to calculate the size of this regularization.
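The backward-error-analysis result can be checked numerically on a quadratic toy loss (a sketch assuming the standard statement that gradient descent with stepsize h tracks, to higher order, the gradient flow of the modified loss L + (h/4)*||grad L||^2; the diagonal Hessian here is purely illustrative):

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T H w with diagonal H, so both the
# exact gradient flow and the flow of the modified loss have closed forms.
lam = np.array([10.0, 1.0])     # eigenvalues of the diagonal Hessian H
w0 = np.array([1.0, 1.0])
h = 0.01                        # gradient-descent stepsize

gd_step = (1.0 - h * lam) * w0                          # one GD step: (I - hH) w0
flow_L = np.exp(-h * lam) * w0                          # time-h flow of L
# Flow of the modified loss L + (h/4)*||grad L||^2, whose gradient
# is (H + (h/2) H^2) w for this quadratic:
flow_mod = np.exp(-h * (lam + 0.5 * h * lam**2)) * w0

err_plain = np.abs(gd_step - flow_L).max()
err_mod = np.abs(gd_step - flow_mod).max()
print(err_plain, err_mod)       # the modified flow tracks GD more closely
```

The discrete GD step agrees with the modified flow to one order higher in h than with the plain gradient flow, which is exactly the sense in which the gradient-norm penalty is an implicit regularizer.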
arXiv Detail & Related papers (2020-09-23T14:17:53Z) - Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.