Deep Linear Networks Dynamics: Low-Rank Biases Induced by Initialization
Scale and L2 Regularization
- URL: http://arxiv.org/abs/2106.15933v1
- Date: Wed, 30 Jun 2021 09:34:05 GMT
- Title: Deep Linear Networks Dynamics: Low-Rank Biases Induced by Initialization
Scale and L2 Regularization
- Authors: Arthur Jacot, François Ged, Franck Gabriel, Berfin Şimşek, Clément Hongler
- Abstract summary: We investigate how the rank of the linear map found by gradient descent is affected by the addition of $L_2$ regularization on the parameters.
We show that adding $L_2$ regularization on the parameters corresponds to adding an $L_p$-Schatten (quasi)norm on the linear map to the cost, with $p=2/L$ for a depth-$L$ network.
We numerically observe that these local minima can generalize better than global ones in some settings.
- Score: 9.799637101641151
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For deep linear networks (DLN), various hyperparameters alter the dynamics of
training dramatically. We investigate how the rank of the linear map found by
gradient descent is affected by (1) the initialization norm and (2) the
addition of $L_{2}$ regularization on the parameters. For (1), we study two
regimes: (1a) the linear/lazy regime, for large norm initialization; (1b) a
"saddle-to-saddle" regime for small initialization
norm. In the (1a) setting, the dynamics of a DLN of any depth is similar to
that of a standard linear model, without any low-rank bias. In the (1b)
setting, we conjecture that throughout training, gradient descent approaches a
sequence of saddles, each corresponding to linear maps of increasing rank,
until reaching a minimal rank global minimum. We support this conjecture with a
partial proof and some numerical experiments. For (2), we show that adding
$L_{2}$ regularization on the parameters corresponds to the addition to the
cost of an $L_{p}$-Schatten (quasi)norm on the linear map with $p=\frac{2}{L}$
(for a depth-$L$ network), leading to a stronger low-rank bias as $L$ grows.
The effect of $L_{2}$ regularization on the loss surface depends on the depth:
for shallow networks, all critical points are either strict saddles or global
minima, whereas for deep networks, some local minima appear. We numerically
observe that these local minima can generalize better than global ones in some
settings.
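The saddle-to-saddle regime (1b) can be illustrated with a few lines of NumPy. The sketch below is not the authors' code: the depth (3), dimensions, target rank, learning rate, and initialization scale are illustrative assumptions. It runs full-batch gradient descent on a product $W_3 W_2 W_1$ fitted to a rank-2 target and prints the top singular values of the end-to-end map, which should stay near zero for a while and then grow roughly one at a time.

    # Minimal sketch (assumed setup, not the authors' code): full-batch gradient
    # descent on a depth-3 deep linear network W3 @ W2 @ W1 fitted to a rank-2
    # target. With a small initialization scale, the singular values of the
    # end-to-end map tend to emerge sequentially, one saddle escape at a time.
    import numpy as np

    rng = np.random.default_rng(0)
    d, init_scale, lr, steps = 10, 1e-2, 0.05, 20001

    # Rank-2 target with well-separated singular values 3 and 1.
    U, _ = np.linalg.qr(rng.standard_normal((d, d)))
    V, _ = np.linalg.qr(rng.standard_normal((d, d)))
    target = U[:, :2] @ np.diag([3.0, 1.0]) @ V[:, :2].T

    W1, W2, W3 = (init_scale * rng.standard_normal((d, d)) for _ in range(3))

    for t in range(steps):
        E = W3 @ W2 @ W1 - target             # residual of the end-to-end map
        g1 = W2.T @ W3.T @ E                  # gradient of 0.5*||E||_F^2 w.r.t. W1
        g2 = W3.T @ E @ W1.T                  # ... w.r.t. W2
        g3 = E @ (W2 @ W1).T                  # ... w.r.t. W3
        W1, W2, W3 = W1 - lr * g1, W2 - lr * g2, W3 - lr * g3
        if t % 2000 == 0:
            svals = np.linalg.svd(W3 @ W2 @ W1, compute_uv=False)
            print(t, np.round(svals[:3], 3))  # watch the rank build up over time

Increasing init_scale pushes the dynamics toward the lazy regime (1a), where, per the abstract, the plateaus should disappear and no low-rank bias is expected.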
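The correspondence in (2) rests on a standard variational identity: for a depth-$L$ factorization $A = W_L \cdots W_1$, the minimal parameter cost $\sum_i \|W_i\|_F^2$ equals $L \sum_j \sigma_j(A)^{2/L}$, i.e. $L$ times the $L_{2/L}$-Schatten quasinorm of $A$ raised to the power $2/L$, attained by the balanced SVD factorization. The NumPy check below is a sketch of that identity (matrix sizes, depth, and the rescaling constant are arbitrary choices), not the paper's experiments.

    # Small numerical check (a sketch, not the paper's experiments) of the
    # variational identity behind the L2 <-> Schatten-(2/L) correspondence:
    #   min over W_L ... W_1 = A of sum_i ||W_i||_F^2 = L * sum_j sigma_j(A)^(2/L).
    import numpy as np

    rng = np.random.default_rng(1)
    L, d = 4, 6                                        # depth and matrix size (arbitrary)
    A = rng.standard_normal((d, 3)) @ rng.standard_normal((3, d))  # rank-3 map

    U, s, Vt = np.linalg.svd(A)
    S_root = np.diag(s ** (1.0 / L))                   # Sigma^(1/L)

    # Balanced factors: W_1 = Sigma^(1/L) V^T, middle factors Sigma^(1/L),
    # W_L = U Sigma^(1/L); their product W_L ... W_1 reproduces A exactly.
    Ws = [S_root @ Vt] + [S_root] * (L - 2) + [U @ S_root]
    assert np.allclose(np.linalg.multi_dot(Ws[::-1]), A)

    balanced_cost = sum(np.linalg.norm(W, "fro") ** 2 for W in Ws)
    schatten_value = L * np.sum(s ** (2.0 / L))        # L * ||A||_{S_{2/L}}^{2/L}
    print(balanced_cost, schatten_value)               # the two numbers agree

    # Rescaling two factors by c and 1/c keeps the product equal to A but
    # strictly increases the parameter cost: L2 regularization on the
    # parameters favors the balanced, low-Schatten-quasinorm factorization.
    c = 3.0
    Ws_unbalanced = [c * Ws[0]] + Ws[1:-1] + [Ws[-1] / c]
    print(sum(np.linalg.norm(W, "fro") ** 2 for W in Ws_unbalanced))

Because $2/L < 1$ for $L \ge 3$, the induced Schatten quasinorm penalizes additional nonzero singular values more aggressively as the depth grows, which is the stronger low-rank bias described in the abstract.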
Related papers
- Linear regression with overparameterized linear neural networks: Tight upper and lower bounds for implicit $\ell^1$-regularization [3.902441198412341]
We study implicit regularization in diagonal linear neural networks of depth $D \ge 2$ for overparameterized linear regression problems. Our results reveal a qualitative difference between depths: for $D \ge 3$, the error decreases linearly with $\alpha$, whereas for $D = 2$, it decreases at rate $\alpha^{1-\varrho}$. Numerical experiments corroborate our theoretical findings and suggest that deeper networks, i.e., $D \ge 3$, may lead to better generalization.
arXiv Detail & Related papers (2025-06-01T19:55:31Z) - Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes [29.466981306355066]
We show that gradient descent with a fixed learning rate $\eta$ can only find local minima that represent smooth functions.
We also prove a nearly-optimal MSE bound of $\widetilde{O}(n^{-4/5})$ within the strict interior of the support of the $n$ data points.
arXiv Detail & Related papers (2024-06-10T22:57:27Z) - Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - Global $\mathcal{L}^2$ minimization at uniform exponential rate via geometrically adapted gradient descent in Deep Learning [1.4050802766699084]
We consider the scenario of supervised learning in Deep Learning (DL) networks.
We choose the gradient flow with respect to the Euclidean metric in the output layer of the DL network.
arXiv Detail & Related papers (2023-11-27T02:12:02Z) - The Inductive Bias of Flatness Regularization for Deep Matrix
Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of the minimum trace of the Hessian solutions in deep linear networks.
We show that for all depths greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
arXiv Detail & Related papers (2023-06-22T23:14:57Z) - Greedy Pruning with Group Lasso Provably Generalizes for Matrix Sensing [30.508036898655114]
Pruning schemes have been widely used in practice to reduce the complexity of trained models with a massive number of parameters.
Running gradient descent in the absence of regularization results in models which are not suitable for greedy pruning, i.e., many columns could have their $\ell_2$ norm comparable to that of the maximum.
Our results provide the first rigorous insights on why greedy pruning + fine-tuning leads to smaller models which also generalize well.
arXiv Detail & Related papers (2023-03-20T21:05:44Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Gradient Descent Optimizes Infinite-Depth ReLU Implicit Networks with
Linear Widths [25.237054775800164]
This paper studies the convergence of gradient flow and gradient descent for nonlinear ReLU activated implicit networks.
We prove that both GF and GD converge to a global minimum at a linear rate if the width $m$ of the implicit network is linear in the sample size.
arXiv Detail & Related papers (2022-05-16T06:07:56Z) - High-dimensional Asymptotics of Feature Learning: How One Gradient Step
Improves the Representation [89.21686761957383]
We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer network.
Our results demonstrate that even one step can lead to a considerable advantage over random features.
arXiv Detail & Related papers (2022-05-03T12:09:59Z) - Implicit Regularization Towards Rank Minimization in ReLU Networks [34.41953136999683]
We study the conjectured relationship between the implicit regularization in neural networks and rank minimization.
We focus on nonlinear ReLU networks, providing several new positive and negative results.
arXiv Detail & Related papers (2022-01-30T09:15:44Z) - Leveraging Non-uniformity in First-order Non-convex Optimization [93.6817946818977]
Non-uniform refinement of objective functions leads to Non-uniform Smoothness (NS) and the Non-uniform Łojasiewicz inequality (NŁ).
New definitions inspire new geometry-aware first-order methods that converge to global optimality faster than the classical $\Omega(1/t^2)$ lower bounds.
arXiv Detail & Related papers (2021-05-13T04:23:07Z) - Beyond Lazy Training for Over-parameterized Tensor Decomposition [69.4699995828506]
We show that gradient descent on over-parametrized objective could go beyond the lazy training regime and utilize certain low-rank structure in the data.
arXiv Detail & Related papers (2020-10-22T00:32:12Z) - Implicit Bias in Deep Linear Classification: Initialization Scale vs
Training Accuracy [71.25689267025244]
We show how the transition is controlled by the relationship between the initialization scale and how accurately we minimize the training loss.
Our results indicate that some limit behaviors of gradient descent only kick in at ridiculous training accuracies.
arXiv Detail & Related papers (2020-07-13T23:49:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.