On the Explicit Role of Initialization on the Convergence and Implicit
Bias of Overparametrized Linear Networks
- URL: http://arxiv.org/abs/2105.06351v1
- Date: Thu, 13 May 2021 15:13:51 GMT
- Title: On the Explicit Role of Initialization on the Convergence and Implicit
Bias of Overparametrized Linear Networks
- Authors: Hancheng Min, Salma Tarmoun, Rene Vidal, Enrique Mallada
- Abstract summary: We present a novel analysis of single-hidden-layer linear networks trained under gradient flow.
We show that the squared loss converges exponentially to its optimum.
We derive a novel non-asymptotic upper-bound on the distance between the trained network and the min-norm solution.
- Score: 1.0323063834827415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural networks trained via gradient descent with random initialization and
without any regularization enjoy good generalization performance in practice
despite being highly overparametrized. A promising direction to explain this
phenomenon is to study how initialization and overparametrization affect
convergence and implicit bias of training algorithms. In this paper, we present
a novel analysis of single-hidden-layer linear networks trained under gradient
flow, which connects initialization, optimization, and overparametrization.
Firstly, we show that the squared loss converges exponentially to its optimum
at a rate that depends on the level of imbalance of the initialization.
Secondly, we show that proper initialization constrains the dynamics of the
network parameters to lie within an invariant set. In turn, minimizing the loss
over this set leads to the min-norm solution. Finally, we show that large
hidden layer width, together with (properly scaled) random initialization,
ensures proximity to such an invariant set during training, allowing us to
derive a novel non-asymptotic upper-bound on the distance between the trained
network and the min-norm solution.
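To make the abstract's three claims concrete, the sketch below trains a single-hidden-layer linear network f(x) = W2 W1 x on a squared loss with plain gradient descent and reports the final loss, the imbalance ||W1 W1^T - W2^T W2||_F (a quantity conserved under gradient flow for linear networks), and the distance of the end-to-end map from the pseudoinverse min-norm solution. This is a minimal numerical illustration, not the paper's code: the dimensions, step size, iteration count, and initialization scales are arbitrary choices, and gradient descent with a small step size only approximates the gradient-flow setting analyzed in the paper.

import numpy as np

rng = np.random.default_rng(0)

# Underdetermined regression so that many interpolating solutions exist (n < d).
n, d, h, m = 20, 40, 100, 2          # samples, input dim, hidden width, output dim
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, m))

# Minimum-Frobenius-norm interpolating linear map: W_star = X^+ Y.
W_star = np.linalg.pinv(X) @ Y       # shape (d, m)

def train(scale, steps=50000, lr=5e-4):
    """Gradient descent on L(W1, W2) = 0.5 * ||X W1^T W2^T - Y||_F^2."""
    W1 = scale * rng.standard_normal((h, d)) / np.sqrt(h)   # hidden layer
    W2 = scale * rng.standard_normal((m, h)) / np.sqrt(h)   # output layer
    for _ in range(steps):
        E = X @ W1.T @ W2.T - Y      # residuals, shape (n, m)
        G = E.T @ X                  # gradient w.r.t. the end-to-end map W2 W1, shape (m, d)
        G1 = W2.T @ G                # dL/dW1, shape (h, d)
        G2 = G @ W1.T                # dL/dW2, shape (m, h)
        W1 -= lr * G1
        W2 -= lr * G2
    loss = 0.5 * np.linalg.norm(X @ W1.T @ W2.T - Y) ** 2
    imbalance = np.linalg.norm(W1 @ W1.T - W2.T @ W2)        # ~constant during training
    dist = np.linalg.norm(W1.T @ W2.T - W_star)              # distance to min-norm solution
    return loss, imbalance, dist

for scale in (0.1, 1.0):
    loss, imb, dist = train(scale)
    print(f"init scale {scale:>3}: loss {loss:.2e}, "
          f"imbalance {imb:.3f}, dist to min-norm {dist:.3f}")

Under these assumptions, the smaller initialization scale typically leaves the end-to-end map much closer to the min-norm solution, while the imbalance fixed at initialization stays approximately constant throughout training, mirroring the invariant-set argument in the abstract.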
Related papers
- Early alignment in two-layer networks training is a two-edged sword [24.43739371803548]
Training neural networks with first order optimisation methods is at the core of the empirical success of deep learning.
Small initialisations are generally associated with a feature learning regime, for which gradient descent is implicitly biased towards simple solutions.
This work provides a general and quantitative description of the early alignment phase, originally introduced by Maennel et al.
arXiv Detail & Related papers (2024-01-19T16:23:53Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Implicit regularization in AI meets generalized hardness of
approximation in optimization -- Sharp results for diagonal linear networks [0.0]
We show sharp results for the implicit regularization imposed by the gradient flow of Diagonal Linear Networks.
We link this to the phenomenon of phase transitions in generalized hardness of approximation.
Non-sharpness of our results would imply that the GHA phenomenon would not occur for the basis pursuit optimization problem.
arXiv Detail & Related papers (2023-07-13T13:27:51Z) - On the Effect of Initialization: The Scaling Path of 2-Layer Neural
Networks [21.69222364939501]
In supervised learning, the regularization path is sometimes used as a convenient theoretical proxy for the optimization path of gradient descent from zero.
We show that the path interpolates continuously between the so-called kernel and rich regimes.
arXiv Detail & Related papers (2023-03-31T05:32:11Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the variance of the random initialization is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Improved Overparametrization Bounds for Global Convergence of Stochastic
Gradient Descent for Shallow Neural Networks [1.14219428942199]
We study the overparametrization bounds required for the global convergence of stochastic gradient descent for a class of one-hidden-layer feed-forward neural networks.
arXiv Detail & Related papers (2022-01-28T11:30:06Z) - Subquadratic Overparameterization for Shallow Neural Networks [60.721751363271146]
We provide an analytical framework that allows us to adopt standard neural training strategies.
We achieve the desiderata via Polyak-Lojasiewicz, smoothness, and standard assumptions.
arXiv Detail & Related papers (2021-11-02T20:24:01Z) - Path Regularization: A Convexity and Sparsity Inducing Regularization
for Parallel ReLU Networks [75.33431791218302]
We study the training problem of deep neural networks and introduce an analytic approach to unveil hidden convexity in the optimization landscape.
We consider a deep parallel ReLU network architecture, which also includes standard deep networks and ResNets as its special cases.
arXiv Detail & Related papers (2021-10-18T18:00:36Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z) - MSE-Optimal Neural Network Initialization via Layer Fusion [68.72356718879428]
Deep neural networks achieve state-of-the-art performance for a range of classification and inference tasks.
The use of gradient descent combined with nonconvexity renders learning susceptible to initialization problems.
We propose fusing neighboring layers of deeper networks that are trained with random initializations.
arXiv Detail & Related papers (2020-01-28T18:25:15Z) - Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear
Networks [39.856439772974454]
We show that the width needed for efficient convergence to a global minimum is independent of the depth.
Our results suggest an explanation for the recent empirical successes found by initializing very deep non-linear networks according to the principle of dynamical isometry.
arXiv Detail & Related papers (2020-01-16T18:48:34Z)