Gradient flow dynamics of shallow ReLU networks for square loss and
orthogonal inputs
- URL: http://arxiv.org/abs/2206.00939v1
- Date: Thu, 2 Jun 2022 09:01:25 GMT
- Title: Gradient flow dynamics of shallow ReLU networks for square loss and
orthogonal inputs
- Authors: Etienne Boursier and Loucas Pillaud-Vivien and Nicolas Flammarion
- Abstract summary: The training of neural networks by gradient descent methods is a cornerstone of the deep learning revolution.
This article presents a precise description of the gradient flow dynamics of one-hidden-layer ReLU neural networks trained for the mean squared error at small initialisation.
- Score: 19.401271427657395
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The training of neural networks by gradient descent methods is a cornerstone
of the deep learning revolution. Yet, despite some recent progress, a complete
theory explaining its success is still missing. This article presents, for
orthogonal input vectors, a precise description of the gradient flow dynamics
of training one-hidden layer ReLU neural networks for the mean squared error at
small initialisation. In this setting, despite non-convexity, we show that the
gradient flow converges to zero loss and characterise its implicit bias towards
minimum variation norm. Furthermore, some interesting phenomena are
highlighted: a quantitative description of the initial alignment phenomenon and
a proof that the process follows a specific saddle to saddle dynamics.
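As a rough illustration (not code from the paper), the setting can be simulated numerically: gradient descent with a small step size discretises gradient flow for a one-hidden-layer ReLU network trained with the mean squared error on orthogonal inputs from a small initialisation. The width, step size, targets, and number of steps below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Setting sketched in the abstract: orthogonal inputs in R^d,
# a one-hidden-layer ReLU network, mean squared error, small init.
d, n, m = 8, 8, 32            # input dim, number of samples, hidden width
X = np.eye(d)[:n]             # orthonormal (hence orthogonal) inputs
y = rng.standard_normal(n)    # arbitrary targets (assumption)

scale = 1e-2                  # "small initialisation"
W = scale * rng.standard_normal((m, d))   # hidden-layer weights
a = scale * rng.standard_normal(m)        # output-layer weights

lr = 0.05                     # small step: gradient descent as a proxy for gradient flow
losses = []
for _ in range(30000):
    h = np.maximum(W @ X.T, 0.0)          # (m, n) hidden activations
    pred = h.T @ a                        # network outputs on the n inputs
    r = pred - y                          # residuals
    losses.append(0.5 * np.mean(r**2))
    # gradients of the mean squared error w.r.t. a and W
    ga = h @ r / n
    gW = ((a[:, None] * (h > 0.0)) * r[None, :]) @ X / n
    a -= lr * ga
    W -= lr * gW

print(losses[0], losses[-1])  # loss decreases towards zero, as the paper proves for gradient flow
```

Despite the non-convexity of the objective, the loss curve typically shows the plateaus of the saddle-to-saddle dynamics described in the abstract before dropping towards zero.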
Related papers
- Early Directional Convergence in Deep Homogeneous Neural Networks for
Small Initializations [2.310288676109785]
This paper studies the gradient flow dynamics that arise when training deep homogeneous neural networks.
The weights of the neural network remain small in norm and approximately converge in direction towards Karush-Kuhn-Tucker points.
arXiv Detail & Related papers (2024-03-12T23:17:32Z) - Early alignment in two-layer networks training is a two-edged sword [24.43739371803548]
Training neural networks with first order optimisation methods is at the core of the empirical success of deep learning.
Small initialisations are generally associated to a feature learning regime, for which gradient descent is implicitly biased towards simple solutions.
This work provides a general and quantitative description of the early alignment phase, originally introduced by Maennel et al.
arXiv Detail & Related papers (2024-01-19T16:23:53Z) - Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU
Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Magnitude and Angle Dynamics in Training Single ReLU Neurons [45.886537625951256]
We decompose the gradient flow $w(t)$ into magnitude $\|w(t)\|$ and angle $\phi(t) := \pi - \theta(t)$ components.
We find that small scale initialization induces slow convergence speed for deep single ReLU neurons.
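The magnitude/angle decomposition can be made concrete with a small sketch (hypothetical, not from the paper): a weight vector $w$ is tracked via its norm and the angle it makes with a fixed reference direction $v$, which stands in for the teacher direction defining $\theta$.

```python
import numpy as np

def decompose(w, v):
    """Return (magnitude, phi) of w relative to reference direction v,
    where phi := pi - theta and theta is the angle between w and v."""
    magnitude = np.linalg.norm(w)
    cos_theta = np.dot(w, v) / (magnitude * np.linalg.norm(v))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))  # clip guards rounding
    return magnitude, np.pi - theta

# Example: w at 45 degrees to v, so theta = pi/4 and phi = 3*pi/4.
w = np.array([1.0, 1.0])
v = np.array([1.0, 0.0])
mag, phi = decompose(w, v)
print(mag, phi)
```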
arXiv Detail & Related papers (2022-09-27T13:58:46Z) - Support Vectors and Gradient Dynamics for Implicit Bias in ReLU Networks [45.886537625951256]
We study gradient flow dynamics in the parameter space when training single-neuron ReLU networks.
Specifically, we discover implicit bias in terms of support vectors in ReLU networks, which play a key role in why and how ReLU networks generalize well.
arXiv Detail & Related papers (2022-02-11T08:55:58Z) - Gradient Starvation: A Learning Proclivity in Neural Networks [97.02382916372594]
Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task.
This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
arXiv Detail & Related papers (2020-11-18T18:52:08Z) - Implicit Bias in Deep Linear Classification: Initialization Scale vs
Training Accuracy [71.25689267025244]
We show how the transition is controlled by the relationship between the initialization scale and how accurately we minimize the training loss.
Our results indicate that some limit behaviors of gradient descent only kick in at ridiculous training accuracies.
arXiv Detail & Related papers (2020-07-13T23:49:53Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z) - On the Convex Behavior of Deep Neural Networks in Relation to the
Layers' Width [99.24399270311069]
We observe that for wider networks, minimizing the loss with gradient descent maneuvers through surfaces of positive curvature at the start and end of training, and close to zero curvature in between.
In other words, it seems that during crucial parts of the training process, the Hessian in wide networks is dominated by the component G.
arXiv Detail & Related papers (2020-01-14T16:30:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.