Gradient flow dynamics of shallow ReLU networks for square loss and
orthogonal inputs
- URL: http://arxiv.org/abs/2206.00939v1
- Date: Thu, 2 Jun 2022 09:01:25 GMT
- Title: Gradient flow dynamics of shallow ReLU networks for square loss and
orthogonal inputs
- Authors: Etienne Boursier and Loucas Pillaud-Vivien and Nicolas Flammarion
- Abstract summary: The training of neural networks by gradient descent methods is a cornerstone of the deep learning revolution.
This article presents a precise description of the gradient flow dynamics of one-hidden-layer ReLU neural networks trained for the mean squared error at small initialisation.
- Score: 19.401271427657395
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The training of neural networks by gradient descent methods is a cornerstone
of the deep learning revolution. Yet, despite some recent progress, a complete
theory explaining its success is still missing. This article presents, for
orthogonal input vectors, a precise description of the gradient flow dynamics
of training one-hidden layer ReLU neural networks for the mean squared error at
small initialisation. In this setting, despite non-convexity, we show that the
gradient flow converges to zero loss and characterise its implicit bias towards
minimum variation norm. Furthermore, some interesting phenomena are
highlighted: a quantitative description of the initial alignment phenomenon and
a proof that the process follows a specific saddle to saddle dynamics.
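As a rough illustration (not code from the paper), the setting can be simulated numerically: gradient descent with a small step size discretises gradient flow for a one-hidden-layer ReLU network trained with the mean squared error on orthogonal inputs from a small initialisation. The width, step size, targets, and number of steps below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Setting sketched in the abstract: orthogonal inputs in R^d,
# a one-hidden-layer ReLU network, mean squared error, small init.
d, n, m = 8, 8, 32            # input dim, number of samples, hidden width
X = np.eye(d)[:n]             # orthonormal (hence orthogonal) inputs
y = rng.standard_normal(n)    # arbitrary targets (assumption)

scale = 1e-2                  # "small initialisation"
W = scale * rng.standard_normal((m, d))   # hidden-layer weights
a = scale * rng.standard_normal(m)        # output-layer weights

lr = 0.05                     # small step: gradient descent as a proxy for gradient flow
losses = []
for _ in range(30000):
    h = np.maximum(W @ X.T, 0.0)          # (m, n) hidden activations
    pred = h.T @ a                        # network outputs on the n inputs
    r = pred - y                          # residuals
    losses.append(0.5 * np.mean(r**2))
    # gradients of the mean squared error w.r.t. a and W
    ga = h @ r / n
    gW = ((a[:, None] * (h > 0.0)) * r[None, :]) @ X / n
    a -= lr * ga
    W -= lr * gW

print(losses[0], losses[-1])  # loss decreases towards zero, as the paper proves for gradient flow
```

Despite the non-convexity of the objective, the loss curve typically shows the plateaus of the saddle-to-saddle dynamics described in the abstract before dropping towards zero.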
Related papers
- Early Directional Convergence in Deep Homogeneous Neural Networks for
Small Initializations [2.310288676109785]
This paper studies the gradient flow dynamics that arise when training deep homogeneous neural networks.
The weights of the neural network remain small in norm and approximately converge in direction towards Karush-Kuhn-Tucker points.
arXiv Detail & Related papers (2024-03-12T23:17:32Z) - Early alignment in two-layer networks training is a two-edged sword [24.43739371803548]
Training neural networks with first order optimisation methods is at the core of the empirical success of deep learning.
Small initialisations are generally associated to a feature learning regime, for which gradient descent is implicitly biased towards simple solutions.
This work provides a general and quantitative description of the early alignment phase, originally introduced by Maennel et al.
arXiv Detail & Related papers (2024-01-19T16:23:53Z) - Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU
Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Magnitude and Angle Dynamics in Training Single ReLU Neurons [45.886537625951256]
We decompose the gradient flow $w(t)$ into magnitude $\|w(t)\|$ and angle $\phi(t) := \pi - \theta(t)$ components.
We find that small scale initialization induces slow convergence speed for deep single ReLU neurons.
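The magnitude/angle decomposition can be made concrete with a small sketch (hypothetical, not from the paper): a weight vector $w$ is tracked via its norm and the angle it makes with a fixed reference direction $v$, which stands in for the teacher direction defining $\theta$.

```python
import numpy as np

def decompose(w, v):
    """Return (magnitude, phi) of w relative to reference direction v,
    where phi := pi - theta and theta is the angle between w and v."""
    magnitude = np.linalg.norm(w)
    cos_theta = np.dot(w, v) / (magnitude * np.linalg.norm(v))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))  # clip guards rounding
    return magnitude, np.pi - theta

# Example: w at 45 degrees to v, so theta = pi/4 and phi = 3*pi/4.
w = np.array([1.0, 1.0])
v = np.array([1.0, 0.0])
mag, phi = decompose(w, v)
print(mag, phi)
```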
arXiv Detail & Related papers (2022-09-27T13:58:46Z) - Support Vectors and Gradient Dynamics for Implicit Bias in ReLU Networks [45.886537625951256]
We study gradient flow dynamics in the parameter space when training single-neuron ReLU networks.
Specifically, we discover implicit bias in terms of support vectors in ReLU networks, which play a key role in why and how ReLU networks generalize well.
arXiv Detail & Related papers (2022-02-11T08:55:58Z) - Gradient Starvation: A Learning Proclivity in Neural Networks [97.02382916372594]
Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task.
This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
arXiv Detail & Related papers (2020-11-18T18:52:08Z) - Implicit Bias in Deep Linear Classification: Initialization Scale vs
Training Accuracy [71.25689267025244]
We show how the transition is controlled by the relationship between the initialization scale and how accurately we minimize the training loss.
Our results indicate that some limit behaviors of gradient descent only kick in at ridiculous training accuracies.
arXiv Detail & Related papers (2020-07-13T23:49:53Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z) - On the Convex Behavior of Deep Neural Networks in Relation to the
Layers' Width [99.24399270311069]
We observe that for wider networks, minimizing the loss with gradient descent maneuvers through surfaces of positive curvature at the start and end of training, and close to zero curvature in between.
In other words, it seems that during crucial parts of the training process, the Hessian in wide networks is dominated by the component G.
arXiv Detail & Related papers (2020-01-14T16:30:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.