Gradient descent provably escapes saddle points in the training of
shallow ReLU networks
- URL: http://arxiv.org/abs/2208.02083v1
- Date: Wed, 3 Aug 2022 14:08:52 GMT
- Title: Gradient descent provably escapes saddle points in the training of
shallow ReLU networks
- Authors: Patrick Cheridito, Arnulf Jentzen, Florian Rossmannek
- Abstract summary: We prove a variant of the relevant dynamical systems result, a center-stable manifold theorem, in which we relax some regularity requirements.
Building on a classification of critical points of the square integral loss of shallow ReLU networks measured against an affine target function, we deduce that gradient descent avoids most saddle points.
- Score: 3.0079490585515343
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dynamical systems theory has recently been applied in optimization to prove
that gradient descent algorithms avoid so-called strict saddle points of the
loss function. However, in many modern machine learning applications, the
required regularity conditions are not satisfied. In particular, this is the
case for rectified linear unit (ReLU) networks. In this paper, we prove a
variant of the relevant dynamical systems result, a center-stable manifold
theorem, in which we relax some of the regularity requirements. Then, we verify
that shallow ReLU networks fit into the new framework. Building on a
classification of critical points of the square integral loss of shallow ReLU
networks measured against an affine target function, we deduce that gradient
descent avoids most saddle points. We proceed to prove convergence to global
minima if the initialization is sufficiently good, which is expressed by an
explicit threshold on the limiting loss.
Related papers
- On the Stability of Gradient Descent for Large Learning Rate [62.19241612132701]
Edge of Stability (EoS) has been observed in neural networks training, characterized by a non-monotonic decrease of the loss function over epochs.
We show that linear neural networks optimized under a quadratic loss function satisfy the first assumption and also a necessary condition for the second assumption.
arXiv Detail & Related papers (2024-02-20T16:01:42Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-vary learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linearahead as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Implicit regularization in AI meets generalized hardness of
approximation in optimization -- Sharp results for diagonal linear networks [0.0]
We show sharp results for the implicit regularization imposed by the gradient flow of Diagonal Linear Networks.
We link this to the phenomenon of phase transitions in generalized hardness of approximation.
Non-sharpness of our results would imply that the GHA phenomenon would not occur for the basis pursuit optimization problem.
arXiv Detail & Related papers (2023-07-13T13:27:51Z) - The Implicit Bias of Minima Stability in Multivariate Shallow ReLU
Networks [53.95175206863992]
We study the type of solutions to which gradient descent converges when used to train a single hidden-layer multivariate ReLU network with the quadratic loss.
We prove that although shallow ReLU networks are universal approximators, stable shallow networks are not.
arXiv Detail & Related papers (2023-06-30T09:17:39Z) - Implicit Regularization for Group Sparsity [33.487964460794764]
We show that gradient descent over the squared regression loss, without any explicit regularization, biases towards solutions with a group sparsity structure.
We analyze the gradient dynamics of the corresponding regression problem in the general noise setting and obtain minimax-optimal error rates.
In the degenerate case of size-one groups, our approach gives rise to a new algorithm for sparse linear regression.
arXiv Detail & Related papers (2023-01-29T20:54:03Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that leakyally, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Support Vectors and Gradient Dynamics for Implicit Bias in ReLU Networks [45.886537625951256]
We study gradient flow dynamics in the parameter space when training single-neuron ReLU networks.
Specifically, we discover implicit bias in terms of support vectors in ReLU networks, which play a key role in why and how ReLU networks generalize well.
arXiv Detail & Related papers (2022-02-11T08:55:58Z) - Improved Overparametrization Bounds for Global Convergence of Stochastic
Gradient Descent for Shallow Neural Networks [1.14219428942199]
We study the overparametrization bounds required for the global convergence of gradient descent algorithm for a class of one hidden layer feed-forward neural networks.
arXiv Detail & Related papers (2022-01-28T11:30:06Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.