Understanding Edge-of-Stability Training Dynamics with a Minimalist
Example
- URL: http://arxiv.org/abs/2210.03294v1
- Date: Fri, 7 Oct 2022 02:57:05 GMT
- Title: Understanding Edge-of-Stability Training Dynamics with a Minimalist
Example
- Authors: Xingyu Zhu, Zixuan Wang, Xiang Wang, Mo Zhou, Rong Ge
- Abstract summary: Recently, researchers observed that gradient descent for deep neural networks operates in an ``edge-of-stability'' (EoS) regime.
We give rigorous analysis for its training dynamics in a large local region and explain why the final converging point has sharpness close to $2/\eta$.
- Score: 20.714857891192345
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, researchers observed that gradient descent for deep neural networks
operates in an ``edge-of-stability'' (EoS) regime: the sharpness (maximum
eigenvalue of the Hessian) is often larger than the stability threshold $2/\eta$
(where $\eta$ is the step size). Despite this, the loss oscillates and
converges in the long run, and the sharpness at the end is just slightly below
$2/\eta$. While many other well-understood nonconvex objectives such as matrix
factorization or two-layer networks can also converge despite large sharpness,
there is often a larger gap between sharpness of the endpoint and $2/\eta$. In
this paper, we study the EoS phenomenon by constructing a simple function that has
the same behavior. We give rigorous analysis for its training dynamics in a
large local region and explain why the final converging point has sharpness
close to $2/\eta$. Globally we observe that the training dynamics for our
example exhibit an interesting bifurcating behavior, which has also been observed
in the training of neural nets.
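The stability threshold in the abstract can be illustrated on a one-dimensional quadratic, where the Hessian (and hence the sharpness) is a constant and gradient descent contracts exactly when that constant is below $2/\eta$. A minimal sketch with illustrative values (not taken from the paper):

```python
def gd_on_quadratic(sharpness, eta, steps=50, x0=1.0):
    """Run gradient descent on f(x) = sharpness * x**2 / 2.

    For this quadratic the Hessian is the constant `sharpness`, so the
    update is x <- (1 - eta * sharpness) * x, and the iterates contract
    exactly when sharpness < 2 / eta.
    """
    x = x0
    for _ in range(steps):
        x -= eta * sharpness * x  # gradient of f is sharpness * x
    return x

eta = 0.1  # step size, so the stability threshold is 2 / eta = 20
stable = gd_on_quadratic(sharpness=19.0, eta=eta)    # factor |1 - 0.1*19| = 0.9 < 1
unstable = gd_on_quadratic(sharpness=21.0, eta=eta)  # factor |1 - 0.1*21| = 1.1 > 1
print(abs(stable), abs(unstable))
```

On a pure quadratic, crossing $2/\eta$ means outright divergence; the EoS observation is that nonquadratic deep-learning losses instead oscillate yet keep decreasing while the sharpness hovers near $2/\eta$.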
Related papers
- When Expressivity Meets Trainability: Fewer than $n$ Neurons Can Work [59.29606307518154]
We show that as long as the width $m \geq 2n/d$ (where $d$ is the input dimension), its expressivity is strong, i.e., there exists at least one global minimizer with zero training loss.
We also consider a constrained optimization formulation where the feasible region is the nice local region, and prove that every KKT point is a nearly global minimizer.
arXiv Detail & Related papers (2022-10-21T14:41:26Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Analyzing Sharpness along GD Trajectory: Progressive Sharpening and Edge
of Stability [8.492339290649031]
This paper aims to analyze the GD dynamics and the sharpness along the optimization trajectory.
We empirically identify the norm of the output-layer weights as an interesting indicator of the sharpness dynamics.
We provide a theoretical proof of the sharpness behavior in the EoS regime in two-layer fully-connected linear neural networks.
arXiv Detail & Related papers (2022-07-26T06:37:58Z) - Benign Overfitting in Two-layer Convolutional Neural Networks [90.75603889605043]
We study the benign overfitting phenomenon in training a two-layer convolutional neural network (CNN).
We show that when the signal-to-noise ratio satisfies a certain condition, a two-layer CNN trained by gradient descent can achieve arbitrarily small training and test loss.
On the other hand, when this condition does not hold, overfitting becomes harmful and the obtained CNN can only achieve constant-level test loss.
arXiv Detail & Related papers (2022-02-14T07:45:51Z) - Global Convergence Analysis of Deep Linear Networks with A One-neuron
Layer [18.06634056613645]
We consider optimizing deep linear networks which have a layer with one neuron under quadratic loss.
We describe the convergent point of trajectories with arbitrary starting point under gradient flow.
We show specific convergence rates of trajectories that converge to the global minimizer by stages.
arXiv Detail & Related papers (2022-01-08T04:44:59Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - A Geometric Analysis of Neural Collapse with Unconstrained Features [40.66585948844492]
We provide the first global optimization landscape analysis of Neural Collapse.
This phenomenon arises in the last-layer classifiers and features of neural networks during the terminal phase of training.
arXiv Detail & Related papers (2021-05-06T00:00:50Z) - Gradient Descent on Neural Networks Typically Occurs at the Edge of
Stability [94.4070247697549]
Full-batch gradient descent on neural network training objectives operates in a regime we call the Edge of Stability.
In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the numerical value $2/\text{(step size)}$, and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales.
arXiv Detail & Related papers (2021-02-26T22:08:19Z) - Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks
Trained with the Logistic Loss [0.0]
Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks.
We analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations.
arXiv Detail & Related papers (2020-02-11T15:42:09Z) - A Generalized Neural Tangent Kernel Analysis for Two-layer Neural
Networks [87.23360438947114]
We show that noisy gradient descent with weight decay can still exhibit a "kernel-like" behavior.
This implies that the training loss converges linearly up to a certain accuracy.
We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay.
arXiv Detail & Related papers (2020-02-10T18:56:15Z)
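Throughout the papers above, "sharpness" means the largest eigenvalue of the training-loss Hessian. For a small explicit Hessian it can be estimated by power iteration; a minimal NumPy sketch on a toy matrix (the values are illustrative, not drawn from any of the papers):

```python
import numpy as np

def estimate_sharpness(hessian, iters=200, seed=0):
    """Estimate the largest eigenvalue of a symmetric PSD Hessian by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=hessian.shape[0])
    for _ in range(iters):
        v = hessian @ v           # amplify the top-eigenvalue direction
        v /= np.linalg.norm(v)    # renormalize to keep the iterate bounded
    return float(v @ hessian @ v)  # Rayleigh quotient of the unit vector v

eta = 0.1
H = np.diag([0.5, 3.0, 18.5])  # toy Hessian; its top eigenvalue is 18.5
lam = estimate_sharpness(H)
print(lam, 2 / eta)  # compare the estimated sharpness against the threshold 2/eta
```

In practice one never forms the Hessian of a neural network explicitly; the same power iteration is run with Hessian-vector products obtained via automatic differentiation.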
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.