Analyzing Sharpness along GD Trajectory: Progressive Sharpening and Edge of Stability
- URL: http://arxiv.org/abs/2207.12678v1
- Date: Tue, 26 Jul 2022 06:37:58 GMT
- Title: Analyzing Sharpness along GD Trajectory: Progressive Sharpening and Edge of Stability
- Authors: Zhouzi Li, Zixuan Wang, Jian Li
- Abstract summary: This paper aims to analyze the GD dynamics and the sharpness along the optimization trajectory.
We empirically identify the norm of output layer weight as an interesting indicator of sharpness dynamics.
We provide a theoretical proof of the sharpness behavior in the EOS regime for two-layer fully-connected linear neural networks.
- Score: 8.492339290649031
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent findings (e.g., arXiv:2103.00065) demonstrate that modern neural
networks trained by full-batch gradient descent typically enter a regime called
Edge of Stability (EOS). In this regime, the sharpness, i.e., the maximum
Hessian eigenvalue, first increases to the value 2/(step size) (the progressive
sharpening phase) and then oscillates around this value (the EOS phase). This
paper aims to analyze the GD dynamics and the sharpness along the optimization
trajectory. Our analysis naturally divides the GD trajectory into four phases
depending on the change of the sharpness. We empirically identify the norm of
output layer weight as an interesting indicator of sharpness dynamics. Based on
this empirical observation, we attempt to theoretically and empirically explain
the dynamics of various key quantities that lead to the change of sharpness in
each phase of EOS. Moreover, based on certain assumptions, we provide a
theoretical proof of the sharpness behavior in the EOS regime for two-layer
fully-connected linear neural networks. We also discuss some other empirical
findings and the limitations of our theoretical results.
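The quantities in the abstract are straightforward to monitor numerically. Below is a minimal sketch (not the authors' code) of full-batch GD on a two-layer fully-connected linear network, the setting of the paper's theory: the sharpness is estimated by power iteration on finite-difference Hessian-vector products, and the output-layer weight norm is tracked alongside it. All dimensions, data, and the step size are illustrative assumptions, and whether a given run reaches the $2/\text{(step size)}$ threshold depends on them.

```python
# Minimal sketch (not the authors' code): full-batch GD on a two-layer
# fully-connected linear network. The sharpness (top Hessian eigenvalue)
# is estimated by power iteration on finite-difference Hessian-vector
# products, and ||W2||_F (the output-layer weight norm the paper uses as
# an indicator) is tracked. Sizes, data, and step size are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, h, k, n = 5, 8, 3, 20                 # input dim, hidden dim, output dim, samples
X = rng.standard_normal((d, n))
Y = rng.standard_normal((k, n))
W1 = 0.1 * rng.standard_normal((h, d))
W2 = 0.1 * rng.standard_normal((k, h))
eta = 0.02                               # step size; EOS threshold is 2/eta

def grads(W1, W2):
    """Gradients of L = ||W2 @ W1 @ X - Y||_F^2 / (2n)."""
    R = W2 @ W1 @ X - Y
    return W2.T @ R @ X.T / n, R @ (W1 @ X).T / n

def flat(g1, g2):
    return np.concatenate([g1.ravel(), g2.ravel()])

def sharpness(W1, W2, iters=50, eps=1e-4):
    """Top Hessian eigenvalue via power iteration on finite-difference HVPs."""
    v = rng.standard_normal(W1.size + W2.size)
    v /= np.linalg.norm(v)
    g0, lam = flat(*grads(W1, W2)), 0.0
    for _ in range(iters):
        dW1 = eps * v[:W1.size].reshape(W1.shape)
        dW2 = eps * v[W1.size:].reshape(W2.shape)
        hv = (flat(*grads(W1 + dW1, W2 + dW2)) - g0) / eps
        lam = float(v @ hv)              # Rayleigh quotient estimate
        v = hv / (np.linalg.norm(hv) + 1e-12)
    return lam

for t in range(2001):                    # full-batch gradient descent
    g1, g2 = grads(W1, W2)
    W1, W2 = W1 - eta * g1, W2 - eta * g2
    if t % 200 == 0:
        print(f"step {t:4d}  sharpness {sharpness(W1, W2):9.3f}  "
              f"2/eta {2 / eta:6.1f}  ||W2||_F {np.linalg.norm(W2):6.3f}")
```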
Related papers
- Universal Sharpness Dynamics in Neural Network Training: Fixed Point Analysis, Edge of Stability, and Route to Chaos [6.579523168465526]
In gradient descent dynamics of neural networks, the top eigenvalue of the Hessian of the loss (the sharpness) displays a variety of robust phenomena throughout training.
We demonstrate that a simple $2$-layer linear network (UV model) trained on a single training example exhibits all of the essential sharpness phenomenology observed in real-world scenarios.
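Because the UV model is fully explicit, its sharpness can be tracked in closed form along GD. A minimal sketch, assuming the scalar form $L(u,v) = \frac{1}{2}(uvx - y)^2$ on a single example $(x, y)$; the paper's exact parametrization and hyperparameters may differ:

```python
# Minimal sketch of a scalar UV model (an assumption about its exact form):
# GD on L(u, v) = (u*v*x - y)^2 / 2 for a single example (x, y). The 2x2
# Hessian is explicit, so the sharpness is tracked exactly at every step.
import math

x, y = 1.0, 2.0
u, v = 0.3, 0.4            # small initialization
eta = 0.45                 # 2/eta ~ 4.44; the flattest minimum here has sharpness 4,
                           # so eta > 0.5 would leave no stable minimum (EOS oscillation)

for t in range(41):
    r = u * v * x - y      # residual
    # Hessian of L: [[(v*x)^2, x*(2*u*v*x - y)], [x*(2*u*v*x - y), (u*x)^2]]
    a, c = (v * x) ** 2, (u * x) ** 2
    b = x * (2 * u * v * x - y)
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)   # top eigenvalue
    if t % 5 == 0:
        print(f"step {t:2d}  loss {r * r / 2:7.4f}  sharpness {lam:6.3f}  2/eta {2 / eta:5.2f}")
    u, v = u - eta * r * v * x, v - eta * r * u * x              # simultaneous GD update
```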
arXiv Detail & Related papers (2023-11-03T17:59:40Z)
- Trajectory Alignment: Understanding the Edge of Stability Phenomenon via Bifurcation Theory [14.141453107129403]
We study the evolution of the largest eigenvalue of the loss Hessian, also known as sharpness, along the gradient descent trajectory.
The sharpness increases in the early phase of training and eventually saturates close to the threshold of $2/\text{(step size)}$.
arXiv Detail & Related papers (2023-07-09T15:16:45Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been shown to be effective in solving forward and inverse differential equation problems.
However, PINN training can fail when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
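For context on the implicit update, here is a minimal sketch of a backward-Euler (implicit) gradient step on a stiff quadratic; it is a generic illustration rather than the paper's exact ISGD recipe, and the proximal subproblem, step size, and test loss are all assumptions:

```python
# Minimal sketch of an implicit (backward-Euler) gradient step, a generic
# illustration, not the paper's exact ISGD recipe. The implicit update
#     theta_{k+1} = theta_k - eta * grad L(theta_{k+1})
# is the minimizer of L(theta) + ||theta - theta_k||^2 / (2*eta), solved
# here as a proximal subproblem. The stiff quadratic loss is an assumption.
import numpy as np
from scipy.optimize import minimize

A = np.diag([100.0, 1.0])                 # sharpness 100: explicit GD needs eta < 0.02
loss = lambda th: 0.5 * th @ A @ th

def implicit_step(theta, eta):
    prox = lambda th: loss(th) + np.sum((th - theta) ** 2) / (2 * eta)
    return minimize(prox, theta).x        # solve the proximal subproblem

theta = np.array([1.0, 1.0])
eta = 0.1                                 # 5x the explicit stability limit, yet stable
for k in range(8):
    theta = implicit_step(theta, eta)
    print(f"step {k}  theta {theta}  loss {loss(theta):.6f}")
```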
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
- Phase diagram of early training dynamics in deep neural networks: effect of the learning rate, depth, and width [0.0]
We systematically analyze optimization dynamics in deep neural networks (DNNs) trained with stochastic gradient descent (SGD).
We find that the dynamics can show four distinct regimes: (i) an early time transient regime, (ii) an intermediate saturation regime, (iii) a progressive sharpening regime, and (iv) a late time "edge of stability" regime.
arXiv Detail & Related papers (2023-02-23T18:59:30Z)
- SGD with Large Step Sizes Learns Sparse Features [22.959258640051342]
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks.
We show that the longer large step sizes keep SGD high in the loss landscape, the better the implicit regularization can operate and find sparse representations.
arXiv Detail & Related papers (2022-10-11T11:00:04Z)
- Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z)
- A PDE-based Explanation of Extreme Numerical Sensitivities and Edge of Stability in Training Neural Networks [12.355137704908042]
We show restrained numerical instabilities in current training practices of deep networks with stochastic gradient descent (SGD).
We do this by presenting a theoretical framework based on the numerical analysis of partial differential equations (PDEs), and analyzing the gradient descent PDE of convolutional neural networks (CNNs).
We show this is a consequence of the non-linear PDE associated with the descent of the CNN, whose local linearization changes when over-driving the step size of the discretization, resulting in a stabilizing effect.
arXiv Detail & Related papers (2022-06-04T14:54:05Z)
- Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability [94.4070247697549]
Full-batch gradient descent on neural network training objectives operates in a regime we call the Edge of Stability.
In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the numerical value $2/\text{(step size)}$, and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales.
arXiv Detail & Related papers (2021-02-26T22:08:19Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
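One cheap way to implement such a Hessian-norm diagnostic is a Hutchinson-style probe; the sketch below estimates the Frobenius norm via finite-difference Hessian-vector products. The paper's exact estimator and choice of norm may differ, and the test function is an assumption.

```python
# Minimal sketch (the paper's exact estimator and norm may differ): a
# Hutchinson-style probe of the Hessian Frobenius norm, using the identity
# E||H v||^2 = ||H||_F^2 for v ~ N(0, I) together with finite-difference
# Hessian-vector products, as one cheap curvature-tracking diagnostic.
import numpy as np

rng = np.random.default_rng(1)

def hessian_fro_norm(grad_fn, theta, probes=64, eps=1e-4):
    g0 = grad_fn(theta)
    acc = 0.0
    for _ in range(probes):
        v = rng.standard_normal(theta.shape)
        hv = (grad_fn(theta + eps * v) - g0) / eps   # finite-difference HVP
        acc += hv @ hv
    return np.sqrt(acc / probes)

# Sanity check on a quadratic with known Hessian A: ||A||_F should be recovered.
A = np.diag([3.0, 4.0])
grad_fn = lambda th: A @ th
print(hessian_fro_norm(grad_fn, np.array([1.0, -1.0])), np.linalg.norm(A))
```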
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
- Layer-wise Conditioning Analysis in Exploring the Learning Dynamics of DNNs [115.35745188028169]
We extend conditioning analysis to deep neural networks (DNNs) in order to investigate their learning dynamics.
We show that batch normalization (BN) can stabilize the training, but sometimes result in the false impression of a local minimum.
We experimentally observe that BN can improve the layer-wise conditioning of the optimization problem.
arXiv Detail & Related papers (2020-02-25T11:40:27Z)
- The Break-Even Point on Optimization Trajectories of Deep Neural Networks [64.7563588124004]
We argue for the existence of a "break-even" point on the optimization trajectory of deep neural networks.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)