Analyzing Sharpness along GD Trajectory: Progressive Sharpening and Edge of Stability
- URL: http://arxiv.org/abs/2207.12678v1
- Date: Tue, 26 Jul 2022 06:37:58 GMT
- Title: Analyzing Sharpness along GD Trajectory: Progressive Sharpening and Edge of Stability
- Authors: Zhouzi Li, Zixuan Wang, Jian Li
- Abstract summary: This paper aims to analyze the GD dynamics and the sharpness along the optimization trajectory.
We empirically identify the norm of output layer weight as an interesting indicator of sharpness dynamics.
We provide a theoretical proof of the sharpness behavior in the EOS regime for two-layer fully-connected linear neural networks.
- Score: 8.492339290649031
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent findings (e.g., arXiv:2103.00065) demonstrate that modern neural
networks trained by full-batch gradient descent typically enter a regime called
Edge of Stability (EOS). In this regime, the sharpness, i.e., the maximum
Hessian eigenvalue, first increases to the value 2/(step size) (the progressive
sharpening phase) and then oscillates around this value (the EOS phase). This
paper aims to analyze the GD dynamics and the sharpness along the optimization
trajectory. Our analysis naturally divides the GD trajectory into four phases
depending on the change of the sharpness. We empirically identify the norm of
output layer weight as an interesting indicator of sharpness dynamics. Based on
this empirical observation, we attempt to theoretically and empirically explain
the dynamics of various key quantities that lead to the change of sharpness in
each phase of EOS. Moreover, based on certain assumptions, we provide a
theoretical proof of the sharpness behavior in the EOS regime for two-layer
fully-connected linear neural networks. We also discuss some other empirical
findings and the limitations of our theoretical results.
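The quantities in the abstract are straightforward to monitor numerically. Below is a minimal sketch (not the authors' code) of full-batch GD on a two-layer fully-connected linear network, the setting of the paper's theory: the sharpness is estimated by power iteration on finite-difference Hessian-vector products, and the output-layer weight norm is tracked alongside it. All dimensions, data, and the step size are illustrative assumptions, and whether a given run reaches the $2/\text{(step size)}$ threshold depends on them.

```python
# Minimal sketch (not the authors' code): full-batch GD on a two-layer
# fully-connected linear network. The sharpness (top Hessian eigenvalue)
# is estimated by power iteration on finite-difference Hessian-vector
# products, and ||W2||_F (the output-layer weight norm the paper uses as
# an indicator) is tracked. Sizes, data, and step size are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, h, k, n = 5, 8, 3, 20                 # input dim, hidden dim, output dim, samples
X = rng.standard_normal((d, n))
Y = rng.standard_normal((k, n))
W1 = 0.1 * rng.standard_normal((h, d))
W2 = 0.1 * rng.standard_normal((k, h))
eta = 0.02                               # step size; EOS threshold is 2/eta

def grads(W1, W2):
    """Gradients of L = ||W2 @ W1 @ X - Y||_F^2 / (2n)."""
    R = W2 @ W1 @ X - Y
    return W2.T @ R @ X.T / n, R @ (W1 @ X).T / n

def flat(g1, g2):
    return np.concatenate([g1.ravel(), g2.ravel()])

def sharpness(W1, W2, iters=50, eps=1e-4):
    """Top Hessian eigenvalue via power iteration on finite-difference HVPs."""
    v = rng.standard_normal(W1.size + W2.size)
    v /= np.linalg.norm(v)
    g0, lam = flat(*grads(W1, W2)), 0.0
    for _ in range(iters):
        dW1 = eps * v[:W1.size].reshape(W1.shape)
        dW2 = eps * v[W1.size:].reshape(W2.shape)
        hv = (flat(*grads(W1 + dW1, W2 + dW2)) - g0) / eps
        lam = float(v @ hv)              # Rayleigh quotient estimate
        v = hv / (np.linalg.norm(hv) + 1e-12)
    return lam

for t in range(2001):                    # full-batch gradient descent
    g1, g2 = grads(W1, W2)
    W1, W2 = W1 - eta * g1, W2 - eta * g2
    if t % 200 == 0:
        print(f"step {t:4d}  sharpness {sharpness(W1, W2):9.3f}  "
              f"2/eta {2 / eta:6.1f}  ||W2||_F {np.linalg.norm(W2):6.3f}")
```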
Related papers
- Universal Sharpness Dynamics in Neural Network Training: Fixed Point Analysis, Edge of Stability, and Route to Chaos [6.579523168465526]
In gradient descent dynamics of neural networks, the top eigenvalue of the Hessian of the loss (the sharpness) displays a variety of robust phenomena throughout training.
We demonstrate that a simple $2$-layer linear network (UV model) trained on a single training example exhibits all of the essential sharpness phenomenology observed in real-world scenarios.
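Because the UV model is fully explicit, its sharpness can be tracked in closed form along GD. A minimal sketch, assuming the scalar form $L(u,v) = \frac{1}{2}(uvx - y)^2$ on a single example $(x, y)$; the paper's exact parametrization and hyperparameters may differ:

```python
# Minimal sketch of a scalar UV model (an assumption about its exact form):
# GD on L(u, v) = (u*v*x - y)^2 / 2 for a single example (x, y). The 2x2
# Hessian is explicit, so the sharpness is tracked exactly at every step.
import math

x, y = 1.0, 2.0
u, v = 0.3, 0.4            # small initialization
eta = 0.45                 # 2/eta ~ 4.44; the flattest minimum here has sharpness 4,
                           # so eta > 0.5 would leave no stable minimum (EOS oscillation)

for t in range(41):
    r = u * v * x - y      # residual
    # Hessian of L: [[(v*x)^2, x*(2*u*v*x - y)], [x*(2*u*v*x - y), (u*x)^2]]
    a, c = (v * x) ** 2, (u * x) ** 2
    b = x * (2 * u * v * x - y)
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)   # top eigenvalue
    if t % 5 == 0:
        print(f"step {t:2d}  loss {r * r / 2:7.4f}  sharpness {lam:6.3f}  2/eta {2 / eta:5.2f}")
    u, v = u - eta * r * v * x, v - eta * r * u * x              # simultaneous GD update
```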
arXiv Detail & Related papers (2023-11-03T17:59:40Z)
- Trajectory Alignment: Understanding the Edge of Stability Phenomenon via Bifurcation Theory [14.141453107129403]
We study the evolution of the largest eigenvalue of the loss Hessian, also known as sharpness, along the gradient descent trajectory.
The sharpness increases in the early phase of training and eventually saturates close to the threshold of $2/\text{(step size)}$.
arXiv Detail & Related papers (2023-07-09T15:16:45Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been shown to be effective in solving forward and inverse differential equation problems.
However, PINN training can fail when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
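For context on the implicit update, here is a minimal sketch of a backward-Euler (implicit) gradient step on a stiff quadratic; it is a generic illustration rather than the paper's exact ISGD recipe, and the proximal subproblem, step size, and test loss are all assumptions:

```python
# Minimal sketch of an implicit (backward-Euler) gradient step, a generic
# illustration, not the paper's exact ISGD recipe. The implicit update
#     theta_{k+1} = theta_k - eta * grad L(theta_{k+1})
# is the minimizer of L(theta) + ||theta - theta_k||^2 / (2*eta), solved
# here as a proximal subproblem. The stiff quadratic loss is an assumption.
import numpy as np
from scipy.optimize import minimize

A = np.diag([100.0, 1.0])                 # sharpness 100: explicit GD needs eta < 0.02
loss = lambda th: 0.5 * th @ A @ th

def implicit_step(theta, eta):
    prox = lambda th: loss(th) + np.sum((th - theta) ** 2) / (2 * eta)
    return minimize(prox, theta).x        # solve the proximal subproblem

theta = np.array([1.0, 1.0])
eta = 0.1                                 # 5x the explicit stability limit, yet stable
for k in range(8):
    theta = implicit_step(theta, eta)
    print(f"step {k}  theta {theta}  loss {loss(theta):.6f}")
```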
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
- Phase diagram of early training dynamics in deep neural networks: effect of the learning rate, depth, and width [0.0]
We systematically analyze optimization dynamics in deep neural networks (DNNs) trained with stochastic gradient descent (SGD).
We find that the dynamics can show four distinct regimes: (i) an early time transient regime, (ii) an intermediate saturation regime, (iii) a progressive sharpening regime, and (iv) a late time "edge of stability" regime.
arXiv Detail & Related papers (2023-02-23T18:59:30Z)
- SGD with Large Step Sizes Learns Sparse Features [22.959258640051342]
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks.
We show that the longer large step sizes keep SGD high in the loss landscape, the better the implicit regularization can operate and find sparse representations.
arXiv Detail & Related papers (2022-10-11T11:00:04Z)
- Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z)
- A PDE-based Explanation of Extreme Numerical Sensitivities and Edge of Stability in Training Neural Networks [12.355137704908042]
We show restrained numerical instabilities in current training practices of deep networks with stochastic gradient descent (SGD).
We do this by presenting a theoretical framework based on the numerical analysis of partial differential equations (PDEs), and analyzing the gradient descent PDE of convolutional neural networks (CNNs).
We show this is a consequence of the non-linear PDE associated with the descent of the CNN, whose local linearization changes when over-driving the step size of the discretization, resulting in a stabilizing effect.
arXiv Detail & Related papers (2022-06-04T14:54:05Z)
- Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability [94.4070247697549]
Full-batch gradient descent on neural network training objectives operates in a regime we call the Edge of Stability.
In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the numerical value $2/\text{(step size)}$, and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales.
arXiv Detail & Related papers (2021-02-26T22:08:19Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
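One cheap way to implement such a Hessian-norm diagnostic is a Hutchinson-style probe; the sketch below estimates the Frobenius norm via finite-difference Hessian-vector products. The paper's exact estimator and choice of norm may differ, and the test function is an assumption.

```python
# Minimal sketch (the paper's exact estimator and norm may differ): a
# Hutchinson-style probe of the Hessian Frobenius norm, using the identity
# E||H v||^2 = ||H||_F^2 for v ~ N(0, I) together with finite-difference
# Hessian-vector products, as one cheap curvature-tracking diagnostic.
import numpy as np

rng = np.random.default_rng(1)

def hessian_fro_norm(grad_fn, theta, probes=64, eps=1e-4):
    g0 = grad_fn(theta)
    acc = 0.0
    for _ in range(probes):
        v = rng.standard_normal(theta.shape)
        hv = (grad_fn(theta + eps * v) - g0) / eps   # finite-difference HVP
        acc += hv @ hv
    return np.sqrt(acc / probes)

# Sanity check on a quadratic with known Hessian A: ||A||_F should be recovered.
A = np.diag([3.0, 4.0])
grad_fn = lambda th: A @ th
print(hessian_fro_norm(grad_fn, np.array([1.0, -1.0])), np.linalg.norm(A))
```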
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
- Layer-wise Conditioning Analysis in Exploring the Learning Dynamics of DNNs [115.35745188028169]
We extend conditioning analysis to deep neural networks (DNNs) in order to investigate their learning dynamics.
We show that batch normalization (BN) can stabilize the training, but sometimes result in the false impression of a local minimum.
We experimentally observe that BN can improve the layer-wise conditioning of the optimization problem.
arXiv Detail & Related papers (2020-02-25T11:40:27Z)
- The Break-Even Point on Optimization Trajectories of Deep Neural Networks [64.7563588124004]
We argue for the existence of a "break-even" point on the optimization trajectory of deep neural networks.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)