Universal Sharpness Dynamics in Neural Network Training: Fixed Point
Analysis, Edge of Stability, and Route to Chaos
- URL: http://arxiv.org/abs/2311.02076v1
- Date: Fri, 3 Nov 2023 17:59:40 GMT
- Title: Universal Sharpness Dynamics in Neural Network Training: Fixed Point
Analysis, Edge of Stability, and Route to Chaos
- Authors: Dayal Singh Kalra, Tianyu He, Maissam Barkeshli
- Abstract summary: In gradient descent dynamics of neural networks, the top eigenvalue of the Hessian of the loss (sharpness) displays a variety of robust phenomena throughout training.
We demonstrate that a simple $2$-layer linear network (UV model) trained on a single training example exhibits all of the essential sharpness phenomenology observed in real-world scenarios.
- Score: 6.579523168465526
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In gradient descent dynamics of neural networks, the top eigenvalue of the
Hessian of the loss (sharpness) displays a variety of robust phenomena
throughout training. This includes an early regime in which the sharpness may
decrease (sharpness reduction), and later-time behavior such as progressive
sharpening and the edge of stability. We demonstrate
that a simple $2$-layer linear network (UV model) trained on a single training
example exhibits all of the essential sharpness phenomenology observed in
real-world scenarios. By analyzing the structure of dynamical fixed points in
function space and the vector field of function updates, we uncover the
underlying mechanisms behind these sharpness trends. Our analysis reveals (i)
the mechanism behind early sharpness reduction and progressive sharpening, (ii)
the required conditions for edge of stability, and (iii) a period-doubling
route to chaos on the edge of stability manifold as learning rate is increased.
Finally, we demonstrate that various predictions from this simplified model
generalize to real-world scenarios and discuss its limitations.
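To make the setup concrete, here is a minimal sketch (an illustration, not code from the paper) of the scalar UV model $f(x) = uvx$ trained by full-batch gradient descent on a single example with squared loss, tracking the sharpness as the top eigenvalue of the $2\times 2$ Hessian in $(u, v)$. The example values $(x, y)$, the learning rate, and the seed are arbitrary choices.

```python
import numpy as np

# Scalar UV model f(x) = u*v*x trained on one example (x, y) with
# squared loss L = 0.5 * (u*v*x - y)^2. Sharpness = top eigenvalue of
# the 2x2 Hessian of L with respect to (u, v).
x, y = 1.0, 2.0
eta = 0.3                        # learning rate (illustrative choice)
rng = np.random.default_rng(0)
u, v = rng.normal(size=2) * 0.5  # small random initialization

def sharpness(u, v, r):
    # Hessian entries: d2L/du2 = (v*x)^2, d2L/dv2 = (u*x)^2,
    # d2L/dudv = u*v*x^2 + r*x, where r is the residual.
    H = np.array([[(v * x) ** 2, u * v * x**2 + r * x],
                  [u * v * x**2 + r * x, (u * x) ** 2]])
    return np.linalg.eigvalsh(H)[-1]   # eigenvalues ascending; take the top

for step in range(101):
    r = u * v * x - y                  # residual
    if step % 10 == 0:
        print(f"step {step:3d}  loss {0.5 * r**2:8.4f}  "
              f"sharpness {sharpness(u, v, r):7.4f}  (2/eta = {2 / eta:.3f})")
    # simultaneous full-batch GD update (RHS uses the old u, v)
    u, v = u - eta * r * v * x, v - eta * r * u * x

# For this (x, y), every global minimum uv = y has sharpness
# u^2 + v^2 >= 2*y = 4, so any eta > 0.5 cannot settle at a minimum and
# instead drives training to the edge of stability.
```

Raising $\eta$ beyond $0.5$ in this sketch produces edge-of-stability oscillations; this is the regime in which the paper locates the period-doubling route to chaos as the learning rate is increased further.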
Related papers
- Exploring the Stability Gap in Continual Learning: The Role of the Classification Head [0.6749750044497732]
The stability gap is a phenomenon in continual learning where, while training on a new task, models initially lose performance on previously learned tasks before partially recovering.
We introduce the nearest-mean classifier (NMC) as a tool to attribute the influence of the backbone and the classification head on the stability gap.
Our experiments demonstrate that NMC not only improves final performance, but also significantly enhances training stability across various continual learning benchmarks.
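For reference, a nearest-mean classifier can be sketched in a few lines (a generic version assuming Euclidean distance in feature space; the paper's exact setup may differ):

```python
import numpy as np

def nmc_fit(features, labels):
    """One prototype per class: the mean feature vector of that class."""
    classes = np.unique(labels)
    prototypes = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, prototypes

def nmc_predict(features, classes, prototypes):
    """Assign each sample to the class with the nearest prototype."""
    # dists[i, j] = Euclidean distance from sample i to prototype j
    dists = np.linalg.norm(features[:, None, :] - prototypes[None, :, :], axis=-1)
    return classes[dists.argmin(axis=1)]
```

Because the prototypes depend only on the backbone's features and involve no trained head, comparing NMC accuracy against the usual classification head separates the head's contribution to the stability gap from the backbone's.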
arXiv Detail & Related papers (2024-11-06T15:45:01Z)
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
- Investigating the Edge of Stability Phenomenon in Reinforcement Learning [20.631461205889487]
We explore the edge of stability phenomenon in reinforcement learning (RL).
Despite significant differences from supervised learning, the edge of stability phenomenon can be present in off-policy deep RL.
Our results suggest that, while neural network structure can lead to optimisation dynamics that transfer between problem domains, certain aspects of deep RL optimisation can differentiate it from domains such as supervised learning.
arXiv Detail & Related papers (2023-07-09T15:46:27Z)
- Trajectory Alignment: Understanding the Edge of Stability Phenomenon via Bifurcation Theory [14.141453107129403]
We study the evolution of the largest eigenvalue of the loss Hessian, also known as sharpness, along the gradient descent trajectory.
The sharpness increases during the early phase of training and eventually saturates close to the threshold of $2/\text{(step size)}$.
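The origin of this threshold is the textbook stability condition for gradient descent on a quadratic: for $L(\theta) = \frac{\lambda}{2}\theta^2$ with step size $\eta$, the update $\theta_{t+1} = \theta_t - \eta L'(\theta_t) = (1 - \eta\lambda)\,\theta_t$ contracts iff $|1 - \eta\lambda| < 1$, i.e. $\lambda < 2/\eta$; a sharpness hovering near $2/\text{(step size)}$ therefore sits exactly at the boundary of stable descent.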
arXiv Detail & Related papers (2023-07-09T15:16:45Z)
- Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, and for both we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z)
- Analyzing Sharpness along GD Trajectory: Progressive Sharpening and Edge of Stability [8.492339290649031]
This paper aims to analyze the GD dynamics and the sharpness along the optimization trajectory.
We empirically identify the norm of output layer weight as an interesting indicator of sharpness dynamics.
We provide a theoretical proof of the sharpness behavior in the EOS regime for two-layer fully-connected linear neural networks.
arXiv Detail & Related papers (2022-07-26T06:37:58Z)
- Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability [94.4070247697549]
Full-batch gradient descent on neural network training objectives operates in a regime we call the Edge of Stability.
In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the numerical value $2/\text{(step size)}$, and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales.
arXiv Detail & Related papers (2021-02-26T22:08:19Z)
- Gradient Starvation: A Learning Proclivity in Neural Networks [97.02382916372594]
Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task.
This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
arXiv Detail & Related papers (2020-11-18T18:52:08Z)
- Layer-wise Conditioning Analysis in Exploring the Learning Dynamics of DNNs [115.35745188028169]
We extend conditioning analysis to deep neural networks (DNNs) in order to investigate their learning dynamics.
We show that batch normalization (BN) can stabilize training, but sometimes results in the false impression of a local minimum.
We experimentally observe that BN can improve the layer-wise conditioning of the optimization problem.
arXiv Detail & Related papers (2020-02-25T11:40:27Z)
- Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)