Trajectory Alignment: Understanding the Edge of Stability Phenomenon via
Bifurcation Theory
- URL: http://arxiv.org/abs/2307.04204v2
- Date: Thu, 26 Oct 2023 15:16:38 GMT
- Title: Trajectory Alignment: Understanding the Edge of Stability Phenomenon via
Bifurcation Theory
- Authors: Minhak Song, Chulhee Yun
- Abstract summary: We study the evolution of the largest eigenvalue of the loss Hessian, also known as sharpness, along the gradient descent trajectory.
The sharpness increases at the early phase of training, and eventually saturates close to the threshold of $2 / \text{(step size)}$.
- Score: 14.141453107129403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cohen et al. (2021) empirically study the evolution of the largest eigenvalue
of the loss Hessian, also known as sharpness, along the gradient descent (GD)
trajectory and observe the Edge of Stability (EoS) phenomenon. The sharpness
increases at the early phase of training (referred to as progressive
sharpening), and eventually saturates close to the threshold of $2 /
\text{(step size)}$. In this paper, we start by demonstrating through empirical
studies that when the EoS phenomenon occurs, different GD trajectories (after a
proper reparameterization) align on a specific bifurcation diagram independent
of initialization. We then rigorously prove this trajectory alignment
phenomenon for a two-layer fully-connected linear network and a single-neuron
nonlinear network trained with a single data point. Our trajectory alignment
analysis establishes both progressive sharpening and EoS phenomena,
encompassing and extending recent findings in the literature.
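As a concrete illustration of the kind of setting analyzed here, the sketch below runs full-batch GD on a two-layer linear model with a single data point and tracks the sharpness (largest eigenvalue of the exact $2 \times 2$ Hessian) against the $2 / \text{(step size)}$ threshold. The data point, initialization, and step size are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch (illustrative toy setup, not the paper's exact construction):
# full-batch GD on the two-layer linear model (u, v) -> u*v*x fit to one data
# point (x, y), tracking sharpness = largest Hessian eigenvalue vs. 2/eta.
import numpy as np

x, y = 1.0, 2.0       # single training example (assumed values)
eta = 0.6             # step size; the EoS threshold is 2 / eta ~ 3.33

def loss(u, v):
    return 0.5 * (u * v * x - y) ** 2

def grad(u, v):
    r = u * v * x - y                          # residual
    return np.array([r * v * x, r * u * x])

def sharpness(u, v):
    # exact 2x2 Hessian of the loss in (u, v)
    h = np.array([[v**2 * x**2, (2 * u * v * x - y) * x],
                  [(2 * u * v * x - y) * x, u**2 * x**2]])
    return np.linalg.eigvalsh(h)[-1]           # largest eigenvalue

u, v = 0.1, 0.1                                # small balanced initialization
for step in range(60):
    if step % 10 == 0:
        print(f"step {step:2d}  loss {loss(u, v):8.4f}  "
              f"sharpness {sharpness(u, v):6.3f}  threshold {2 / eta:.3f}")
    du, dv = grad(u, v)
    u, v = u - eta * du, v - eta * dv
```

With these values the minimum reachable from a balanced start (u = v = sqrt(2)) has sharpness 4, above 2 / eta ~ 3.33, so GD cannot settle there and instead oscillates around it, which is the regime the EoS literature describes.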
Related papers
- Large Stepsize Gradient Descent for Non-Homogeneous Two-Layer Networks: Margin Improvement and Fast Optimization [41.20978920228298]
We show that the second phase begins once the empirical risk falls below a certain threshold, dependent on the stepsize.
We also show that the normalized margin grows nearly monotonically in the second phase, demonstrating an implicit bias of GD in training non-homogeneous predictors.
Our analysis applies to networks of any width, beyond the well-known neural tangent kernel and mean-field regimes.
arXiv Detail & Related papers (2024-06-12T21:33:22Z) - On the Convergence of Gradient Descent for Large Learning Rates [55.33626480243135]
We show that convergence is impossible when a fixed step size is used.
We provide a proof of this in the case of linear neural networks with a squared loss.
We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity of the gradient (a short numerical illustration of this step-size threshold appears after this list).
arXiv Detail & Related papers (2024-02-20T16:01:42Z) - Universal Sharpness Dynamics in Neural Network Training: Fixed Point
Analysis, Edge of Stability, and Route to Chaos [6.579523168465526]
In gradient descent dynamics of neural networks, the top eigenvalue of the Hessian of the loss (sharpness) displays a variety of robust phenomena throughout training.
We demonstrate that a simple $2$-layer linear network (UV model) trained on a single training example exhibits all of the essential sharpness phenomenology observed in real-world scenarios.
arXiv Detail & Related papers (2023-11-03T17:59:40Z) - Implicit Bias of Gradient Descent for Logistic Regression at the Edge of
Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS).
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the variance of the random initialization is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Analyzing Sharpness along GD Trajectory: Progressive Sharpening and Edge
of Stability [8.492339290649031]
This paper aims to analyze the GD dynamics and the sharpness along the optimization trajectory.
We empirically identify the norm of the output layer weights as an interesting indicator of the sharpness dynamics.
We provide a theoretical proof of the sharpness behavior in the EoS regime for two-layer fully-connected linear neural networks.
arXiv Detail & Related papers (2022-07-26T06:37:58Z) - Beyond the Edge of Stability via Two-step Gradient Updates [49.03389279816152]
Gradient Descent (GD) is a powerful workhorse of modern machine learning.
GD's ability to find local minimisers is only guaranteed for losses with Lipschitz gradients.
This work focuses on simple, yet representative, learning problems via analysis of two-step gradient updates.
arXiv Detail & Related papers (2022-06-08T21:32:50Z) - Vanishing Curvature and the Power of Adaptive Methods in Randomly
Initialized Deep Networks [30.467121747150816]
This paper revisits the so-called vanishing gradient phenomenon, which commonly occurs in randomly initialized deep neural networks.
We first show that vanishing gradients cannot be circumvented when the network width scales with less than O(depth).
arXiv Detail & Related papers (2021-06-07T16:29:59Z) - Directional Convergence Analysis under Spherically Symmetric
Distribution [21.145823611499104]
We consider the fundamental problem of learning linear predictors (i.e., separable datasets with zero margin) using neural networks with gradient flow or gradient descent.
We show directional convergence guarantees with exact convergence rate for two-layer non-linear networks with only two hidden nodes, and (deep) linear networks.
arXiv Detail & Related papers (2021-05-09T08:59:58Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z) - The Break-Even Point on Optimization Trajectories of Deep Neural
Networks [64.7563588124004]
We argue for the existence of the "break-even" point on this trajectory.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)
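The $2 / \text{(step size)}$ threshold that recurs throughout this list comes from the classical stability analysis of GD on a quadratic: with curvature (sharpness) $\lambda$ and step size $\eta$, each update scales the distance to the minimum by $|1 - \eta\lambda|$, which exceeds $1$ exactly when $\lambda > 2/\eta$. The sketch below, with assumed curvature and step-size values, illustrates the divergence claim highlighted in the large-learning-rate entry above.

```python
# Sketch of the classical step-size threshold on a 1-D quadratic
# f(w) = 0.5 * lam * w**2: GD maps w -> (1 - eta * lam) * w, so the iterates
# diverge exactly when eta * lam > 2, i.e. when sharpness lam exceeds 2 / eta.
lam = 3.0                       # curvature (sharpness); assumed value
for eta in (0.5, 0.6, 0.7):     # thresholds 2/eta = 4.00, 3.33, 2.86
    w = 1.0
    for _ in range(100):
        w -= eta * lam * w      # gradient step on f
    verdict = "converges" if abs(w) < 1e-6 else "diverges"
    print(f"eta = {eta:.1f}  eta*lam = {eta * lam:.2f}  "
          f"|w_100| = {abs(w):.3e}  -> {verdict}")
```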