Gradient Descent on Neural Networks Typically Occurs at the Edge of
Stability
- URL: http://arxiv.org/abs/2103.00065v1
- Date: Fri, 26 Feb 2021 22:08:19 GMT
- Title: Gradient Descent on Neural Networks Typically Occurs at the Edge of
Stability
- Authors: Jeremy M. Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter, Ameet
Talwalkar
- Abstract summary: Full-batch gradient descent on neural network training objectives operates in a regime we call the Edge of Stability.
In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the numerical value $2 / \text{(step size)}$, and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales.
- Score: 94.4070247697549
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We empirically demonstrate that full-batch gradient descent on neural network
training objectives typically operates in a regime we call the Edge of
Stability. In this regime, the maximum eigenvalue of the training loss Hessian
hovers just above the numerical value $2 / \text{(step size)}$, and the
training loss behaves non-monotonically over short timescales, yet consistently
decreases over long timescales. Since this behavior is inconsistent with
several widespread presumptions in the field of optimization, our findings
raise questions as to whether these presumptions are relevant to neural network
training. We hope that our findings will inspire future efforts aimed at
rigorously understanding optimization at the Edge of Stability. Code is
available at https://github.com/locuslab/edge-of-stability.
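The threshold in the abstract has a simple quadratic intuition: for gradient descent with step size $\eta$ along a direction of curvature $\lambda$, the error along that direction is multiplied by $(1 - \eta\lambda)$ each step, so the iterates diverge along it once $\lambda > 2/\eta$. The sketch below is a hypothetical illustration rather than the authors' released code: it estimates the sharpness (the largest Hessian eigenvalue of the full-batch training loss) by power iteration on Hessian-vector products and compares it with $2 / \text{(step size)}$; the model, loss, data tensors, and iteration count are placeholder assumptions.

```python
import torch

def estimate_sharpness(model, loss_fn, inputs, targets, n_iters=50):
    """Rough sketch: largest-magnitude Hessian eigenvalue of the full-batch
    loss, estimated by power iteration on Hessian-vector products."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(inputs), targets)
    # Keep the graph so the gradients can be differentiated a second time.
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Start from a random unit vector shaped like the parameters.
    v = [torch.randn_like(p) for p in params]
    v_norm = torch.sqrt(sum((x ** 2).sum() for x in v))
    v = [x / v_norm for x in v]

    eigenvalue = torch.tensor(0.0)
    for _ in range(n_iters):
        # Hessian-vector product: Hv = d(grad . v) / d(params).
        grad_dot_v = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(grad_dot_v, params, retain_graph=True)
        # Rayleigh quotient v^T H v is the current eigenvalue estimate.
        eigenvalue = sum((h * x).sum() for h, x in zip(hv, v))
        hv_norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / (hv_norm + 1e-12) for h in hv]
    return eigenvalue.item()

# Hypothetical usage: at the Edge of Stability the estimate should hover
# just above 2 / step_size.
# step_size = 0.01
# sharpness = estimate_sharpness(model, torch.nn.functional.cross_entropy, X, y)
# print(sharpness, 2.0 / step_size)
```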
Related papers
- Training on the Edge of Stability Is Caused by Layerwise Jacobian Alignment [0.0]
We use an exponential solver to train a neural network without entering the edge of stability.
We demonstrate experimentally that the increase in the sharpness of the Hessian matrix is caused by the layerwise Jacobian matrices of the network becoming aligned.
arXiv Detail & Related papers (2024-05-31T18:37:06Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Theoretical Characterization of How Neural Network Pruning Affects its Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z)
- Temporal Efficient Training of Spiking Neural Network via Gradient Re-weighting [29.685909045226847]
Brain-inspired spiking neuron networks (SNNs) have attracted widespread research interest because of their event-driven and energy-efficient characteristics.
The current direct training approach with surrogate gradients (SG) results in SNNs with poor generalizability.
We introduce the temporal efficient training (TET) approach to compensate for the loss of momentum in the gradient descent with SG.
arXiv Detail & Related papers (2022-02-24T08:02:37Z)
- Navigating Local Minima in Quantized Spiking Neural Networks [3.1351527202068445]
Spiking and Quantized Neural Networks (NNs) are becoming exceedingly important for hyper-efficient implementations of Deep Learning (DL) algorithms.
These networks face challenges when trained using error backpropagation, due to the absence of gradient signals when applying hard thresholds.
This paper presents a systematic evaluation of a cosine-annealed learning-rate (LR) schedule coupled with weight-independent adaptive moment estimation; a minimal configuration sketch follows this entry.
arXiv Detail & Related papers (2022-02-15T06:42:25Z)
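Following up on the schedule named in the entry above: a minimal, hypothetical PyTorch configuration combining an adaptive-moment optimizer with a cosine-annealed learning rate. It only illustrates that general setup, not the paper's actual training recipe; the model, epoch count, and learning-rate values are placeholder assumptions.

```python
import torch

# Hypothetical placeholders: any torch.nn.Module, epoch count, and LR work here.
model = torch.nn.Linear(784, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Cosine annealing decays the learning rate from 1e-3 toward eta_min over T_max epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-5)

for epoch in range(100):
    # ... run one full epoch of training with `optimizer` here ...
    scheduler.step()  # advance the cosine schedule once per epoch
```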
- Convergence rates for gradient descent in the training of overparameterized artificial neural networks with biases [3.198144010381572]
In recent years, artificial neural networks have developed into a powerful tool for dealing with a multitude of problems for which classical solution approaches reach their limits.
It is still unclear why randomly initialized gradient descent algorithms succeed in training them.
arXiv Detail & Related papers (2021-02-23T18:17:47Z)
- When and why PINNs fail to train: A neural tangent kernel perspective [2.1485350418225244]
We derive the Neural Tangent Kernel (NTK) of PINNs and prove that, under appropriate conditions, it converges to a deterministic kernel that stays constant during training in the infinite-width limit.
We find a remarkable discrepancy in the convergence rate of the different loss components contributing to the total training error.
We propose a novel gradient descent algorithm that utilizes the eigenvalues of the NTK to adaptively calibrate the convergence rate of the total training error.
arXiv Detail & Related papers (2020-07-28T23:44:56Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
- The Break-Even Point on Optimization Trajectories of Deep Neural Networks [64.7563588124004]
We argue for the existence of a "break-even" point on the optimization trajectory of deep neural networks.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)
- Towards Unified INT8 Training for Convolutional Neural Network [83.15673050981624]
We build a unified 8-bit (INT8) training framework for common convolutional neural networks.
First, we empirically identify four distinctive characteristics of gradients, which provide insightful clues for gradient quantization.
We propose two universal techniques, including Direction Sensitive Gradient Clipping, which reduces the direction deviation of gradients.
arXiv Detail & Related papers (2019-12-29T08:37:53Z)