Sharpness-Aware Minimization and the Edge of Stability
- URL: http://arxiv.org/abs/2309.12488v6
- Date: Wed, 5 Jun 2024 20:31:45 GMT
- Title: Sharpness-Aware Minimization and the Edge of Stability
- Authors: Philip M. Long, Peter L. Bartlett
- Abstract summary: We show that when training a neural network with gradient descent (GD) with a step size $\eta$, the norm of the Hessian of the loss grows until it approximately reaches $2/\eta$, after which it fluctuates around this value.
We perform a similar calculation to arrive at an "edge of stability" for Sharpness-Aware Minimization (SAM).
Unlike the case for GD, the resulting SAM-edge depends on the norm of the gradient. Using three deep learning training tasks, we see empirically that SAM operates on the edge of stability identified by this analysis.
- Score: 35.27697224229969
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent experiments have shown that, often, when training a neural network with gradient descent (GD) with a step size $\eta$, the operator norm of the Hessian of the loss grows until it approximately reaches $2/\eta$, after which it fluctuates around this value. The quantity $2/\eta$ has been called the "edge of stability" based on consideration of a local quadratic approximation of the loss. We perform a similar calculation to arrive at an "edge of stability" for Sharpness-Aware Minimization (SAM), a variant of GD which has been shown to improve its generalization. Unlike the case for GD, the resulting SAM-edge depends on the norm of the gradient. Using three deep learning training tasks, we see empirically that SAM operates on the edge of stability identified by this analysis.
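To make the object of study concrete, SAM's two-step update (one-step gradient ascent to a nearby worst-case point, then descent using the gradient evaluated there) can be sketched in a few lines. The quadratic loss, step size, and perturbation radius below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sam_step(theta, grad_fn, eta=0.01, rho=0.05):
    """One SAM update: ascend along the normalized gradient to a nearby
    worst-case point, then descend using the gradient evaluated there."""
    g = grad_fn(theta)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # one-step ascent direction
    return theta - eta * grad_fn(theta + eps)    # descent from the perturbed point

# Illustrative quadratic loss L(theta) = 0.5 * theta^T H theta
H = np.diag([10.0, 1.0])
grad = lambda th: H @ th

theta = np.array([1.0, 1.0])
for _ in range(100):
    theta = sam_step(theta, grad)
```

On this toy quadratic the iterates shrink toward the minimum; the paper's observation is that the stability threshold SAM operates at depends on the gradient norm entering the perturbation $\rho g/\|g\|$, unlike the gradient-independent $2/\eta$ threshold for GD.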
Related papers
- Friendly Sharpness-Aware Minimization [62.57515991835801]
Sharpness-Aware Minimization (SAM) has been instrumental in improving deep neural network training by minimizing both training loss and loss sharpness.
We investigate the key role of batch-specific gradient noise within the adversarial perturbation, i.e., the current minibatch gradient.
By decomposing the adversarial perturbation's gradient into full-gradient and noise components, we find that relying solely on the full gradient component degrades generalization, while excluding it leads to improved performance.
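The decomposition described above can be pictured with a toy minibatch gradient; the vectors and noise scale here are illustrative assumptions, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
full_grad = np.array([1.0, -2.0])                          # toy full-batch gradient
minibatch_grad = full_grad + 0.3 * rng.standard_normal(2)  # full gradient plus batch noise

# Split the minibatch gradient used for SAM's adversarial perturbation:
noise_component = minibatch_grad - full_grad               # batch-specific gradient noise

# Perturbing along only the noise component (i.e., excluding the full
# gradient), in the spirit of the finding above:
rho = 0.05
eps = rho * noise_component / (np.linalg.norm(noise_component) + 1e-12)
```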
arXiv Detail & Related papers (2024-03-19T01:39:33Z) - CR-SAM: Curvature Regularized Sharpness-Aware Minimization [8.248964912483912]
Sharpness-Aware Minimization (SAM) aims to enhance the generalizability by minimizing worst-case loss using one-step gradient ascent as an approximation.
In this paper, we introduce a normalized Hessian trace to accurately measure the curvature of the loss landscape on both training and test sets.
In particular, to counter excessive non-linearity of the loss landscape, we propose Curvature Regularized SAM (CR-SAM).
arXiv Detail & Related papers (2023-12-21T03:46:29Z) - K-SAM: Sharpness-Aware Minimization at the Speed of SGD [83.78737278889837]
Sharpness-Aware Minimization (SAM) has emerged as a robust technique for improving the accuracy of deep neural networks.
SAM incurs a high computational cost in practice, requiring up to twice as much computation as vanilla SGD.
We propose to compute gradients in both stages of SAM on only the top-k samples with highest loss.
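The top-k selection can be sketched as follows; the per-sample squared-error loss and the data are toy assumptions used only to show the mechanism.

```python
import numpy as np

def topk_indices(per_sample_losses, k):
    """Return indices of the k samples with the highest loss."""
    return np.argsort(per_sample_losses)[-k:]

# Toy per-sample losses for a scalar linear model w with squared error
x = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
y = np.zeros_like(x)
w = 1.0
per_sample_losses = 0.5 * (w * x - y) ** 2

k = 2
idx = topk_indices(per_sample_losses, k)
# Both SAM gradient evaluations would now use only x[idx], y[idx]
grad_topk = np.mean((w * x[idx] - y[idx]) * x[idx])
```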
arXiv Detail & Related papers (2022-10-23T21:49:58Z) - Understanding Edge-of-Stability Training Dynamics with a Minimalist Example [20.714857891192345]
Recently, researchers observed that gradient descent for deep neural networks operates in an "edge-of-stability" (EoS) regime.
We give a rigorous analysis of its dynamics in a large local region and explain why the final converging point has sharpness close to $2/\eta$.
arXiv Detail & Related papers (2022-10-07T02:57:05Z) - Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability [40.17821914923602]
We show that gradient descent at the edge of stability implicitly follows projected gradient descent (PGD) under the constraint $S(\theta) \le 2/\eta$.
Our analysis provides precise predictions for the loss, sharpness, and deviation from the PGD trajectory throughout training.
arXiv Detail & Related papers (2022-09-30T17:15:12Z) - Sharpness-Aware Training for Free [163.1248341911413]
Sharpness-Aware Minimization (SAM) has shown that minimizing a sharpness measure, which reflects the geometry of the loss landscape, can significantly reduce the generalization error.
Sharpness-Aware Training for Free (SAF) mitigates the sharp landscape at almost zero additional computational cost over the base optimizer.
SAF ensures convergence to a flat minimum with improved generalization capabilities.
arXiv Detail & Related papers (2022-05-27T16:32:43Z) - Understanding Gradient Descent on Edge of Stability in Deep Learning [32.03036040349019]
This paper mathematically analyzes a new mechanism of implicit regularization in the EoS phase, whereby GD updates due to non-smooth loss landscape turn out to evolve along some deterministic flow on the manifold of minimum loss.
The above theoretical results have been corroborated by an experimental study.
arXiv Detail & Related papers (2022-05-19T17:57:01Z) - Understanding the unstable convergence of gradient descent [51.40523554349091]
In machine learning applications, step sizes often violate the classical stability condition that, for an $L$-smooth cost, the step size be less than $2/L$.
We investigate this unstable convergence phenomenon from first principles, and elucidate key causes behind it.
We also identify its main characteristics, and how they interrelate, offering a transparent view backed by both theory and experiments.
arXiv Detail & Related papers (2022-04-03T11:10:17Z) - Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability [94.4070247697549]
Full-batch gradient descent on neural network training objectives operates in a regime we call the Edge of Stability.
In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the numerical value $2/\text{(step size)}$, and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales.
arXiv Detail & Related papers (2021-02-26T22:08:19Z)
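The $2/\eta$ threshold these papers revolve around can be checked directly on a one-dimensional quadratic, where the Hessian is a single curvature value: GD contracts when the curvature is below $2/\eta$ and diverges above it. The curvatures and step size below are illustrative choices.

```python
def run_gd(curvature, eta=0.1, steps=50, x0=1.0):
    """GD on L(x) = 0.5 * curvature * x**2; each step multiplies x by (1 - eta * curvature)."""
    x = x0
    for _ in range(steps):
        x -= eta * curvature * x
    return x

stable = run_gd(curvature=15.0)    # 15 < 2/0.1 = 20, so |1 - 1.5| = 0.5 < 1: contracts
unstable = run_gd(curvature=25.0)  # 25 > 20, so |1 - 2.5| = 1.5 > 1: diverges
```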