Stability Analysis of Sharpness-Aware Minimization
- URL: http://arxiv.org/abs/2301.06308v1
- Date: Mon, 16 Jan 2023 08:42:40 GMT
- Title: Stability Analysis of Sharpness-Aware Minimization
- Authors: Hoki Kim, Jinseong Park, Yujin Choi, and Jaewook Lee
- Abstract summary: Sharpness-aware minimization (SAM) is a recently proposed training method that seeks to find flat minima in deep learning.
In this paper, we demonstrate that SAM dynamics can have convergence instability that occurs near a saddle point.
- Score: 5.024497308975435
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Sharpness-aware minimization (SAM) is a recently proposed training method
that seeks to find flat minima in deep learning, resulting in state-of-the-art
performance across various domains. Instead of minimizing the loss of the
current weights, SAM minimizes the worst-case loss in its neighborhood in the
parameter space. In this paper, we demonstrate that SAM dynamics can have
convergence instability that occurs near a saddle point. Utilizing the
qualitative theory of dynamical systems, we explain how SAM becomes stuck in
the saddle point and then theoretically prove that the saddle point can become
an attractor under SAM dynamics. Additionally, we show that this convergence
instability can also occur in stochastic dynamical systems by establishing the
diffusion of SAM. We prove that SAM diffusion is worse than that of vanilla
gradient descent in terms of saddle point escape. Further, we demonstrate that
often overlooked training tricks, momentum and batch-size, are important to
mitigate the convergence instability and achieve high generalization
performance. Our theoretical and empirical results are thoroughly verified
through experiments on several well-known optimization problems and benchmark
tasks.
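The mechanism described in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the paper's experimental setup: it assumes the quadratic saddle f(x, y) = (x^2 - y^2) / 2 and the commonly used SAM update (a normalized gradient ascent step of radius rho, followed by a descent step using the gradient at the perturbed point). The function names, the starting point, and the values of `lr` and `rho` are illustrative choices made here.

```python
import math

def grad(w):
    """Gradient of the toy saddle f(x, y) = (x**2 - y**2) / 2 (assumed example)."""
    x, y = w
    return (x, -y)

def gd_step(w, lr):
    """One step of vanilla gradient descent."""
    g = grad(w)
    return (w[0] - lr * g[0], w[1] - lr * g[1])

def sam_step(w, lr, rho, eps=1e-12):
    """One SAM step: perturb along the normalized gradient, then descend
    using the gradient evaluated at the perturbed point."""
    g = grad(w)
    norm = math.hypot(g[0], g[1]) + eps  # eps guards against a zero gradient
    w_adv = (w[0] + rho * g[0] / norm, w[1] + rho * g[1] / norm)
    g_adv = grad(w_adv)
    return (w[0] - lr * g_adv[0], w[1] - lr * g_adv[1])

def run(step, w0, n_steps, **kwargs):
    """Apply `step` repeatedly starting from w0 and return the final point."""
    w = w0
    for _ in range(n_steps):
        w = step(w, **kwargs)
    return w

# Start close to the saddle at the origin, inside the perturbation radius rho.
w0 = (0.03, 0.04)
w_gd = run(gd_step, w0, 200, lr=0.05)
w_sam = run(sam_step, w0, 200, lr=0.05, rho=0.1)

print("GD  distance from saddle:", math.hypot(*w_gd))
print("SAM distance from saddle:", math.hypot(*w_sam))
```

In this toy setting, gradient descent leaves the saddle along the y direction, while the SAM iterate stays in a small neighborhood of the origin because the ascent step flips the sign of the escape direction's gradient whenever the iterate is closer to the saddle than rho. Whether and how this carries over to neural network training is the subject of the paper itself.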
Related papers
- Sharpness-Aware Minimization Efficiently Selects Flatter Minima Late in Training [47.25594539120258]
We find that Sharpness-Aware Minimization (SAM) efficiently selects flatter minima late in training.
Even a few epochs of SAM applied at the end of training yield nearly the same generalization and solution sharpness as full SAM training.
We conjecture that the optimization method chosen in the late phase is more crucial in shaping the final solution's properties.
arXiv Detail & Related papers (2024-10-14T10:56:42Z)
- Bilateral Sharpness-Aware Minimization for Flatter Minima [61.17349662062522]
Sharpness-Aware Minimization (SAM) enhances generalization by reducing a Max-Sharpness (MaxS).
In this paper, we propose to utilize the difference between the training loss and the minimum loss over the neighborhood surrounding the current weight, which we denote as Min-Sharpness (MinS).
By merging MaxS and MinS, we created a better FI that indicates a flatter direction during the optimization. Specifically, we combine this FI with SAM into the proposed Bilateral SAM (BSAM), which finds a flatter minimum than that of SAM.
arXiv Detail & Related papers (2024-09-20T03:01:13Z)
- Friendly Sharpness-Aware Minimization [62.57515991835801]
Sharpness-Aware Minimization (SAM) has been instrumental in improving deep neural network training by minimizing both training loss and loss sharpness.
We investigate the key role of batch-specific gradient noise within the adversarial perturbation, i.e., the current minibatch gradient.
By decomposing the adversarial gradient noise components, we discover that relying solely on the full gradient degrades generalization while excluding it leads to improved performance.
arXiv Detail & Related papers (2024-03-19T01:39:33Z)
- Stabilizing Sharpness-aware Minimization Through A Simple Renormalization Strategy [12.050160495730381]
Sharpness-aware minimization (SAM) has attracted much attention because of its surprising effectiveness in improving performance.
We propose a simple renormalization strategy, dubbed Stable SAM (SSAM), so that the gradient norm of the descent step remains the same as that of the ascent step.
Our strategy is easy to implement and flexible enough to integrate with SAM and its variants, almost at no computational cost.
arXiv Detail & Related papers (2024-01-14T10:53:36Z)
- Critical Influence of Overparameterization on Sharpness-aware Minimization [12.321517302762558]
We show that the sharpness-aware minimization (SAM) strategy is affected by overparameterization.
We prove multiple theoretical benefits of overparameterization for SAM to attain (i) minima with more uniform Hessian moments compared to SGD, (ii) much faster convergence at a linear rate, and (iii) lower test error for two-layer networks.
arXiv Detail & Related papers (2023-11-29T11:19:50Z)
- Enhancing Sharpness-Aware Optimization Through Variance Suppression [48.908966673827734]
This work embraces the geometry of the loss function, where neighborhoods of 'flat minima' heighten generalization ability.
It seeks 'flat valleys' by minimizing the maximum loss caused by an adversary perturbing parameters within the neighborhood.
Although critical to account for sharpness of the loss function, such an 'over-friendly adversary' can curtail the outmost level of generalization.
arXiv Detail & Related papers (2023-09-27T13:18:23Z)
- Systematic Investigation of Sparse Perturbed Sharpness-Aware Minimization Optimizer [158.2634766682187]
Deep neural networks often suffer from poor generalization due to complex and non-convex loss landscapes.
Sharpness-Aware Minimization (SAM) is a popular solution that smooths the loss by minimizing the change of landscape when adding a perturbation.
In this paper, we propose Sparse SAM (SSAM), an efficient and effective training scheme that achieves perturbation by a binary mask.
arXiv Detail & Related papers (2023-06-30T09:33:41Z)
- AdaSAM: Boosting Sharpness-Aware Minimization with Adaptive Learning Rate and Momentum for Training Deep Neural Networks [76.90477930208982]
Sharpness-aware minimization (SAM) has been extensively explored as it can generalize better for training deep neural networks.
Integrating SAM with adaptive learning rate and momentum acceleration, dubbed AdaSAM, has already been explored.
We conduct experiments on several NLP tasks, which show that AdaSAM could achieve superior performance compared with SGD, AMSGrad, and SAM.
arXiv Detail & Related papers (2023-03-01T15:12:42Z)
- On Statistical Properties of Sharpness-Aware Minimization: Provable Guarantees [5.91402820967386]
We present a new theoretical explanation of why Sharpness-Aware Minimization (SAM) generalizes well.
SAM is particularly well-suited for both sharp and non-sharp problems.
Our findings are validated using numerical experiments on deep neural networks.
arXiv Detail & Related papers (2023-02-23T07:52:31Z)
- SAM operates far from home: eigenvalue regularization as a dynamical phenomenon [15.332235979022036]
The Sharpness Aware Minimization (SAM) algorithm has been shown to control large eigenvalues of the loss Hessian.
We show that SAM provides a strong regularization of the eigenvalues throughout the learning trajectory.
Our theory predicts the largest eigenvalue as a function of the learning rate and SAM radius parameters.
arXiv Detail & Related papers (2023-02-17T04:51:20Z)
- An SDE for Modeling SAM: Theory and Insights [7.1967126772249586]
We study SAM (Sharpness-Aware Minimization), which has recently attracted a lot of interest due to its increased performance over classical variants of gradient descent.
Our main contribution is the derivation of continuous-time models (in the form of SDEs) for SAM and two of its variants, both for the full-batch and mini-batch settings.
arXiv Detail & Related papers (2023-01-19T17:54:50Z)