Sharpness-Aware Minimization: General Analysis and Improved Rates
- URL: http://arxiv.org/abs/2503.02225v1
- Date: Tue, 04 Mar 2025 03:04:06 GMT
- Title: Sharpness-Aware Minimization: General Analysis and Improved Rates
- Authors: Dimitris Oikonomou, Nicolas Loizou
- Abstract summary: Sharpness-Aware Minimization (SAM) has emerged as a powerful method for improving generalization in machine learning models. We provide a unified analysis of SAM and its unnormalized variant (USAM) under one single flexible update rule (Unified SAM), and present convergence results of the new algorithm under a relaxed and more natural assumption on the stochastic noise.
- Score: 10.11126899274029
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sharpness-Aware Minimization (SAM) has emerged as a powerful method for improving generalization in machine learning models by minimizing the sharpness of the loss landscape. However, despite its success, several important questions regarding the convergence properties of SAM in non-convex settings are still open, including the benefits of using normalization in the update rule, the dependence of the analysis on the restrictive bounded variance assumption, and the convergence guarantees under different sampling strategies. To address these questions, in this paper, we provide a unified analysis of SAM and its unnormalized variant (USAM) under one single flexible update rule (Unified SAM), and we present convergence results of the new algorithm under a relaxed and more natural assumption on the stochastic noise. Our analysis provides convergence guarantees for SAM under different step size selections for non-convex problems and functions that satisfy the Polyak-Lojasiewicz (PL) condition (a non-convex generalization of strongly convex functions). The proposed theory holds under the arbitrary sampling paradigm, which includes importance sampling as special case, allowing us to analyze variants of SAM that were never explicitly considered in the literature. Experiments validate the theoretical findings and further demonstrate the practical effectiveness of Unified SAM in training deep neural networks for image classification tasks.
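To make the unified update rule concrete, below is a minimal PyTorch sketch of a single Unified SAM step. It assumes a mixing parameter `lam` that interpolates between a normalized (SAM-style, `lam=1`) and an unnormalized (USAM-style, `lam=0`) ascent direction; the function name `unified_sam_step`, the parameters `rho` and `lam`, and the exact form of the interpolation are illustrative assumptions for this sketch, not the paper's verbatim formulation.

```python
import torch

def unified_sam_step(model, loss_fn, batch, base_opt, rho=0.05, lam=1.0):
    """Hypothetical sketch of one Unified SAM step.

    lam=1.0 mimics a SAM-style normalized perturbation, lam=0.0 a
    USAM-style unnormalized one; intermediate values interpolate
    between the two. The paper's exact rule may differ.
    """
    inputs, targets = batch

    # First forward/backward pass: gradient at the current point w.
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    params = [p for p in model.parameters() if p.grad is not None]
    grads = [p.grad.detach().clone() for p in params]
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))

    # Ascent step: interpolate between normalized (SAM) and
    # unnormalized (USAM) directions; small constant avoids
    # division by zero when the gradient vanishes.
    scale = rho / (lam * grad_norm + (1.0 - lam) + 1e-12)
    perturbs = []
    with torch.no_grad():
        for p, g in zip(params, grads):
            e = scale * g
            p.add_(e)          # move to the perturbed point w + e
            perturbs.append(e)

    # Second forward/backward pass: gradient at the perturbed point.
    model.zero_grad()
    loss_perturbed = loss_fn(model(inputs), targets)
    loss_perturbed.backward()

    # Undo the perturbation, then let the base optimizer (e.g. SGD)
    # apply the descent step using the perturbed gradient.
    with torch.no_grad():
        for p, e in zip(params, perturbs):
            p.sub_(e)
    base_opt.step()
    model.zero_grad()
    return loss.item()
```

With `lam=1` the perturbation has norm exactly `rho`, as in standard SAM, while `lam=0` scales the ascent step with the raw gradient as in USAM. For reference, the Polyak-Lojasiewicz (PL) condition mentioned in the abstract requires 1/2 ||grad f(w)||^2 >= mu (f(w) - f*) for some mu > 0 and all w, where f* is the minimum value; every strongly convex function satisfies it, but many non-convex functions do as well.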
Related papers
- Stochastic Optimization with Optimal Importance Sampling [49.484190237840714]
We propose an iterative-based algorithm that jointly updates the decision and the IS distribution without requiring time-scale separation between the two.
Our method achieves the lowest possible variance and guarantees global convergence under convexity of the objective and mild assumptions on the IS distribution family.
arXiv Detail & Related papers (2025-04-04T16:10:18Z)
- Avoiding spurious sharpness minimization broadens applicability of SAM [13.21265875272573]
Curvature regularization techniques like Sharpness Aware Minimization (SAM) have shown great promise in improving generalization on vision tasks.
We find that SAM performs poorly in domains like natural language processing (NLP), often degrading performance -- even with twice the compute budget.
We develop an alternative algorithm we call Functional-SAM, which regularizes curvature only through modification of the statistics of the overall function.
arXiv Detail & Related papers (2025-02-04T15:25:47Z)
- Preconditioned Sharpness-Aware Minimization: Unifying Analysis and a Novel Learning Algorithm [39.656014609027494]
Sharpness-aware minimization (SAM) has emerged as a powerful tool to improve the generalizability of deep neural network based learning. This contribution leverages preconditioning (pre) to unify SAM variants and provide not only a unifying convergence analysis, but also valuable insights. A novel algorithm termed infoSAM is introduced to address the so-called adversarial model degradation issue in SAM by adjusting gradients depending on noise estimates.
arXiv Detail & Related papers (2025-01-11T18:05:33Z)
- Generalization Bounds of Surrogate Policies for Combinatorial Optimization Problems [61.580419063416734]
A recent stream of structured learning approaches has improved the practical state of the art for a range of optimization problems.
The key idea is to exploit the statistical distribution over instances instead of dealing with instances separately.
In this article, we investigate methods that smooth the risk by perturbing the policy, which eases optimization and improves the generalization error.
arXiv Detail & Related papers (2024-07-24T12:00:30Z)
- A Universal Class of Sharpness-Aware Minimization Algorithms [57.29207151446387]
We introduce a new class of sharpness measures, leading to new sharpness-aware objective functions.
We prove that these measures are universally expressive, allowing any function of the training loss Hessian matrix to be represented by appropriate hyperparameters.
arXiv Detail & Related papers (2024-06-06T01:52:09Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- Critical Influence of Overparameterization on Sharpness-aware Minimization [12.321517302762558]
We show that the sharpness-aware minimization (SAM) strategy is critically affected by overparameterization.
We prove multiple theoretical benefits of overparameterization for SAM to attain (i) minima with more uniform Hessian moments compared to SGD, (ii) much faster convergence at a linear rate, and (iii) lower test error for two-layer networks.
arXiv Detail & Related papers (2023-11-29T11:19:50Z)
- Normalization Layers Are All That Sharpness-Aware Minimization Needs [53.799769473526275]
Sharpness-aware minimization (SAM) was proposed to reduce sharpness of minima.
We show that perturbing only the affine normalization parameters (typically comprising 0.1% of the total parameters) in the adversarial step of SAM can outperform perturbing all of the parameters.
arXiv Detail & Related papers (2023-06-07T08:05:46Z)
- AdaSAM: Boosting Sharpness-Aware Minimization with Adaptive Learning Rate and Momentum for Training Deep Neural Networks [76.90477930208982]
Sharpness-aware minimization (SAM) has been extensively explored as it can generalize better for training deep neural networks.
Integrating SAM with adaptive learning rate and momentum acceleration, dubbed AdaSAM, has already been explored.
We conduct several experiments on several NLP tasks, which show that AdaSAM could achieve superior performance compared with SGD, AMSGrad, and SAM optimizers.
arXiv Detail & Related papers (2023-03-01T15:12:42Z)
- Single-Call Stochastic Extragradient Methods for Structured Non-monotone Variational Inequalities: Improved Analysis under Weaker Conditions [21.681130192954583]
Single-call stochastic extragradient methods are among the most efficient algorithms for solving large-scale min-max optimization problems.
We provide convergence guarantees for two classes of VIPs: (i) quasi-strongly monotone problems (a generalization of strongly monotone problems) and (ii) weak Minty variational inequalities (a generalization of monotone and Minty VIPs).
Our convergence analysis holds under the arbitrary sampling paradigm, which includes importance sampling and various mini-batching strategies as special cases.
arXiv Detail & Related papers (2023-02-27T18:53:28Z)
- On Statistical Properties of Sharpness-Aware Minimization: Provable Guarantees [5.91402820967386]
We present a new theoretical explanation of why Sharpness-Aware Minimization (SAM) generalizes well.
SAM is particularly well-suited for both sharp and non-sharp problems.
Our findings are validated using numerical experiments on deep neural networks.
arXiv Detail & Related papers (2023-02-23T07:52:31Z)
- Stochastic Extragradient: General Analysis and Improved Rates [23.29653334470774]
The Stochastic Extragradient (SEG) method is one of the most popular algorithms for solving min-max optimization and variational inequality problems.
We develop a novel theoretical framework that allows us to analyze several variants of SEG in a unified manner.
Our rates for the new variants of SEG outperform the current state-of-the-art convergence guarantees and rely on less restrictive assumptions.
arXiv Detail & Related papers (2021-11-16T16:49:31Z)