Related papers: Avoiding spurious sharpness minimization broadens applicability of SAM

URL: http://arxiv.org/abs/2502.02407v1
Date: Tue, 04 Feb 2025 15:25:47 GMT
Title: Avoiding spurious sharpness minimization broadens applicability of SAM
Authors: Sidak Pal Singh, Hossein Mobahi, Atish Agarwala, Yann Dauphin
Abstract summary: Curvature regularization techniques like Sharpness Aware Minimization (SAM) have shown great promise in improving generalization on vision tasks. We find that SAM performs poorly in domains like natural language processing (NLP), often degrading performance -- even with twice the compute budget. We develop an alternative algorithm we call Functional-SAM, which regularizes curvature only through modification of the statistics of the overall function.
Score: 13.21265875272573
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Curvature regularization techniques like Sharpness Aware Minimization (SAM) have shown great promise in improving generalization on vision tasks. However, we find that SAM performs poorly in domains like natural language processing (NLP), often degrading performance -- even with twice the compute budget. We investigate the discrepancy across domains and find that in the NLP setting, SAM is dominated by regularization of the logit statistics -- instead of improving the geometry of the function itself. We use this observation to develop an alternative algorithm we call Functional-SAM, which regularizes curvature only through modification of the statistics of the overall function implemented by the neural network, and avoids spurious minimization through logit manipulation. Furthermore, we argue that preconditioning the SAM perturbation also prevents spurious minimization, and when combined with Functional-SAM, it gives further improvements. Our proposed algorithms show improved performance over AdamW and SAM baselines when trained for an equal number of steps, in both fixed-length and Chinchilla-style training settings, at various model scales (including billion-parameter scale). On the whole, our work highlights the importance of more precise characterizations of sharpness in broadening the applicability of curvature regularization to large language models (LLMs).
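
For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of one standard SAM update (ascent to a perturbed point, then descent using the gradient taken there). It is not the paper's Functional-SAM or its preconditioned variant, whose details are not given on this page. The toy quadratic loss, the function names, and the values of `lr` and `rho` are all assumptions chosen for illustration.

```python
import math


def loss(w):
    # Hypothetical toy loss: one sharp direction (curvature 10) and one flat (0.1).
    return 0.5 * (10.0 * w[0] ** 2 + 0.1 * w[1] ** 2)


def grad(w):
    # Analytic gradient of the toy loss above.
    return [10.0 * w[0], 0.1 * w[1]]


def sam_step(w, lr=0.05, rho=0.1):
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12  # avoid division by zero
    # Ascent: step of length rho along the normalized gradient.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent: apply the gradient from the perturbed point to the original weights.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


w = [1.0, 1.0]
for _ in range(100):
    w = sam_step(w)
```

Each step needs two gradient evaluations, which is the source of the roughly doubled compute cost that several of the papers listed here try to reduce.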

Related papers

LORENZA: Enhancing Generalization in Low-Rank Gradient LLM Training via Efficient Zeroth-Order Adaptive SAM [13.180761892449736]
We study robust parameter-efficient fine-tuning (PEFT) techniques for large language models (LLMs). We present a new, highly computationally efficient framework called AdaZo-SAM, combining Adam and Sharpness-Aware Minimization (SAM). We also design a low-rank gradient optimization method named LORENZA, which is a memory-efficient version of AdaZo-SAM.
arXiv Detail & Related papers (2025-02-26T21:30:34Z)
Preconditioned Sharpness-Aware Minimization: Unifying Analysis and a Novel Learning Algorithm [39.656014609027494]
Sharpness-aware minimization (SAM) has emerged as a powerful tool to improve the generalizability of deep neural network based learning. This contribution leverages preconditioning (pre) to unify SAM variants and provide not only a unifying convergence analysis, but also valuable insights. A novel algorithm termed infoSAM is introduced to address the so-called adversarial model degradation issue in SAM by adjusting gradients depending on noise estimates.
arXiv Detail & Related papers (2025-01-11T18:05:33Z)
μP$^2$: Effective Sharpness Aware Minimization Requires Layerwise Perturbation Scaling [49.25546155981064]
We study the infinite-width limit of neural networks trained with Sharpness Aware Minimization (SAM). Our findings reveal that the dynamics of standard SAM effectively reduce to applying SAM solely in the last layer in wide neural networks. In contrast, we identify a stable parameterization with layerwise scaling, which we call Maximal Update and Perturbation (μP$^2$), that ensures all layers are both feature learning and effectively perturbed in the limit.
arXiv Detail & Related papers (2024-10-31T16:32:04Z)
Implicit Regularization of Sharpness-Aware Minimization for Scale-Invariant Problems [26.377807940655305]
This work introduces a concept termed balancedness, defined as the difference between the squared norms of two variables. We develop a resource-efficient SAM variant, balancedness-aware regularization (BAR), tailored for scale-invariant problems.
arXiv Detail & Related papers (2024-10-18T18:19:18Z)
A Universal Class of Sharpness-Aware Minimization Algorithms [57.29207151446387]
We introduce a new class of sharpness measures, leading to new sharpness-aware objective functions. We prove that these measures are universally expressive, allowing any function of the training loss Hessian matrix to be represented by appropriate hyperparameters.
arXiv Detail & Related papers (2024-06-06T01:52:09Z)
Stabilizing Sharpness-aware Minimization Through A Simple Renormalization Strategy [12.050160495730381]
Sharpness-aware minimization (SAM) has attracted much attention because of its surprising effectiveness in improving generalization performance. We propose a simple renormalization strategy, dubbed Stable SAM (SSAM), so that the gradient norm of the descent step remains the same as that of the ascent step. Our strategy is easy to implement and flexible enough to integrate with SAM and its variants, at almost no computational cost.
arXiv Detail & Related papers (2024-01-14T10:53:36Z)
Systematic Investigation of Sparse Perturbed Sharpness-Aware Minimization Optimizer [158.2634766682187]
Deep neural networks often suffer from poor generalization due to complex and non-convex loss landscapes. Sharpness-Aware Minimization (SAM) is a popular solution that smooths the loss landscape by minimizing the change in loss when a perturbation is added to the weights. In this paper, we propose Sparse SAM (SSAM), an efficient and effective training scheme that achieves sparse perturbation via a binary mask.
arXiv Detail & Related papers (2023-06-30T09:33:41Z)
On Statistical Properties of Sharpness-Aware Minimization: Provable Guarantees [5.91402820967386]
We present a new theoretical explanation of why Sharpness-Aware Minimization (SAM) generalizes well. SAM is particularly well-suited for both sharp and non-sharp problems. Our findings are validated using numerical experiments on deep neural networks.
arXiv Detail & Related papers (2023-02-23T07:52:31Z)
Improved Deep Neural Network Generalization Using m-Sharpness-Aware Minimization [14.40189851070842]
Sharpness-Aware Minimization (SAM) modifies the underlying loss function to guide descent methods towards flatter minima. Recent work suggests that mSAM can outperform SAM in terms of test accuracy. This paper presents a comprehensive empirical evaluation of mSAM on various tasks and datasets.
arXiv Detail & Related papers (2022-12-07T00:37:55Z)
Improving Sharpness-Aware Minimization with Fisher Mask for Better Generalization on Language Models [93.85178920914721]
Fine-tuning large pretrained language models on a limited training corpus usually suffers from poor generalization. We propose a novel optimization procedure, namely FSAM, which introduces a Fisher mask to improve the efficiency and performance of SAM. We show that FSAM consistently outperforms vanilla SAM by 0.67 to 1.98 average score across four different pretrained models.
arXiv Detail & Related papers (2022-10-11T14:53:58Z)
Sharpness-Aware Training for Free [163.1248341911413]
Sharpness-Aware Minimization (SAM) has shown that minimizing a sharpness measure, which reflects the geometry of the loss landscape, can significantly reduce the generalization error. Sharpness-Aware Training for Free (SAF) mitigates the sharp landscape at almost zero additional computational cost over the base optimizer. SAF ensures convergence to a flat minimum with improved generalization capabilities.
arXiv Detail & Related papers (2022-05-27T16:32:43Z)
Randomized Sharpness-Aware Training for Boosting Computational Efficiency in Deep Learning [13.937644559223548]
We propose a simple yet efficient training scheme, called Randomized Sharpness-Aware Training (RST). Optimizers in RST perform a Bernoulli trial at each iteration to choose randomly between base algorithms (SGD) and sharpness-aware algorithms (SAM). We show that G-RST can outperform SAM in most cases while saving 50% extra cost.
arXiv Detail & Related papers (2022-03-18T13:57:17Z)
Efficient Sharpness-aware Minimization for Improved Training of Neural Networks [146.2011175973769]
This paper proposes the Efficient Sharpness Aware Minimizer (ESAM), which boosts SAM's efficiency at no cost to its generalization performance. ESAM includes two novel and efficient training strategies: Stochastic Weight Perturbation and Sharpness-Sensitive Data Selection. We show, via extensive experiments on the CIFAR and ImageNet datasets, that ESAM enhances the efficiency over SAM from requiring 100% extra computations to 40% vis-a-vis base optimizers.
arXiv Detail & Related papers (2021-10-07T02:20:37Z)
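
Several entries above reduce SAM's cost by applying the sharpness-aware step only some of the time. As a rough illustration of the randomized idea described in the Randomized Sharpness-Aware Training summary (a Bernoulli trial per iteration choosing between a base step and a SAM step), here is a minimal sketch. It assumes a fixed probability `p_sam` rather than any scheduling function from that paper; the toy loss gradient, the helper names, and all hyperparameter values are hypothetical.

```python
import math
import random


def grad(w):
    # Gradient of a hypothetical toy quadratic loss (curvatures 10 and 0.1).
    return [10.0 * w[0], 0.1 * w[1]]


def sgd_step(w, lr):
    # Plain gradient step: one gradient evaluation.
    return [wi - lr * gi for wi, gi in zip(w, grad(w))]


def sam_step(w, lr, rho):
    # Standard SAM step: two gradient evaluations.
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    return [wi - lr * gi for wi, gi in zip(w, grad(w_adv))]


def randomized_step(w, rng, p_sam=0.5, lr=0.05, rho=0.1):
    # Bernoulli trial: take a SAM step with probability p_sam, else a base step.
    use_sam = rng.random() < p_sam
    new_w = sam_step(w, lr, rho) if use_sam else sgd_step(w, lr)
    return new_w, use_sam


rng = random.Random(0)  # seeded for reproducibility
w = [1.0, 1.0]
sam_calls = 0
for _ in range(200):
    w, used_sam = randomized_step(w, rng)
    sam_calls += used_sam
```

With `p_sam=0.5`, about half of the iterations pay for the second gradient evaluation, so the extra cost relative to the base optimizer is roughly halved in expectation.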

This list is automatically generated from the titles and abstracts of the papers on this site.