Surrogate Gap Minimization Improves Sharpness-Aware Training
- URL: http://arxiv.org/abs/2203.08065v1
- Date: Tue, 15 Mar 2022 16:57:59 GMT
- Title: Surrogate Gap Minimization Improves Sharpness-Aware Training
- Authors: Juntang Zhuang, Boqing Gong, Liangzhe Yuan, Yin Cui, Hartwig Adam,
Nicha Dvornek, Sekhar Tatikonda, James Duncan, Ting Liu
- Abstract summary: Surrogate Gap Guided Sharpness-Aware Minimization (GSAM) is a novel improvement over Sharpness-Aware Minimization (SAM) with negligible computation overhead.
GSAM seeks a region with both small loss (by step 1) and low sharpness (by step 2), giving rise to a model with high generalization capabilities.
- Score: 52.58252223573646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recently proposed Sharpness-Aware Minimization (SAM) improves
generalization by minimizing a \textit{perturbed loss} defined as the maximum
loss within a neighborhood in the parameter space. However, we show that both
sharp and flat minima can have a low perturbed loss, implying that SAM does not
always prefer flat minima. Instead, we define a \textit{surrogate gap}, a
measure equivalent to the dominant eigenvalue of Hessian at a local minimum
when the radius of the neighborhood (to derive the perturbed loss) is small.
The surrogate gap is easy to compute and feasible for direct minimization
during training. Based on the above observations, we propose Surrogate
\textbf{G}ap Guided \textbf{S}harpness-\textbf{A}ware \textbf{M}inimization
(GSAM), a novel improvement over SAM with negligible computation overhead.
Conceptually, GSAM consists of two steps: 1) a gradient descent like SAM to
minimize the perturbed loss, and 2) an \textit{ascent} step in the
\textit{orthogonal} direction (after gradient decomposition) to minimize the
surrogate gap and yet not affect the perturbed loss. GSAM seeks a region with
both small loss (by step 1) and low sharpness (by step 2), giving rise to a
model with high generalization capabilities. Theoretically, we show the
convergence of GSAM and provably better generalization than SAM. Empirically,
GSAM consistently improves generalization (e.g., +3.2\% over SAM and +5.4\%
over AdamW on ImageNet top-1 accuracy for ViT-B/32). Code is released at \url{
https://sites.google.com/view/gsam-iclr22/home}.
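The two steps in the abstract translate into a compact update rule. Below is a minimal NumPy sketch on a toy quadratic loss (the loss, radius `rho`, step size, and `alpha` are illustrative assumptions, not the paper's settings): a SAM-style ascent gives the perturbed-loss gradient, the original gradient is decomposed into components parallel and orthogonal to it, and the update descends the perturbed loss while ascending the original loss along the orthogonal component to shrink the surrogate gap.

```python
import numpy as np

# Toy quadratic loss f(w) = 0.5 * w^T A w as a stand-in for a training loss.
A = np.diag([10.0, 1.0])
f = lambda w: 0.5 * w @ A @ w
grad_f = lambda w: A @ w

def gsam_step(w, rho=0.05, lr=0.1, alpha=0.1):
    """One GSAM-style update: descend the perturbed loss, ascend the original
    loss along the direction orthogonal to the perturbed-loss gradient."""
    g = grad_f(w)                                    # gradient of the original loss f
    eps = rho * g / (np.linalg.norm(g) + 1e-12)      # SAM ascent step within radius rho
    g_p = grad_f(w + eps)                            # gradient of the perturbed loss f_p
    # Decompose g into components parallel and orthogonal to g_p.
    g_parallel = (g @ g_p) / (g_p @ g_p + 1e-12) * g_p
    g_orth = g - g_parallel
    # Descending f_p minimizes the perturbed loss (step 1); the -alpha * g_orth term
    # ascends f in a direction that leaves f_p unchanged to first order, which
    # shrinks the surrogate gap h = f_p - f (step 2).
    return w - lr * (g_p - alpha * g_orth)

w = np.array([1.0, 1.0])
for _ in range(100):
    w = gsam_step(w)
print("final loss:", f(w))
```

With `alpha = 0` this reduces to a plain SAM step; the ascent along the orthogonal component is the part that targets the surrogate gap.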
Related papers
- Bilateral Sharpness-Aware Minimization for Flatter Minima [61.17349662062522]
Sharpness-Aware Minimization (SAM) enhances generalization by reducing a Max-Sharpness (MaxS) term, the difference between the maximum loss over the neighborhood surrounding the current weights and the training loss.
In this paper, we propose to utilize the difference between the training loss and the minimum loss over the neighborhood surrounding the current weights, which we denote as Min-Sharpness (MinS).
By merging MaxS and MinS, we create a better flatness indicator (FI) that indicates a flatter direction during optimization. Specifically, we combine this FI with SAM into the proposed Bilateral SAM (BSAM), which finds a flatter minimum than SAM.
arXiv Detail & Related papers (2024-09-20T03:01:13Z)
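To make the BSAM summary above concrete, the sketch below estimates Max-Sharpness and Min-Sharpness with a single ascent and a single descent step of radius `rho` on a toy quadratic loss and merges them by simple addition. The toy loss, the radius, and the additive merge are illustrative assumptions; how BSAM actually folds the indicator into the SAM update is described in the paper.

```python
import numpy as np

# Toy quadratic loss (illustrative stand-in; not from the paper).
A = np.diag([10.0, 1.0])
f = lambda w: 0.5 * w @ A @ w
grad_f = lambda w: A @ w

def flatness_indicator(w, rho=0.05):
    """Estimate MaxS and MinS with one ascent / one descent step and merge them."""
    g = grad_f(w)
    unit = g / (np.linalg.norm(g) + 1e-12)
    max_sharpness = f(w + rho * unit) - f(w)   # loss rise toward the worst nearby point
    min_sharpness = f(w) - f(w - rho * unit)   # loss drop toward the best nearby point
    return max_sharpness + min_sharpness       # roughly f(w + eps) - f(w - eps)

print("flatness indicator:", flatness_indicator(np.array([1.0, 1.0])))
```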
- Improving SAM Requires Rethinking its Optimization Formulation [57.601718870423454]
Sharpness-Aware Minimization (SAM) is originally formulated as a zero-sum game where the weights of a network and a bounded perturbation try to minimize/maximize, respectively, the same differentiable loss.
We argue that SAM should instead be reformulated using the 0-1 loss. As a continuous relaxation, we follow the simple conventional approach where the minimizing (maximizing) player uses an upper-bound (lower-bound) surrogate to the 0-1 loss. This leads to a novel formulation of SAM as a bilevel optimization problem, dubbed BiSAM.
arXiv Detail & Related papers (2024-07-17T20:22:33Z)
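The bilevel idea in the BiSAM entry can be sketched for a linear binary classifier: the inner (maximizing) player ascends a lower-bound surrogate to the 0-1 loss to find the perturbation, and the outer (minimizing) player descends an upper-bound surrogate (logistic loss) at the perturbed weights. The toy data, the negated-sigmoid lower bound, the scale `mu`, and the name `bisam_like_step` are illustrative assumptions, not necessarily the surrogates or settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))        # toy features
y = np.sign(rng.normal(size=32))    # toy labels in {-1, +1}
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def upper_surrogate_grad(w, X, y):
    """Gradient of the logistic loss mean(log(1 + exp(-margin))), an upper-bound
    surrogate to the 0-1 loss, used by the minimizing (weight) player."""
    m = y * (X @ w)
    return X.T @ (-y * sigmoid(-m)) / len(y)

def lower_surrogate_grad(w, X, y, mu=5.0):
    """Gradient of mean(-sigmoid(mu * margin)), a simple lower bound to the 0-1 loss,
    used by the maximizing (perturbation) player; BiSAM's exact surrogate may differ."""
    m = y * (X @ w)
    s = sigmoid(mu * m)
    return X.T @ (-mu * y * s * (1.0 - s)) / len(y)

def bisam_like_step(w, rho=0.05, lr=0.5):
    g_low = lower_surrogate_grad(w, X, y)
    eps = rho * g_low / (np.linalg.norm(g_low) + 1e-12)   # inner ascent on the lower bound
    return w - lr * upper_surrogate_grad(w + eps, X, y)   # outer descent on the upper bound

w = np.zeros(5)
for _ in range(200):
    w = bisam_like_step(w)
print("training 0-1 error:", float(np.mean(y * (X @ w) <= 0)))
```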
- Efficient Sharpness-Aware Minimization for Molecular Graph Transformer Models [42.59948316941217]
Sharpness-aware minimization (SAM) has received increasing attention in computer vision since it can effectively remove sharp local minima from the training trajectory and mitigate generalization degradation.
We propose a new algorithm named GraphSAM, which reduces the training cost of SAM and improves the generalization performance of graph transformer models.
arXiv Detail & Related papers (2024-06-19T01:03:23Z)
- Systematic Investigation of Sparse Perturbed Sharpness-Aware Minimization Optimizer [158.2634766682187]
Deep neural networks often suffer from poor generalization due to complex and non-convex loss landscapes.
Sharpness-Aware Minimization (SAM) is a popular solution that smooths the loss landscape by minimizing the maximized change of training loss when adding a perturbation to the weights.
In this paper, we propose Sparse SAM (SSAM), an efficient and effective training scheme that achieves sparse perturbation through a binary mask.
arXiv Detail & Related papers (2023-06-30T09:33:41Z)
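The "perturbation by a binary mask" idea from the entry above fits in a few lines. In the NumPy sketch below, the SAM perturbation is zeroed everywhere except on a fraction of coordinates chosen by largest gradient magnitude; the toy loss, the top-|gradient| mask rule, the radius, and the sparsity level are illustrative assumptions rather than the paper's exact scheme.

```python
import numpy as np

# Toy quadratic loss on four weights (illustrative stand-in; not from the paper).
A = np.diag([10.0, 5.0, 1.0, 0.5])
f = lambda w: 0.5 * w @ A @ w
grad_f = lambda w: A @ w

def sparse_sam_step(w, rho=0.05, lr=0.1, sparsity=0.5):
    """One SAM-style update whose perturbation is restricted by a binary mask."""
    g = grad_f(w)
    k = max(1, int(round((1.0 - sparsity) * w.size)))  # number of perturbed coordinates
    mask = np.zeros_like(w)
    mask[np.argsort(-np.abs(g))[:k]] = 1.0             # keep the largest-|gradient| entries
    g_masked = g * mask
    eps = rho * g_masked / (np.linalg.norm(g_masked) + 1e-12)  # sparse ascent perturbation
    return w - lr * grad_f(w + eps)                    # descend the perturbed-loss gradient

w = np.ones(4)
for _ in range(100):
    w = sparse_sam_step(w)
print("final loss:", f(w))
```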
- Sharpness-Aware Gradient Matching for Domain Generalization [84.14789746460197]
The goal of domain generalization (DG) is to enhance the generalization capability of the model learned from a source domain to other unseen domains.
The recently developed Sharpness-Aware Minimization (SAM) method aims to achieve this goal by minimizing the sharpness measure of the loss landscape.
We present two conditions to ensure that the model can converge to a flat minimum with a small loss, and present an algorithm, named Sharpness-Aware Gradient Matching (SAGM), to meet the two conditions.
Our proposed SAGM method consistently outperforms the state-of-the-art methods on five DG benchmarks.
arXiv Detail & Related papers (2023-03-18T07:25:12Z)
- Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization [33.50116027503244]
We show that the zeroth-order flatness can be insufficient to discriminate minima with low generalization error.
We also present a novel training procedure named Gradient norm Aware Minimization (GAM) to seek minima with uniformly small curvature across all directions.
arXiv Detail & Related papers (2023-03-03T16:58:53Z)
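One way to read "uniformly small curvature across all directions" is to penalize the gradient norm at an adversarially chosen nearby point, i.e., a first-order flatness measure. The sketch below does exactly that on a toy quadratic loss where the Hessian is known analytically (so the gradient-norm ascent direction is H g / ||g||); the penalty weight, radius, and this toy construction are illustrative assumptions that only demonstrate the mechanics of such an update, not GAM's actual algorithm.

```python
import numpy as np

# Toy quadratic loss f(w) = 0.5 * w^T A w with known Hessian A (illustration only).
A = np.diag([10.0, 1.0])
f = lambda w: 0.5 * w @ A @ w
grad_f = lambda w: A @ w

def gam_like_step(w, rho=0.05, lr=0.05, lam=0.1):
    """Descend the loss plus a penalty on the gradient norm at a perturbed point."""
    g = grad_f(w)
    # Ascent direction of ||grad f(w)|| is H g / ||g|| (H = A for the toy loss).
    d = A @ g / (np.linalg.norm(g) + 1e-12)
    delta = rho * d / (np.linalg.norm(d) + 1e-12)        # move toward larger gradient norm
    g_at = grad_f(w + delta)
    # Gradient (in w, with delta held fixed) of the penalty ||grad f(w + delta)||.
    norm_grad = A @ g_at / (np.linalg.norm(g_at) + 1e-12)
    return w - lr * (g + lam * norm_grad)

w = np.array([1.0, 1.0])
for _ in range(200):
    w = gam_like_step(w)
print("final loss:", f(w), "final gradient norm:", np.linalg.norm(grad_f(w)))
```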
- Make Sharpness-Aware Minimization Stronger: A Sparsified Perturbation Approach [132.37966970098645]
One of the popular solutions is Sharpness-Aware Minimization (SAM), which minimizes the change in training loss when a perturbation is added to the weights.
In this paper, we propose an efficient and effective training scheme coined Sparse SAM (SSAM), which achieves sparse perturbation by a binary mask.
In addition, we theoretically prove that SSAM can converge at the same rate as SAM, i.e., $O(\log T/\sqrt{T})$.
arXiv Detail & Related papers (2022-10-11T06:30:10Z)