Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves
Generalization
- URL: http://arxiv.org/abs/2303.03108v3
- Date: Tue, 4 Jul 2023 04:17:43 GMT
- Title: Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves
Generalization
- Authors: Xingxuan Zhang and Renzhe Xu and Han Yu and Hao Zou and Peng Cui
- Abstract summary: We show that the zeroth-order flatness can be insufficient to discriminate minima with low gradient error.
We also present a novel training procedure named Gradient norm Aware Minimization (GAM) to seek minima with uniformly small curvature across all directions.
- Score: 33.50116027503244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, flat minima have been shown to be effective for improving
generalization, and sharpness-aware minimization (SAM) achieves state-of-the-art
performance. Yet the definition of flatness discussed in SAM and its follow-ups
is limited to zeroth-order flatness (i.e., the worst-case loss within a
perturbation radius). We show that zeroth-order flatness can be insufficient to
discriminate minima with low generalization error from those with high
generalization error, both when there is a single minimum and when there are
multiple minima within the given perturbation radius. Thus we present
first-order flatness, a stronger measure of flatness focusing on the maximal
gradient norm within a perturbation radius, which bounds both the maximal
eigenvalue of the Hessian at local minima and the regularization function of
SAM. We also present
a novel training procedure named Gradient norm Aware Minimization (GAM) to seek
minima with uniformly small curvature across all directions. Experimental
results show that GAM improves the generalization of models trained with
current optimizers such as SGD and AdamW on various datasets and networks.
Furthermore, we show that GAM can help SAM find flatter minima and achieve
better generalization.
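To make the contrast concrete, the two notions in the abstract can be written roughly as follows. This is a sketch based only on the wording above; the paper's exact normalization (e.g., the factor of rho in the first-order measure) may differ.

```latex
% Zeroth-order flatness (the SAM-style quantity): worst-case loss gap
% within a perturbation ball of radius \rho around the parameters \theta.
R^{(0)}_{\rho}(\theta) \;=\; \max_{\|\delta\| \le \rho} L(\theta + \delta) \;-\; L(\theta)

% First-order flatness (the quantity GAM targets): worst-case gradient norm
% within the same ball; per the abstract, it bounds both the maximal Hessian
% eigenvalue at local minima and SAM's regularization function.
R^{(1)}_{\rho}(\theta) \;=\; \rho \cdot \max_{\|\delta\| \le \rho} \bigl\| \nabla L(\theta + \delta) \bigr\|
```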
Related papers
- Reweighting Local Minima with Tilted SAM [24.689230137012174]
Sharpness-Aware Minimization (SAM) has been demonstrated to improve the generalization performance of neural networks by seeking flat minima of the loss landscape.
In this work, we propose Tilted SAM (TSAM), which effectively assigns higher priority to local solutions that are flatter and that incur larger losses.
arXiv Detail & Related papers (2024-10-30T02:49:48Z) - Fast Graph Sharpness-Aware Minimization for Enhancing and Accelerating Few-Shot Node Classification [53.727688136434345]
Graph Neural Networks (GNNs) have shown superior performance in node classification.
We present Fast Graph Sharpness-Aware Minimization (FGSAM) that integrates the rapid training of Multi-Layer Perceptrons with the superior performance of GNNs.
Our proposed algorithm outperforms the standard SAM with lower computational costs in FSNC tasks.
arXiv Detail & Related papers (2024-10-22T09:33:29Z) - Bilateral Sharpness-Aware Minimization for Flatter Minima [61.17349662062522]
Sharpness-Aware Minimization (SAM) enhances generalization by reducing the Max-Sharpness (MaxS).
In this paper, we propose to utilize the difference between the training loss and the minimum loss over the neighborhood surrounding the current weight, which we denote as Min-Sharpness (MinS).
By merging MaxS and MinS, we obtain a better flatness indicator (FI) that points toward a flatter direction during optimization. Specifically, we combine this FI with SAM into the proposed Bilateral SAM (BSAM), which finds flatter minima than SAM (sketched below).
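In symbols, with perturbation radius rho, the two one-sided sharpness measures described above can be sketched as follows; the paper's exact definitions and the form of the combined FI may differ.

```latex
% Max-Sharpness: worst-case neighborhood loss minus the current training loss
\mathrm{MaxS}_{\rho}(\theta) \;=\; \max_{\|\delta\| \le \rho} L(\theta + \delta) \;-\; L(\theta)

% Min-Sharpness: current training loss minus the best-case neighborhood loss
\mathrm{MinS}_{\rho}(\theta) \;=\; L(\theta) \;-\; \min_{\|\delta\| \le \rho} L(\theta + \delta)

% One natural combined flatness indicator penalizes both sides of the gap
% (the exact combination used by BSAM may differ):
\mathrm{FI}_{\rho}(\theta) \;=\; \mathrm{MaxS}_{\rho}(\theta) + \mathrm{MinS}_{\rho}(\theta)
```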
arXiv Detail & Related papers (2024-09-20T03:01:13Z) - Agnostic Sharpness-Aware Minimization [29.641227264358704]
Sharpness-Aware Minimization (SAM) has been instrumental in improving deep neural network training by minimizing both the training loss and the sharpness of the loss landscape.
Model-Agnostic Meta-Learning (MAML) is a framework designed to improve the adaptability of models.
We introduce Agnostic-SAM, a novel approach that combines the principles of both SAM and MAML.
arXiv Detail & Related papers (2024-06-11T09:49:00Z) - The Inductive Bias of Flatness Regularization for Deep Matrix
Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of the minimum trace of the Hessian solutions in deep linear networks.
We show that for any depth greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters (sketched below).
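Stated loosely, the claim above can be written as follows for a depth-d linear network with end-to-end matrix W_E = W_d W_{d-1} ... W_1 and d > 1; this is only a sketch, and the precise RIP conditions and constants are in the paper.

```latex
% Over interpolating solutions, minimizing the Hessian trace of the loss is
% approximately equivalent (under RIP, up to problem-dependent constants) to
% minimizing the Schatten 1-norm (nuclear norm) of the end-to-end matrix:
\min_{\theta} \; \operatorname{tr}\!\bigl( \nabla^2_{\theta} L(\theta) \bigr)
\;\;\approx\;\;
\min_{W_E} \; \| W_E \|_{S_1},
\qquad
\| W_E \|_{S_1} = \sum_i \sigma_i(W_E)
% where \sigma_i(W_E) are the singular values of the end-to-end matrix W_E.
```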
arXiv Detail & Related papers (2023-06-22T23:14:57Z) - Sharpness-Aware Gradient Matching for Domain Generalization [84.14789746460197]
The goal of domain generalization (DG) is to enhance the generalization capability of the model learned from a source domain to other unseen domains.
The recently developed Sharpness-Aware Minimization (SAM) method aims to achieve this goal by minimizing the sharpness measure of the loss landscape.
We present two conditions to ensure that the model can converge to a flat minimum with a small loss, and present an algorithm named Sharpness-Aware Gradient Matching (SAGM).
Our proposed SAGM method consistently outperforms the state-of-the-art methods on five DG benchmarks.
arXiv Detail & Related papers (2023-03-18T07:25:12Z) - Why is parameter averaging beneficial in SGD? An objective smoothing perspective [13.863368438870562]
Stochastic gradient descent (SGD) and its implicit bias are often characterized in terms of the sharpness of the minima it finds.
We study the commonly used averaged SGD algorithm, which has been empirically observed (e.g., by Izmailov et al.) to improve generalization.
We prove that averaged SGD can efficiently optimize a smoothed objective that avoids sharp local minima (sketched below).
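As a minimal sketch of the kind of weight averaging this entry refers to (not the paper's exact procedure; `model`, `loader`, `loss_fn`, and the schedule are placeholders), a tail-averaged SGD loop in PyTorch might look like:

```python
import copy
import torch

def train_with_tail_averaging(model, loader, loss_fn, lr=0.1, epochs=10, avg_start=5):
    """Plain SGD with tail averaging of the weights (illustrative sketch only)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    avg_model = copy.deepcopy(model)   # holds the running average of the weights
    n_avg = 0
    for epoch in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        if epoch >= avg_start:          # only average over the tail of training
            n_avg += 1
            for p_avg, p in zip(avg_model.parameters(), model.parameters()):
                # incremental mean: p_avg <- p_avg + (p - p_avg) / n_avg
                p_avg.data.add_(p.data - p_avg.data, alpha=1.0 / n_avg)
    return avg_model  # averaged weights, expected to lie in a flatter region
```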
arXiv Detail & Related papers (2023-02-18T16:29:06Z) - GA-SAM: Gradient-Strength based Adaptive Sharpness-Aware Minimization
for Improved Generalization [22.53923556656022]
The Sharpness-Aware Minimization (SAM) algorithm has shown state-of-the-art generalization abilities in vision tasks.
However, there is some difficulty in applying SAM to certain natural language tasks, especially to models with drastic gradient changes, such as RNNs.
We propose a Gradient-Strength based Adaptive Sharpness-Aware Minimization (GA-SAM) algorithm to help learning algorithms find flat minima that generalize better.
arXiv Detail & Related papers (2022-10-13T10:44:10Z) - Surrogate Gap Minimization Improves Sharpness-Aware Training [52.58252223573646]
Surrogate Gap Guided Sharpness-Aware Minimization (GSAM) is a novel improvement over Sharpness-Aware Minimization (SAM) with negligible computation overhead.
GSAM seeks a region with both small loss (by step 1) and low sharpness (by step 2), giving rise to a model with high generalization capabilities.
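A rough formalization of the two-step idea above (a sketch following the entry's description; the paper's notation and exact update rule may differ):

```latex
% Perturbed (SAM-style) loss and the surrogate gap it induces:
L_{p}(\theta) \;=\; \max_{\|\delta\| \le \rho} L(\theta + \delta),
\qquad
h(\theta) \;=\; L_{p}(\theta) - L(\theta)

% GSAM drives down L_p (step 1: small loss in the region) while also
% decreasing the surrogate gap h (step 2: low sharpness), so the returned
% solution sits in a neighborhood that is both low-loss and flat.
```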
arXiv Detail & Related papers (2022-03-15T16:57:59Z) - A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient
Descent Exponentially Favors Flat Minima [91.11332770406007]
We show that Stochastic Gradient Descent (SGD) favors flat minima exponentially more than sharp minima.
We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima.
arXiv Detail & Related papers (2020-02-10T02:04:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences of its use.