Why Does Sharpness-Aware Minimization Generalize Better Than SGD?
- URL: http://arxiv.org/abs/2310.07269v1
- Date: Wed, 11 Oct 2023 07:51:10 GMT
- Title: Why Does Sharpness-Aware Minimization Generalize Better Than SGD?
- Authors: Zixiang Chen and Junkai Zhang and Yiwen Kou and Xiangning Chen and
Cho-Jui Hsieh and Quanquan Gu
- Abstract summary: We show why Sharpness-Aware Minimization (SAM) generalizes better than Stochastic Gradient Descent (SGD) for a certain data model and two-layer convolutional ReLU networks.
Our result explains the benefits of SAM, particularly its ability to prevent noise learning in the early stages, thereby facilitating more effective learning of features.
- Score: 102.40907275290891
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The challenge of overfitting, in which the model memorizes the training data
and fails to generalize to test data, has become increasingly significant in
the training of large neural networks. To tackle this challenge,
Sharpness-Aware Minimization (SAM) has emerged as a promising training method,
which can improve the generalization of neural networks even in the presence of
label noise. However, a deep understanding of how SAM works, especially in the
setting of nonlinear neural networks and classification tasks, remains largely
missing. This paper fills this gap by demonstrating why SAM generalizes better
than Stochastic Gradient Descent (SGD) for a certain data model and two-layer
convolutional ReLU networks. The loss landscape of our studied problem is
nonsmooth, thus current explanations for the success of SAM based on the
Hessian information are insufficient. Our result explains the benefits of SAM,
particularly its ability to prevent noise learning in the early stages, thereby
facilitating more effective learning of features. Experiments on both synthetic
and real data corroborate our theory.
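For reference, the sketch below spells out the standard two-step SAM update that the abstract and the papers below analyze: an ascent step to the approximate worst-case point w + rho * g / ||g|| inside a rho-ball, followed by a descent step using the gradient evaluated there. It is a minimal NumPy illustration with a hypothetical loss_grad callable, not code from the paper.

```python
import numpy as np

def sam_step(w, loss_grad, lr=0.1, rho=0.05, eps=1e-12):
    """One SAM update on parameters w.

    loss_grad(w) -> gradient of the (mini-batch) training loss at w.
    rho is the perturbation radius; lr is the learning rate.
    Minimal sketch of the two-step rule, not the paper's experimental code.
    """
    g = loss_grad(w)
    # Ascent step: move to the (approximate) worst-case point in the rho-ball.
    e = rho * g / (np.linalg.norm(g) + eps)
    # Descent step: apply the gradient taken at the perturbed point to w itself.
    g_adv = loss_grad(w + e)
    return w - lr * g_adv

# Example usage on a toy quadratic loss L(w) = 0.5 * ||w||^2.
if __name__ == "__main__":
    grad = lambda w: w
    w = np.array([1.0, -2.0])
    for _ in range(10):
        w = sam_step(w, grad)
    print(w)
```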
Related papers
- Sharpness-Aware Minimization Efficiently Selects Flatter Minima Late in Training [47.25594539120258]
We find that Sharpness-Aware Minimization (SAM) efficiently selects flatter minima late in training.
Even a few epochs of SAM applied at the end of training yield nearly the same generalization and solution sharpness as full SAM training.
We conjecture that the optimization method used in the late phase of training is more crucial than the earlier phases in shaping the final solution's properties.
arXiv Detail & Related papers (2024-10-14T10:56:42Z)
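As a concrete reading of the finding in the entry above, the sketch below trains with plain SGD for most epochs and switches to SAM only for the final few; the training loop, grad_fn, and the epoch split are illustrative assumptions rather than the paper's protocol.

```python
import numpy as np

def train_with_late_sam(w, batches, grad_fn, epochs=90, sam_epochs=5,
                        lr=0.1, rho=0.05, eps=1e-12):
    """Run SGD for most of training, then SAM for the last `sam_epochs` epochs.

    grad_fn(w, batch) -> mini-batch gradient at w. All names here are
    hypothetical placeholders used to illustrate the late-phase switch.
    """
    for epoch in range(epochs):
        use_sam = epoch >= epochs - sam_epochs
        for batch in batches:
            g = grad_fn(w, batch)
            if use_sam:
                # SAM: re-evaluate the gradient at the perturbed point.
                e = rho * g / (np.linalg.norm(g) + eps)
                g = grad_fn(w + e, batch)
            w = w - lr * g
    return w
```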
- Sharpness-Aware Minimization Enhances Feature Quality via Balanced Learning [17.708350046115616]
Sharpness-Aware Minimization (SAM) has emerged as a promising alternative to stochastic gradient descent (SGD).
We show that SAM achieves this effect by adaptively suppressing well-learned features, which gives the remaining features an opportunity to be learned.
Our insights are supported by experiments on real data, including CelebA, Waterbirds, CIFAR-MNIST, and DomainBed.
arXiv Detail & Related papers (2024-05-30T19:32:56Z)
- Friendly Sharpness-Aware Minimization [62.57515991835801]
Sharpness-Aware Minimization (SAM) has been instrumental in improving deep neural network training by minimizing both training loss and loss sharpness.
We investigate the key role of batch-specific gradient noise within the adversarial perturbation, i.e., the current minibatch gradient.
By decomposing this perturbation into full-gradient and stochastic gradient noise components, we discover that relying solely on the full gradient degrades generalization, while excluding it leads to improved performance.
arXiv Detail & Related papers (2024-03-19T01:39:33Z)
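One hedged way to instantiate the decomposition described in the Friendly Sharpness-Aware Minimization entry above is to keep an exponential moving average of mini-batch gradients as a rough full-gradient estimate and build the ascent perturbation from the remaining batch-specific noise only. The EMA coefficient, the subtraction, and grad_fn below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def fsam_like_step(w, m, grad_fn, batch, lr=0.1, rho=0.05,
                   beta=0.9, eps=1e-12):
    """One step that perturbs along the noise component of the gradient.

    m is a running EMA of mini-batch gradients (a rough full-gradient
    estimate). Subtracting it before the ascent step is an illustrative
    reading of the decomposition idea, not the paper's exact rule.
    """
    g = grad_fn(w, batch)
    m = beta * m + (1.0 - beta) * g           # full-gradient estimate
    noise = g - m                             # batch-specific noise component
    e = rho * noise / (np.linalg.norm(noise) + eps)
    g_adv = grad_fn(w + e, batch)             # descent gradient at perturbed point
    return w - lr * g_adv, m
```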
- Systematic Investigation of Sparse Perturbed Sharpness-Aware Minimization Optimizer [158.2634766682187]
Deep neural networks often suffer from poor generalization due to complex and non-convex loss landscapes.
Sharpness-Aware Minimization (SAM) is a popular solution that smooths the loss landscape by minimizing the maximum change in training loss when a perturbation is added to the weights.
In this paper, we propose Sparse SAM (SSAM), an efficient and effective training scheme that achieves sparse perturbation through a binary mask.
arXiv Detail & Related papers (2023-06-30T09:33:41Z)
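To make the binary-mask perturbation in the Sparse SAM entry above concrete, the sketch below restricts the ascent step to the largest-magnitude gradient coordinates. The top-k magnitude rule and the sparsity level are assumptions chosen for illustration; the paper's actual mask construction may differ.

```python
import numpy as np

def ssam_like_step(w, grad_fn, batch, lr=0.1, rho=0.05,
                   sparsity=0.5, eps=1e-12):
    """SAM step whose perturbation is restricted by a binary mask.

    The mask keeps the largest-magnitude gradient coordinates
    (an illustrative choice; other mask rules are possible).
    """
    g = grad_fn(w, batch)
    k = max(1, int((1.0 - sparsity) * g.size))
    thresh = np.sort(np.abs(g).ravel())[-k]
    mask = (np.abs(g) >= thresh).astype(g.dtype)    # binary mask
    ge = mask * g                                   # sparse ascent direction
    e = rho * ge / (np.linalg.norm(ge) + eps)
    g_adv = grad_fn(w + e, batch)
    return w - lr * g_adv
```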
- AdaSAM: Boosting Sharpness-Aware Minimization with Adaptive Learning Rate and Momentum for Training Deep Neural Networks [76.90477930208982]
Sharpness-Aware Minimization (SAM) has been extensively explored, as it can improve generalization when training deep neural networks.
Integrating SAM with an adaptive learning rate and momentum acceleration, dubbed AdaSAM, has already been explored.
We conduct experiments on several NLP tasks, which show that AdaSAM achieves superior performance compared with SGD, AMSGrad, and SAMsGrad.
arXiv Detail & Related papers (2023-03-01T15:12:42Z)
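The AdaSAM entry above combines SAM's perturbed gradient with an adaptive learning rate and momentum; the sketch below feeds the gradient taken at the perturbed point into an Adam-style moment update as one hedged reading of that combination, not the paper's exact algorithm.

```python
import numpy as np

def adasam_like_step(w, state, grad_fn, batch, lr=1e-3, rho=0.05,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """SAM ascent step followed by an Adam-style adaptive update.

    state = (m, v, t) holds the first/second moment estimates and the
    step count, e.g. (np.zeros_like(w), np.zeros_like(w), 0) initially.
    This is an illustrative combination, not the paper's code.
    """
    m, v, t = state
    g = grad_fn(w, batch)
    e = rho * g / (np.linalg.norm(g) + 1e-12)   # SAM perturbation
    g_adv = grad_fn(w + e, batch)               # gradient at perturbed point
    t += 1
    m = beta1 * m + (1 - beta1) * g_adv         # momentum (first moment)
    v = beta2 * v + (1 - beta2) * g_adv**2      # second moment
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, (m, v, t)
```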
- On Statistical Properties of Sharpness-Aware Minimization: Provable Guarantees [5.91402820967386]
We present a new theoretical explanation of why Sharpness-Aware Minimization (SAM) generalizes well.
SAM is particularly well-suited for both sharp and non-sharp problems.
Our findings are validated using numerical experiments on deep neural networks.
arXiv Detail & Related papers (2023-02-23T07:52:31Z)
- Improved Deep Neural Network Generalization Using m-Sharpness-Aware Minimization [14.40189851070842]
Sharpness-Aware Minimization (SAM) modifies the underlying loss function to guide descent methods towards flatter minima.
Recent work suggests that mSAM, a variant of SAM that applies the ascent perturbation separately to disjoint shards of each mini-batch, can outperform SAM in terms of test accuracy.
This paper presents a comprehensive empirical evaluation of mSAM on various tasks and datasets.
arXiv Detail & Related papers (2022-12-07T00:37:55Z)
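For the mSAM entry above, the sketch below computes a separate ascent perturbation on each of m disjoint shards of a mini-batch and averages the resulting descent gradients; the even split and grad_fn are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def msam_like_step(w, grad_fn, batch_shards, lr=0.1, rho=0.05, eps=1e-12):
    """mSAM-style step: a separate ascent perturbation per micro-batch.

    batch_shards is a list of m disjoint pieces of one mini-batch.
    Each shard gets its own perturbation; the descent gradients are
    averaged. An illustrative sketch, not the paper's implementation.
    """
    g_sum = np.zeros_like(w)
    for shard in batch_shards:
        g = grad_fn(w, shard)
        e = rho * g / (np.linalg.norm(g) + eps)
        g_sum += grad_fn(w + e, shard)
    return w - lr * g_sum / len(batch_shards)
```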
- Efficient Sharpness-aware Minimization for Improved Training of Neural Networks [146.2011175973769]
This paper proposes Efficient Sharpness-Aware Minimizer (ESAM), which boosts SAM's efficiency at no cost to its generalization performance.
ESAM includes two novel and efficient training strategies: Stochastic Weight Perturbation and Sharpness-Sensitive Data Selection.
We show, via extensive experiments on the CIFAR and ImageNet datasets, that ESAM enhances the efficiency over SAM from requiring 100% extra computations to 40% vis-à-vis base optimizers.
arXiv Detail & Related papers (2021-10-07T02:20:37Z)
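The two ESAM strategies named above can be read as (i) perturbing only a random subset of the weights and (ii) computing the descent gradient only on the examples whose loss rises most at the perturbed point. The sketch below illustrates both under those assumptions; grad_fn and loss_per_example are hypothetical helpers, and the selection rule is a simplified reading rather than the authors' implementation.

```python
import numpy as np

def esam_like_step(w, grad_fn, loss_per_example, batch, rng,
                   lr=0.1, rho=0.05, keep_w=0.5, keep_x=0.5, eps=1e-12):
    """One ESAM-style step with both approximations.

    Stochastic Weight Perturbation: only a random fraction `keep_w` of the
    coordinates is perturbed. Sharpness-Sensitive Data Selection: the descent
    gradient uses only the fraction `keep_x` of examples whose loss rises most
    at the perturbed point. grad_fn(w, examples) returns a gradient,
    loss_per_example(w, examples) returns an array of per-example losses,
    batch is a list of examples, and rng is a np.random.Generator; all are
    illustrative placeholders.
    """
    g = grad_fn(w, batch)
    mask = (rng.random(w.shape) < keep_w).astype(w.dtype)
    ge = mask * g
    e = rho * ge / (np.linalg.norm(ge) + eps)       # sparse weight perturbation
    # Rank examples by how much the perturbation increases their loss.
    rise = loss_per_example(w + e, batch) - loss_per_example(w, batch)
    n_keep = max(1, int(keep_x * len(batch)))
    selected = [batch[i] for i in np.argsort(rise)[-n_keep:]]
    g_adv = grad_fn(w + e, selected)                # descent on selected data only
    return w - lr * g_adv
```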