AdaSAM: Boosting Sharpness-Aware Minimization with Adaptive Learning
Rate and Momentum for Training Deep Neural Networks
- URL: http://arxiv.org/abs/2303.00565v1
- Date: Wed, 1 Mar 2023 15:12:42 GMT
- Title: AdaSAM: Boosting Sharpness-Aware Minimization with Adaptive Learning
Rate and Momentum for Training Deep Neural Networks
- Authors: Hao Sun, Li Shen, Qihuang Zhong, Liang Ding, Shixiang Chen, Jingwei
Sun, Jing Li, Guangzhong Sun, Dacheng Tao
- Abstract summary: Sharpness-aware minimization (SAM) has been extensively explored as it can generalize better for training deep neural networks.
Integrating SAM with an adaptive learning rate and momentum acceleration, dubbed AdaSAM, has already been explored.
We conduct experiments on several NLP tasks, which show that AdaSAM achieves superior performance compared with the SGD, AMSGrad, and SAM optimizers.
- Score: 76.90477930208982
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The sharpness-aware minimization (SAM) optimizer has been extensively explored, as it can generalize better when training deep neural networks by introducing an extra perturbation step to flatten the loss landscape of deep learning models.
Integrating SAM with an adaptive learning rate and momentum acceleration, dubbed
AdaSAM, has already been explored empirically to train large-scale deep neural
networks, but without a theoretical guarantee, owing to the triple difficulty of
analyzing the coupled perturbation step, adaptive learning rate, and momentum
step. In this paper, we analyze the convergence rate of AdaSAM in the
stochastic non-convex setting. We theoretically show that AdaSAM admits an
$\mathcal{O}(1/\sqrt{bT})$ convergence rate, which achieves a linear speedup
with respect to the mini-batch size $b$. Specifically, to decouple the
stochastic gradient step from the adaptive learning rate and the perturbed
gradient, we introduce a delayed second-order momentum term so that these
quantities become independent when taking expectations during the analysis. Then
we bound them by showing the adaptive learning rate has a limited range, which
makes our analysis feasible. To the best of our knowledge, we are the first to
provide a non-trivial convergence rate for SAM with an adaptive learning rate
and momentum acceleration. Finally, we conduct experiments on several
NLP tasks, which show that AdaSAM achieves superior performance compared
with the SGD, AMSGrad, and SAM optimizers.
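For concreteness, below is a minimal sketch of one AdaSAM-style update: a SAM ascent perturbation followed by an Adam-style adaptive step with momentum. The hyperparameter values, the absence of bias correction, and the toy quadratic objective are illustrative assumptions, not the paper's exact algorithm or settings.

```python
import numpy as np

def adasam_step(w, m, v, grad_fn, lr=1e-2, rho=0.05,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaSAM-style update: a SAM ascent perturbation followed by
    an Adam-style adaptive step (sketch, no bias correction)."""
    g = grad_fn(w)                            # stochastic gradient at w
    e = rho * g / (np.linalg.norm(g) + eps)   # SAM perturbation (ascent direction)
    g_sam = grad_fn(w + e)                    # gradient at the perturbed point
    m = beta1 * m + (1 - beta1) * g_sam       # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g_sam ** 2  # second moment (adaptive scaling)
    w = w - lr * m / (np.sqrt(v) + eps)       # adaptive descent step
    return w, m, v

# Toy usage on a noisy quadratic f(w) = 0.5 * ||w||^2, whose minimizer is 0.
rng = np.random.default_rng(0)
grad_fn = lambda w: w + 0.01 * rng.standard_normal(w.shape)
w, m, v = np.ones(4), np.zeros(4), np.zeros(4)
for _ in range(2000):
    w, m, v = adasam_step(w, m, v, grad_fn)
print(w)  # entries should be close to zero
```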
Related papers
- Asymptotic Unbiased Sample Sampling to Speed Up Sharpness-Aware Minimization [17.670203551488218]
We propose Asymptotic Unbiased Sampling (AUSAM) to accelerate Sharpness-Aware Minimization.
AUSAM maintains the model's generalization capacity while significantly enhancing computational efficiency.
As a plug-and-play, architecture-agnostic method, our approach consistently accelerates SAM across a range of tasks and networks.
arXiv Detail & Related papers (2024-06-12T08:47:44Z) - Friendly Sharpness-Aware Minimization [62.57515991835801]
Sharpness-Aware Minimization (SAM) has been instrumental in improving deep neural network training by minimizing both training loss and loss sharpness.
We investigate the key role of batch-specific gradient noise within the adversarial perturbation, i.e., the current minibatch gradient.
By decomposing the adversarial perturbation into full-gradient and stochastic gradient noise components, we discover that relying solely on the full-gradient component degrades generalization while excluding it leads to improved performance.
arXiv Detail & Related papers (2024-03-19T01:39:33Z) - Stabilizing Sharpness-aware Minimization Through A Simple
Renormalization Strategy [12.927965934262847]
Training neural networks with sharpness-aware minimization (SAM) can be highly unstable.
We propose a simple renormalization strategy, dubbed StableSAM, so that the norm of the surrogate gradient matches that of the exact gradient (a minimal sketch of this renormalization appears after this list).
We show how StableSAM extends the stable regime of learning rates and when it can consistently perform better than SAM with only minor modification.
arXiv Detail & Related papers (2024-01-14T10:53:36Z) - Critical Influence of Overparameterization on Sharpness-aware Minimization [12.321517302762558]
We show that the sharpness-aware minimization (SAM) strategy is critically influenced by overparameterization.
We prove multiple theoretical benefits of overparameterization for SAM: it attains (i) minima with more uniform Hessian moments compared to SGD, (ii) much faster convergence at a linear rate, and (iii) lower test error for two-layer networks.
arXiv Detail & Related papers (2023-11-29T11:19:50Z) - Why Does Sharpness-Aware Minimization Generalize Better Than SGD? [102.40907275290891]
We show why Sharpness-Aware Minimization (SAM) generalizes better than Stochastic Gradient Descent (SGD) for a certain data model and two-layer convolutional ReLU networks.
Our result explains the benefits of SAM, particularly its ability to prevent noise learning in the early stages, thereby facilitating more effective learning of features.
arXiv Detail & Related papers (2023-10-11T07:51:10Z) - Systematic Investigation of Sparse Perturbed Sharpness-Aware
Minimization Optimizer [158.2634766682187]
Deep neural networks often suffer from poor generalization due to complex and non-convex loss landscapes.
Sharpness-Aware Minimization (SAM) is a popular solution that smooths the loss landscape by minimizing the maximized change of training loss when a perturbation is added to the weights.
In this paper, we propose Sparse SAM (SSAM), an efficient and effective training scheme that achieves sparse perturbation via a binary mask (a minimal sketch of masked perturbation appears after this list).
arXiv Detail & Related papers (2023-06-30T09:33:41Z) - Towards Efficient and Scalable Sharpness-Aware Minimization [81.22779501753695]
We propose a novel algorithm, LookSAM, that only periodically computes the inner gradient ascent step.
LookSAM achieves similar accuracy gains to SAM while being tremendously faster.
We are the first to successfully scale up the batch size when training Vision Transformers (ViTs).
arXiv Detail & Related papers (2022-03-05T11:53:37Z) - Efficient Sharpness-aware Minimization for Improved Training of Neural
Networks [146.2011175973769]
This paper proposes the Efficient Sharpness-Aware Minimizer (ESAM), which boosts SAM's efficiency at no cost to its generalization performance.
ESAM includes two novel and efficient training strategies: Stochastic Weight Perturbation and Sharpness-Sensitive Data Selection.
We show, via extensive experiments on the CIFAR and ImageNet datasets, that ESAM improves efficiency over SAM, reducing the extra computation required from 100% to 40% vis-a-vis base optimizers.
arXiv Detail & Related papers (2021-10-07T02:20:37Z) - Stochastic Anderson Mixing for Nonconvex Stochastic Optimization [12.65903351047816]
Anderson mixing (AM) is an acceleration method for fixed-point iterations.
We propose a Stochastic Anderson Mixing (SAM) scheme to solve nonconvex stochastic optimization problems.
arXiv Detail & Related papers (2021-10-04T16:26:15Z)
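As referenced in the StableSAM entry above, the following is a minimal sketch of the renormalization idea: the surrogate gradient computed at the perturbed point is rescaled so its norm matches that of the exact minibatch gradient before the descent step. The function name, hyperparameter values, and plain SGD descent step are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def stable_sam_update(w, grad_fn, lr=0.1, rho=0.05, eps=1e-12):
    """SAM descent step with the surrogate gradient renormalized to the
    norm of the exact gradient (sketch of the StableSAM idea)."""
    g = grad_fn(w)                              # exact minibatch gradient
    e = rho * g / (np.linalg.norm(g) + eps)     # SAM ascent perturbation
    g_sam = grad_fn(w + e)                      # surrogate gradient at perturbed point
    scale = np.linalg.norm(g) / (np.linalg.norm(g_sam) + eps)  # renormalization factor
    return w - lr * scale * g_sam               # descent with renormalized gradient

# Example: one step on f(w) = 0.5 * ||w||^2
w_next = stable_sam_update(np.ones(3), lambda w: w)
print(w_next)
```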
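As referenced in the Sparse SAM (SSAM) entry above, the sketch below illustrates perturbing only a masked subset of weights. The random binary mask used here is a placeholder assumption; the paper selects the mask more carefully rather than at random.

```python
import numpy as np

def sparse_sam_perturbation(g, sparsity=0.5, rho=0.05, eps=1e-12, rng=None):
    """SAM-style perturbation restricted to a binary-masked subset of
    weights (sketch). The random mask is a placeholder for the paper's
    learned/selected mask."""
    rng = rng or np.random.default_rng()
    mask = (rng.random(g.shape) < sparsity).astype(g.dtype)    # binary mask
    g_masked = mask * g                                        # sparse gradient
    return rho * g_masked / (np.linalg.norm(g_masked) + eps)   # sparse perturbation

# Example: perturbation that touches roughly half of the coordinates
print(sparse_sam_perturbation(np.ones(8)))
```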