AdaSAM: Boosting Sharpness-Aware Minimization with Adaptive Learning
Rate and Momentum for Training Deep Neural Networks
- URL: http://arxiv.org/abs/2303.00565v1
- Date: Wed, 1 Mar 2023 15:12:42 GMT
- Title: AdaSAM: Boosting Sharpness-Aware Minimization with Adaptive Learning
Rate and Momentum for Training Deep Neural Networks
- Authors: Hao Sun, Li Shen, Qihuang Zhong, Liang Ding, Shixiang Chen, Jingwei
Sun, Jing Li, Guangzhong Sun, Dacheng Tao
- Abstract summary: Sharpness-aware minimization (SAM) has been extensively explored because it improves generalization when training deep neural networks.
Integrating SAM with an adaptive learning rate and momentum acceleration, dubbed AdaSAM, has already been explored empirically.
We conduct experiments on several NLP tasks, which show that AdaSAM achieves superior performance compared with the SGD, AMSGrad, and SAM optimizers.
- Score: 76.90477930208982
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The sharpness-aware minimization (SAM) optimizer has been extensively
explored because it improves generalization when training deep neural networks by
introducing extra perturbation steps that flatten the loss landscape of deep
learning models. Integrating SAM with an adaptive learning rate and momentum
acceleration, dubbed AdaSAM, has already been explored empirically to train
large-scale deep neural networks, but without a theoretical guarantee, owing to
the difficulty of jointly analyzing the coupled perturbation step, adaptive
learning rate, and momentum step. In this paper, we analyze the convergence rate
of AdaSAM in the stochastic non-convex setting. We theoretically show that AdaSAM
admits a $\mathcal{O}(1/\sqrt{bT})$ convergence rate, which achieves a linear
speedup with respect to the mini-batch size $b$. Specifically, to decouple the
stochastic gradient steps from the adaptive learning rate and the perturbed
gradient, we introduce a delayed second-order momentum term that makes them
independent when taking expectations during the analysis. We then bound these
terms by showing that the adaptive learning rate has a bounded range, which makes
our analysis feasible. To the best of our knowledge, we are the first to provide
a non-trivial convergence rate for SAM with an adaptive learning rate and
momentum acceleration. Finally, we conduct experiments on several NLP tasks,
which show that AdaSAM achieves superior performance compared with the SGD,
AMSGrad, and SAM optimizers.
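To make the update concrete, here is a minimal sketch of one AdaSAM-style iteration: a SAM perturbation step followed by an Adam-style adaptive-learning-rate and momentum update. The Adam-style estimator, hyperparameter names, and bias correction are illustrative assumptions, not the paper's reference implementation (whose analysis relies on a delayed second-order momentum term).

```python
import numpy as np

def adasam_step(w, grad_fn, m, v, t, lr=1e-3, rho=0.05,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative AdaSAM-style iteration: SAM perturbation followed by
    an Adam-style adaptive-learning-rate and momentum update.
    grad_fn(w) returns a stochastic mini-batch gradient at w."""
    # SAM ascent step: perturb the weights toward higher loss.
    g = grad_fn(w)
    e = rho * g / (np.linalg.norm(g) + 1e-12)
    # Gradient at the perturbed point drives the descent step.
    g_sam = grad_fn(w + e)
    # Momentum (first moment) and adaptive term (second moment).
    m = beta1 * m + (1 - beta1) * g_sam
    v = beta2 * v + (1 - beta2) * g_sam ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction (illustrative)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Example: a few steps on the toy quadratic loss f(w) = 0.5 * ||w||^2,
# whose gradient is simply w.
w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 4):
    w, m, v = adasam_step(w, lambda x: x, m, v, t)
```

In this sketch, the term $\sqrt{\hat{v}} + \epsilon$ plays the role of the adaptive learning rate whose bounded range the analysis above relies on.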
Related papers
- $\boldsymbol{\mu}\mathbf{P}^2$: Effective Sharpness Aware Minimization Requires Layerwise Perturbation Scaling [49.25546155981064]
We study the infinite-width limit of neural networks trained with Sharpness Aware Minimization (SAM)
Our findings reveal that the dynamics of standard SAM effectively reduce to applying SAM solely in the last layer in wide neural networks.
In contrast, we identify a stable parameterization with layerwise perturbation scaling, which we call Maximal Update and Perturbation ($\mu$P$^2$), that ensures all layers are both feature learning and effectively perturbed in the limit.
arXiv Detail & Related papers (2024-10-31T16:32:04Z)
- SAMPa: Sharpness-aware Minimization Parallelized [51.668052890249726]
Sharpness-aware minimization (SAM) has been shown to improve the generalization of neural networks.
Each SAM update requires sequentially computing two gradients, effectively doubling the per-iteration cost.
We propose a simple modification of SAM, termed SAMPa, which allows us to fully parallelize the two gradient computations.
arXiv Detail & Related papers (2024-10-14T16:21:23Z)
- Sharpness-Aware Minimization Efficiently Selects Flatter Minima Late in Training [47.25594539120258]
We find that Sharpness-Aware Minimization (SAM) efficiently selects flatter minima late in training.
Even a few epochs of SAM applied at the end of training yield nearly the same generalization and solution sharpness as full SAM training.
We conjecture that the optimization method chosen in the late phase is more crucial in shaping the final solution's properties.
arXiv Detail & Related papers (2024-10-14T10:56:42Z)
- Asymptotic Unbiased Sample Sampling to Speed Up Sharpness-Aware Minimization [17.670203551488218]
We propose Asymptotic Unbiased Sampling to accelerate Sharpness-Aware Minimization (AUSAM)
AUSAM maintains the model's generalization capacity while significantly enhancing computational efficiency.
As a plug-and-play, architecture-agnostic method, our approach consistently accelerates SAM across a range of tasks and networks.
arXiv Detail & Related papers (2024-06-12T08:47:44Z)
- Stabilizing Sharpness-aware Minimization Through A Simple Renormalization Strategy [12.050160495730381]
Sharpness-aware minimization (SAM) has attracted much attention because of its surprising effectiveness in improving generalization performance.
We propose a simple renormalization strategy, dubbed Stable SAM (SSAM), so that the gradient norm of the descent step stays the same as that of the ascent step.
Our strategy is easy to implement and flexible enough to integrate with SAM and its variants, at almost no computational cost.
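As a rough illustration of the renormalization idea stated above, the following sketch rescales the descent-step gradient so its norm matches that of the ascent-step gradient; the function and variable names are illustrative, not taken from the paper's code.

```python
import numpy as np

def stable_sam_gradient(w, grad_fn, rho=0.05):
    """Sketch: rescale the descent-step gradient so that its norm equals
    the norm of the ascent-step gradient."""
    g_ascent = grad_fn(w)                       # gradient for the ascent (perturbation) step
    e = rho * g_ascent / (np.linalg.norm(g_ascent) + 1e-12)
    g_descent = grad_fn(w + e)                  # gradient at the perturbed weights
    scale = np.linalg.norm(g_ascent) / (np.linalg.norm(g_descent) + 1e-12)
    return scale * g_descent                    # pass this to SGD/Adam as the update direction
```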
arXiv Detail & Related papers (2024-01-14T10:53:36Z)
- Systematic Investigation of Sparse Perturbed Sharpness-Aware Minimization Optimizer [158.2634766682187]
Deep neural networks often suffer from poor generalization due to complex and non-convex loss landscapes.
Sharpness-Aware Minimization (SAM) is a popular solution that smooths the loss by minimizing the change of the loss landscape when adding a perturbation.
In this paper, we propose Sparse SAM (SSAM), an efficient and effective training scheme that achieves perturbation by a binary mask.
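As a rough sketch of a binary-mask perturbation of this kind (the mask rule here, keeping the largest-magnitude gradient coordinates, is an assumption for illustration rather than the selection strategy studied in the paper):

```python
import numpy as np

def sparse_perturbation(g, rho=0.05, sparsity=0.5):
    """Sketch: apply the SAM perturbation only to the coordinates selected by
    a binary mask; here the mask keeps the largest-magnitude gradient entries.
    g is a 1-D (flattened) stochastic gradient."""
    k = max(1, int(round((1.0 - sparsity) * g.size)))
    mask = np.zeros_like(g)
    mask[np.argsort(np.abs(g))[-k:]] = 1.0      # binary mask over coordinates
    g_masked = mask * g
    return rho * g_masked / (np.linalg.norm(g_masked) + 1e-12)
```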
arXiv Detail & Related papers (2023-06-30T09:33:41Z)
- Towards Efficient and Scalable Sharpness-Aware Minimization [81.22779501753695]
We propose a novel algorithm, LookSAM, that only periodically calculates the inner gradient ascent step.
LookSAM achieves similar accuracy gains to SAM while being tremendously faster.
We are the first to successfully scale up the batch size when training Vision Transformers (ViTs)
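A simplified sketch of the "periodic inner ascent" idea described above: refresh the SAM perturbation only every k steps and reuse it in between. This is a simplification for illustration; the LookSAM paper reuses a decomposed component of the SAM update rather than the raw perturbation.

```python
import numpy as np

def periodic_sam_training(w, grad_fn, steps, k=5, rho=0.05, lr=0.1):
    """Sketch of the 'periodic inner ascent' idea: refresh the SAM
    perturbation only every k steps and reuse it in between, so most
    iterations need a single gradient computation."""
    e = np.zeros_like(w)
    for t in range(steps):
        if t % k == 0:                          # full SAM step: refresh the perturbation
            g = grad_fn(w)
            e = rho * g / (np.linalg.norm(g) + 1e-12)
        g_sam = grad_fn(w + e)                  # descent gradient at the perturbed point
        w = w - lr * g_sam
    return w
```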
arXiv Detail & Related papers (2022-03-05T11:53:37Z)
- Stochastic Anderson Mixing for Nonconvex Stochastic Optimization [12.65903351047816]
Anderson mixing (AM) is an acceleration method for fixed-point iterations.
We propose a Stochastic Anderson Mixing (SAM) scheme to solve nonconvex stochastic optimization problems.
arXiv Detail & Related papers (2021-10-04T16:26:15Z)