Critical Influence of Overparameterization on Sharpness-aware Minimization
- URL: http://arxiv.org/abs/2311.17539v3
- Date: Thu, 20 Jun 2024 01:40:54 GMT
- Title: Critical Influence of Overparameterization on Sharpness-aware Minimization
- Authors: Sungbin Shin, Dongyeop Lee, Maksym Andriushchenko, Namhoon Lee
- Abstract summary: We show that the sharpness-aware minimization (SAM) strategy is affected by overparameterization.
We prove multiple theoretical benefits of overparameterization for SAM to attain (i) minima with more uniform Hessian moments compared to SGD, (ii) much faster convergence at a linear rate, and (iii) lower test error for two-layer networks.
- Score: 12.321517302762558
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training an overparameterized neural network can yield minimizers of different generalization capabilities despite the same level of training loss. Meanwhile, with evidence that suggests a strong correlation between the sharpness of minima and their generalization errors, increasing efforts have been made to develop optimization methods to explicitly find flat minima as more generalizable solutions. Despite its contemporary relevance to overparameterization, however, this sharpness-aware minimization (SAM) strategy has not been studied much yet as to exactly how it is affected by overparameterization. Hence, in this work, we analyze SAM under overparameterization of varying degrees and present both empirical and theoretical results that indicate a critical influence of overparameterization on SAM. At first, we conduct extensive numerical experiments across vision, language, graph, and reinforcement learning domains and show that SAM consistently improves with overparameterization. Next, we attribute this phenomenon to the interplay between the enlarged solution space and increased implicit bias from overparameterization. Further, we prove multiple theoretical benefits of overparameterization for SAM to attain (i) minima with more uniform Hessian moments compared to SGD, (ii) much faster convergence at a linear rate, and (iii) lower test error for two-layer networks. Last but not least, we discover that the effect of overparameterization is more significantly pronounced in practical settings of label noise and sparsity, and yet, sufficient regularization is necessary.
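For readers unfamiliar with the method the abstract refers to, the following is a minimal sketch of the commonly used two-step SAM update: perturb the weights along the normalized gradient by a radius, then descend using the gradient taken at the perturbed point. It is not the authors' code. The function name `sam_step`, the parameters `lr` and `rho`, and the toy quadratic loss are all assumptions chosen for illustration.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One SAM update (illustrative sketch; names and defaults are assumptions)."""
    g = grad_fn(w)
    # Ascent step: move to the approximate worst-case point within an L2 ball of radius rho.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Descent step: update the original weights with the gradient at the perturbed point.
    return w - lr * grad_fn(w + eps)

# Toy quadratic loss with one sharp and one flat direction (hypothetical example).
A = np.diag([10.0, 0.1])

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

w = np.array([1.0, 1.0])
initial_loss = loss(w)
for _ in range(100):
    w = sam_step(w, grad)
final_loss = loss(w)
```

With a fixed `rho`, the iterates in this toy setting hover near the minimum rather than converging to it exactly, since the perturbation radius does not shrink as the gradient does.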
Related papers
- $\boldsymbol{\mu}\mathbf{P^2}$: Effective Sharpness Aware Minimization Requires Layerwise Perturbation Scaling [49.25546155981064]
We study the infinite-width limit of neural networks trained with Sharpness Aware Minimization (SAM).
Our findings reveal that the dynamics of standard SAM effectively reduce to applying SAM solely in the last layer in wide neural networks.
In contrast, we identify a stable parameterization with layerwise perturbation scaling, which we call Maximal Update and Perturbation ($\mu$P$^2$), that ensures all layers are both feature learning and effectively perturbed in the limit.
arXiv Detail & Related papers (2024-10-31T16:32:04Z)
- A Universal Class of Sharpness-Aware Minimization Algorithms [57.29207151446387]
We introduce a new class of sharpness measures, leading to new sharpness-aware objective functions.
We prove that these measures are universally expressive, allowing any function of the training loss Hessian matrix to be represented by appropriate hyperparameters.
arXiv Detail & Related papers (2024-06-06T01:52:09Z)
- Normalization Layers Are All That Sharpness-Aware Minimization Needs [53.799769473526275]
Sharpness-aware minimization (SAM) was proposed to reduce sharpness of minima.
We show that perturbing only the affine normalization parameters (typically comprising 0.1% of the total parameters) in the adversarial step of SAM can outperform perturbing all of the parameters.
arXiv Detail & Related papers (2023-06-07T08:05:46Z)
- The Crucial Role of Normalization in Sharpness-Aware Minimization [44.00155917998616]
Sharpness-Aware Minimization (SAM) is a gradient-based optimizer that greatly improves the prediction performance of neural networks.
We argue that two properties of normalization make SAM robust against the choice of hyperparameters.
arXiv Detail & Related papers (2023-05-24T16:09:41Z)
- AdaSAM: Boosting Sharpness-Aware Minimization with Adaptive Learning Rate and Momentum for Training Deep Neural Networks [76.90477930208982]
Sharpness-aware minimization (SAM) has been extensively explored because it can improve generalization when training deep neural networks.
Integrating SAM with an adaptive learning rate and momentum acceleration, dubbed AdaSAM, has already been explored.
We conduct experiments on several NLP tasks, which show that AdaSAM could achieve superior performance compared with SGD, AMSGrad, and SAMsGrad.
arXiv Detail & Related papers (2023-03-01T15:12:42Z)
- Stability Analysis of Sharpness-Aware Minimization [5.024497308975435]
Sharpness-aware minimization (SAM) is a recently proposed training method that seeks to find flat minima in deep learning.
In this paper, we demonstrate that SAM dynamics can have convergence instability that occurs near a saddle point.
arXiv Detail & Related papers (2023-01-16T08:42:40Z)
- Sharpness-Aware Training for Free [163.1248341911413]
Sharpness-Aware Minimization (SAM) has shown that minimizing a sharpness measure, which reflects the geometry of the loss landscape, can significantly reduce the generalization error.
Sharpness-Aware Training for Free (SAF) mitigates the sharp landscape at almost zero additional computational cost over the base optimizer.
SAF ensures convergence to a flat minimum with improved generalization capabilities.
arXiv Detail & Related papers (2022-05-27T16:32:43Z)
- Parameters or Privacy: A Provable Tradeoff Between Overparameterization and Membership Inference [29.743945643424553]
Overparameterized models generalize well (small error on the test data) even when trained to memorize the training data (zero error on the training data).
This has led to an arms race towards increasingly overparameterized models (cf. deep learning).
arXiv Detail & Related papers (2022-02-02T19:00:21Z)
- Efficient Sharpness-aware Minimization for Improved Training of Neural Networks [146.2011175973769]
This paper proposes the Efficient Sharpness Aware Minimizer (ESAM), which boosts SAM's efficiency at no cost to its generalization performance.
ESAM includes two novel and efficient training strategies: Stochastic Weight Perturbation and Sharpness-Sensitive Data Selection.
We show, via extensive experiments on the CIFAR and ImageNet datasets, that ESAM improves on SAM's efficiency, reducing the extra computation required from 100% to 40% relative to base optimizers.
arXiv Detail & Related papers (2021-10-07T02:20:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.