Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To
Achieve Better Generalization
- URL: http://arxiv.org/abs/2307.11007v2
- Date: Sun, 23 Jul 2023 03:59:44 GMT
- Title: Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To
Achieve Better Generalization
- Authors: Kaiyue Wen, Zhiyuan Li, Tengyu Ma
- Abstract summary: Existing theory shows that common stochastic optimizers prefer flatter minimizers of the training loss.
This work critically examines this explanation.
Our results suggest that the relationship between sharpness and generalization subtly depends on the data.
- Score: 29.90109733192208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite extensive studies, the underlying reason as to why overparameterized
neural networks can generalize remains elusive. Existing theory shows that
common stochastic optimizers prefer flatter minimizers of the training loss,
and thus a natural potential explanation is that flatness implies
generalization. This work critically examines this explanation. Through
theoretical and empirical investigation, we identify the following three
scenarios for two-layer ReLU networks: (1) flatness provably implies
generalization; (2) there exist non-generalizing flattest models and sharpness
minimization algorithms fail to generalize, and (3) perhaps most surprisingly,
there exist non-generalizing flattest models, but sharpness minimization
algorithms still generalize. Our results suggest that the relationship between
sharpness and generalization subtly depends on the data distributions and the
model architectures, and that sharpness minimization algorithms do not only
minimize sharpness to achieve better generalization. This calls for the search for other
explanations for the generalization of over-parameterized neural networks.
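To make the object of study concrete, the following is a minimal sketch of a SAM-style sharpness minimization update on a toy two-layer ReLU network; the data, network width, and hyperparameters are illustrative assumptions, not the paper's experimental setup.

```python
# Minimal sketch of a SAM-style sharpness minimization step on a toy
# two-layer ReLU network. Data, width, rho, and lr are illustrative
# assumptions, not the paper's experimental setup.
import torch
import torch.nn as nn

torch.manual_seed(0)
X, y = torch.randn(128, 10), torch.randn(128, 1)   # toy regression data
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()
rho, lr = 0.05, 0.1   # perturbation radius and step size

for step in range(200):
    # 1) gradient at the current weights
    model.zero_grad()
    loss_fn(model(X), y).backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]
    scale = rho / (torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12)

    # 2) ascend to an approximate worst-case point inside the rho-ball
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(scale * g)

    # 3) gradient at the perturbed weights
    model.zero_grad()
    loss_fn(model(X), y).backward()

    # 4) undo the perturbation, then descend with the perturbed gradient
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(scale * g)
        for p in model.parameters():
            p.sub_(lr * p.grad)
```

The ascent step moves to an approximate worst-case point in a rho-ball around the current weights, and the descent step applies the gradient taken there, biasing training toward flatter minimizers.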
Related papers
- Exact, Tractable Gauss-Newton Optimization in Deep Reversible Architectures Reveal Poor Generalization [52.16435732772263]
Second-order optimization has been shown to accelerate the training of deep neural networks in many applications.
However, generalization properties of second-order methods are still being debated.
We show for the first time that exact Gauss-Newton (GN) updates take on a tractable form in a class of deep architectures.
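For context, a generic Gauss-Newton step on a toy nonlinear least-squares problem is sketched below; this only illustrates what a GN update is, and is not the tractable form the paper derives for deep reversible architectures.

```python
# Generic Gauss-Newton step on a toy nonlinear least-squares fit of
# y = a * exp(b * x). Purely illustrative; not the paper's derivation.
import torch

torch.manual_seed(0)
x = torch.linspace(0.0, 1.0, 50)
y = 2.0 * torch.exp(-3.0 * x) + 0.01 * torch.randn(50)

theta = torch.tensor([1.0, -1.0])   # initial guess for (a, b)

def residuals(theta):
    a, b = theta
    return a * torch.exp(b * x) - y

for _ in range(20):
    r = residuals(theta)
    J = torch.autograd.functional.jacobian(residuals, theta)        # 50 x 2
    # GN step: solve (J^T J) delta = J^T r (small damping for stability)
    delta = torch.linalg.solve(J.T @ J + 1e-8 * torch.eye(2), J.T @ r)
    theta = theta - delta
```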
arXiv Detail & Related papers (2024-11-12T17:58:40Z)
- A Universal Class of Sharpness-Aware Minimization Algorithms [57.29207151446387]
We introduce a new class of sharpness measures, leading to new sharpness-aware objective functions.
We prove that these measures are universally expressive, allowing any function of the training loss Hessian matrix to be represented by appropriate hyperparameters.
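As a concrete, hedged illustration of sharpness measures defined through the training-loss Hessian, the sketch below forms the full parameter Hessian of a tiny two-layer ReLU network and evaluates two common spectral instances, the trace and the largest eigenvalue; the paper's general parametrized family is not reproduced here.

```python
# Sketch: sharpness measures as functions of the training-loss Hessian,
# computed exactly for a network small enough to form the full Hessian.
# The tiny architecture and data are illustrative assumptions.
import torch

torch.manual_seed(0)
X, y = torch.randn(32, 3), torch.randn(32, 1)
shapes = [(3, 4), (4,), (4, 1), (1,)]                 # W1, b1, W2, b2
sizes = [int(torch.tensor(s).prod()) for s in shapes]

def loss_of(flat):
    W1, b1, W2, b2 = (t.reshape(s) for t, s in zip(flat.split(sizes), shapes))
    h = torch.relu(X @ W1 + b1)
    return ((h @ W2 + b2 - y) ** 2).mean()

flat = torch.randn(sum(sizes))
H = torch.autograd.functional.hessian(loss_of, flat)  # full parameter Hessian
eigs = torch.linalg.eigvalsh(H)

trace_sharpness = eigs.sum()     # trace of the Hessian
top_sharpness = eigs.max()       # largest eigenvalue (classical sharpness)
```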
arXiv Detail & Related papers (2024-06-06T01:52:09Z)
- FAM: Relative Flatness Aware Minimization [5.132856559837775]
Optimizing for flatness has been proposed as early as 1994 by Hochreiter and Schmidhuber.
Recent theoretical work suggests that a particular relative flatness measure can be connected to generalization.
We derive a regularizer based on this relative flatness that is easy to compute, fast, efficient, and works with arbitrary loss functions.
arXiv Detail & Related papers (2023-07-05T14:48:24Z)
- The Inductive Bias of Flatness Regularization for Deep Matrix Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of minimum-trace-of-the-Hessian solutions in deep linear networks.
We show that for all depths greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
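For reference, the Schatten 1-norm (nuclear norm) in question is simply the sum of singular values of the end-to-end matrix obtained by multiplying all layers of the deep linear network; a minimal sketch with illustrative layer shapes:

```python
# Schatten 1-norm (nuclear norm) of the end-to-end matrix of a deep linear
# network. Layer shapes are illustrative assumptions.
import torch

torch.manual_seed(0)
W1, W2, W3 = torch.randn(20, 10), torch.randn(20, 20), torch.randn(10, 20)
end_to_end = W3 @ W2 @ W1                              # product of all layers
schatten_1 = torch.linalg.svdvals(end_to_end).sum()    # sum of singular values
```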
arXiv Detail & Related papers (2023-06-22T23:14:57Z)
- Theoretical Characterization of How Neural Network Pruning Affects its Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z)
- GA-SAM: Gradient-Strength based Adaptive Sharpness-Aware Minimization for Improved Generalization [22.53923556656022]
The Sharpness-Aware Minimization (SAM) algorithm has shown state-of-the-art generalization abilities in vision tasks.
However, it is difficult to apply SAM to some natural language tasks, especially to models with drastic gradient changes, such as RNNs.
We propose a Gradient-Strength based Adaptive Sharpness-Aware Minimization (GA-SAM) algorithm to help learning algorithms find flat minima that generalize better.
arXiv Detail & Related papers (2022-10-13T10:44:10Z)
- Predicting Unreliable Predictions by Shattering a Neural Network [145.3823991041987]
Piecewise linear neural networks can be split into subfunctions.
Each subfunction has its own activation pattern, domain, and empirical error.
Empirical error for the full network can be written as an expectation over subfunctions.
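A hedged sketch of that decomposition on a toy one-hidden-layer ReLU network: samples are grouped by their hidden activation pattern (each group is one subfunction's domain), and the full empirical error is recovered as the pattern-frequency-weighted average of per-subfunction errors. The toy data and sizes are assumptions.

```python
# Split a small ReLU network into subfunctions by activation pattern and
# write the empirical error as a weighted average over subfunctions.
import torch
import torch.nn as nn

torch.manual_seed(0)
X, y = torch.randn(256, 2), torch.randn(256, 1)
hidden, out = nn.Linear(2, 6), nn.Linear(6, 1)

with torch.no_grad():
    h_pre = hidden(X)
    pattern = (h_pre > 0).long()                 # activation pattern per sample
    err = (out(torch.relu(h_pre)) - y) ** 2      # per-sample squared error

    # group samples by activation pattern; each group is one subfunction's domain
    patterns, inverse = torch.unique(pattern, dim=0, return_inverse=True)
    per_subfn_err = torch.stack([err[inverse == k].mean() for k in range(len(patterns))])
    weights = torch.bincount(inverse).float() / len(X)

    full_error = err.mean()                      # equals (weights * per_subfn_err).sum()
```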
arXiv Detail & Related papers (2021-06-15T18:34:41Z)
- Deep learning: a statistical viewpoint [120.94133818355645]
Deep learning has revealed some major surprises from a theoretical perspective.
In particular, simple gradient methods easily find near-perfect solutions to non-convex training problems.
We conjecture that specific principles underlie these phenomena.
arXiv Detail & Related papers (2021-03-16T16:26:36Z)
- Flatness is a False Friend [0.7614628596146599]
Hessian-based measures of flatness have been argued, used, and shown to relate to generalisation.
We show that for feed-forward neural networks under the cross-entropy loss, we would expect low-loss solutions with large weights to have small Hessian-based measures of flatness.
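The mechanism can be illustrated numerically with a small sketch (not the paper's experiment): for a model that already classifies its training points correctly, scaling up the final-layer weights drives the cross-entropy loss and the trace of its Hessian toward zero, so the large-weight solution registers as very flat.

```python
# Scaling the last-layer weights of a correctly-classifying model shrinks
# both the cross-entropy loss and its Hessian trace (a flatness proxy).
# Toy features and sizes are illustrative assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
feats = torch.relu(torch.randn(64, 10))          # fixed features from a ReLU layer
W = torch.randn(10, 3)                           # last-layer weights
y = (feats @ W).argmax(dim=1)                    # labels this model already gets right

def ce_loss(W_flat):
    return F.cross_entropy(feats @ W_flat.reshape(10, 3), y)

for alpha in [1.0, 4.0, 16.0]:                   # grow the weights
    W_scaled = (alpha * W).flatten()
    H = torch.autograd.functional.hessian(ce_loss, W_scaled)
    print(alpha, ce_loss(W_scaled).item(), torch.diagonal(H).sum().item())
```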
arXiv Detail & Related papers (2020-06-16T11:55:24Z)
- Entropic gradient descent algorithms and wide flat minima [6.485776570966397]
We show analytically that there exist Bayes optimal pointwise estimators which correspond to minimizers belonging to wide flat regions.
We extend the analysis to the deep learning scenario by extensive numerical validations.
An easy-to-compute flatness measure shows a clear correlation with test accuracy.
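One simple flatness measure of this kind, sketched below under illustrative assumptions (the paper's exact measure may differ), is the average increase in training loss when the weights are perturbed by Gaussian noise of a fixed relative scale; wide flat minima show only a small increase.

```python
# A simple, easy-to-compute flatness proxy: average training-loss increase
# under random Gaussian perturbations of the weights. Generic sketch, not
# necessarily the exact measure used in the paper.
import torch
import torch.nn as nn

def perturbation_flatness(model, loss_fn, X, y, sigma=0.01, n_samples=20):
    with torch.no_grad():
        base = loss_fn(model(X), y).item()
        increases = []
        for _ in range(n_samples):
            backup = [p.detach().clone() for p in model.parameters()]
            for p in model.parameters():
                p.add_(sigma * p.abs().mean() * torch.randn_like(p))
            increases.append(loss_fn(model(X), y).item() - base)
            for p, b in zip(model.parameters(), backup):
                p.copy_(b)
    return sum(increases) / n_samples   # smaller = flatter

# usage on a toy two-layer ReLU network
torch.manual_seed(0)
X, y = torch.randn(128, 10), torch.randn(128, 1)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
score = perturbation_flatness(model, nn.MSELoss(), X, y)
```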
arXiv Detail & Related papers (2020-06-14T13:22:19Z)