Why Flatness Correlates With Generalization For Deep Neural Networks
- URL: http://arxiv.org/abs/2103.06219v1
- Date: Wed, 10 Mar 2021 17:44:52 GMT
- Title: Why Flatness Correlates With Generalization For Deep Neural Networks
- Authors: Shuofeng Zhang, Isaac Reid, Guillermo Valle Pérez, Ard Louis
- Abstract summary: We argue that local flatness measures correlate with generalization because they are local approximations to a global property.
For functions that give zero error on a test set, this volume is directly proportional to the Bayesian posterior.
Some variants of SGD can break the flatness-generalization correlation, while the volume-generalization correlation remains intact.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The intuition that local flatness of the loss landscape is correlated with
better generalization for deep neural networks (DNNs) has been explored for
decades, spawning many different local flatness measures. Here we argue that
these measures correlate with generalization because they are local
approximations to a global property, the volume of the set of parameters
mapping to a specific function. This global volume is equivalent to the
Bayesian prior upon initialization. For functions that give zero error on a
test set, it is directly proportional to the Bayesian posterior, making volume
a more robust and theoretically better grounded predictor of generalization
than flatness. Whilst flatness measures fail under parameter re-scaling, volume
remains invariant and therefore continues to correlate well with
generalization. Moreover, some variants of SGD can break the
flatness-generalization correlation, while the volume-generalization
correlation remains intact.
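To make the re-scaling argument concrete, here is a minimal numpy sketch (not taken from the paper): multiplying the first layer of a two-layer ReLU network by a constant and dividing the second layer by the same constant leaves the function, and hence any function-based quantity such as volume, unchanged, while a simple perturbation-based sharpness proxy measured at a fixed radius does change. All names and hyperparameters below are illustrative.

```python
# Minimal sketch (not from the paper): a two-layer ReLU network is invariant
# under the rescaling W1 -> a*W1, W2 -> W2/a, but a perturbation-based
# sharpness proxy at a fixed perturbation radius is not.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, n = 5, 50, 200
X = rng.normal(size=(n, d_in))
W1 = rng.normal(size=(d_h, d_in)) / np.sqrt(d_in)
W2 = rng.normal(size=(1, d_h)) / np.sqrt(d_h)
y = X @ rng.normal(size=(d_in, 1))            # arbitrary regression targets

def forward(W1_, W2_, X_):
    return np.maximum(X_ @ W1_.T, 0.0) @ W2_.T  # f(x) = W2 relu(W1 x)

def mse(W1_, W2_):
    return np.mean((forward(W1_, W2_, X) - y) ** 2)

def sharpness(W1_, W2_, radius=0.1, trials=200):
    """Mean loss increase under random parameter perturbations of fixed norm."""
    local = np.random.default_rng(1)           # same directions for every call
    base = mse(W1_, W2_)
    theta = np.concatenate([W1_.ravel(), W2_.ravel()])
    rises = []
    for _ in range(trials):
        u = local.normal(size=theta.shape)
        u *= radius / np.linalg.norm(u)
        p = theta + u
        P1 = p[: W1_.size].reshape(W1_.shape)
        P2 = p[W1_.size:].reshape(W2_.shape)
        rises.append(mse(P1, P2) - base)
    return float(np.mean(rises))

alpha = 10.0
W1s, W2s = alpha * W1, W2 / alpha              # rescaled parameters, same function
assert np.allclose(forward(W1, W2, X), forward(W1s, W2s, X))
print("sharpness (original):", sharpness(W1, W2))
print("sharpness (rescaled):", sharpness(W1s, W2s))
```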
Related papers
- Generalized Laplace Approximation [23.185126261153236]
We introduce a unified theoretical framework to attribute Bayesian inconsistency to model misspecification and inadequate priors.
We propose the generalized Laplace approximation, which involves a simple adjustment to the Hessian matrix of the regularized loss function.
We assess the performance and properties of the generalized Laplace approximation on state-of-the-art neural networks and real-world datasets.
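For background on the baseline being generalized, the following is a hedged sketch of the standard Laplace approximation for Bayesian logistic regression, where the posterior is approximated by a Gaussian whose covariance is the inverse Hessian of the regularized loss at the MAP estimate; the paper's generalized adjustment to that Hessian is not reproduced here, and all names and settings are illustrative.

```python
# Background sketch of the *standard* Laplace approximation for Bayesian
# logistic regression (the paper's generalized variant is not reproduced).
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 300, 5, 1.0                        # lam = Gaussian prior precision

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (rng.random(n) < sigmoid(X @ w_true)).astype(float)

# MAP estimate via a few Newton steps on the regularized negative log-likelihood.
w = np.zeros(d)
for _ in range(25):
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) + lam * w
    H = X.T @ (X * (p * (1 - p))[:, None]) + lam * np.eye(d)
    w -= np.linalg.solve(H, grad)

# Laplace approximation: posterior ~ N(w_MAP, H^{-1}), with H the Hessian of
# the regularized loss evaluated at the MAP estimate.
p = sigmoid(X @ w)
H = X.T @ (X * (p * (1 - p))[:, None]) + lam * np.eye(d)
posterior_cov = np.linalg.inv(H)
print("MAP weights        :", np.round(w, 3))
print("posterior std. dev.:", np.round(np.sqrt(np.diag(posterior_cov)), 3))
```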
arXiv Detail & Related papers (2024-05-22T11:11:42Z) - FAM: Relative Flatness Aware Minimization [5.132856559837775]
Optimizing for flatness was proposed as early as 1994 by Hochreiter and Schmidhuber.
Recent theoretical work suggests that a particular relative flatness measure can be connected to generalization.
We derive a regularizer based on this relative flatness that is easy to compute, fast, efficient, and works with arbitrary loss functions.
arXiv Detail & Related papers (2023-07-05T14:48:24Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
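The bounds themselves are derived via optimal transport; as a loose empirical companion, the sketch below probes the local Lipschitz regularity of a fixed predictor by sampling small input perturbations around test points. The predictor and all settings are illustrative assumptions, not the paper's construction.

```python
# Crude empirical probe (not the paper's bound): estimate the local Lipschitz
# constant of a fixed predictor around each test point by sampling small
# input perturbations and taking the largest observed slope.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 10, 64
W1 = rng.normal(size=(d_h, d_in)) / np.sqrt(d_in)
W2 = rng.normal(size=(1, d_h)) / np.sqrt(d_h)

def predict(X):
    return np.maximum(X @ W1.T, 0.0) @ W2.T    # a fixed two-layer ReLU predictor

def local_lipschitz(x, radius=0.1, samples=500):
    """max_i |f(x + delta_i) - f(x)| / ||delta_i|| over random small deltas."""
    fx = predict(x[None, :])
    deltas = rng.normal(size=(samples, x.size))
    deltas *= radius / np.linalg.norm(deltas, axis=1, keepdims=True)
    slopes = np.abs(predict(x[None, :] + deltas) - fx).ravel() / radius
    return float(slopes.max())

X_test = rng.normal(size=(20, d_in))
estimates = [local_lipschitz(x) for x in X_test]
print("local Lipschitz estimates (min/mean/max):",
      round(min(estimates), 3),
      round(float(np.mean(estimates)), 3),
      round(max(estimates), 3))
```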
arXiv Detail & Related papers (2022-11-02T16:39:42Z) - On generalization bounds for deep networks based on loss surface
implicit regularization [5.68558935178946]
Modern deep neural networks generalize well despite a large number of parameters.
That modern deep neural networks generalize well despite a large number of parameters contradicts the classical statistical learning theory.
arXiv Detail & Related papers (2022-01-12T16:41:34Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study the properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds for RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
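As an illustration of the double-descent shape (using the closed-form minimum-norm least-squares fit rather than the paper's SGD analysis), the toy sketch below sweeps the number of random ReLU features past the number of training samples, the interpolation threshold where the test-error peak typically appears. The data-generating model and all settings are arbitrary choices.

```python
# Toy random-features regression (closed-form min-norm least squares, not SGD)
# to inspect how the test error behaves as the number of random features
# crosses the number of training samples.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 20
w_star = rng.normal(size=d)

def make_data(n):
    X = rng.normal(size=(n, d))
    y = X @ w_star + 0.5 * rng.normal(size=n)
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

def rf_test_error(num_features, seed=0):
    W = np.random.default_rng(seed).normal(size=(d, num_features)) / np.sqrt(d)
    phi_tr = np.maximum(X_tr @ W, 0.0)         # ReLU random features
    phi_te = np.maximum(X_te @ W, 0.0)
    # Minimum-norm least-squares fit in feature space.
    coef, *_ = np.linalg.lstsq(phi_tr, y_tr, rcond=None)
    return float(np.mean((phi_te @ coef - y_te) ** 2))

for p in [10, 25, 50, 75, 90, 100, 110, 150, 300, 1000]:
    print(f"features={p:5d}  test MSE={rf_test_error(p):.3f}")
```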
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Interpolation can hurt robust generalization even when there is no noise [76.3492338989419]
We show that avoiding interpolation through ridge regularization can significantly improve generalization even in the absence of noise.
We prove this phenomenon for the robust risk of both linear regression and classification and hence provide the first theoretical result on robust overfitting.
arXiv Detail & Related papers (2021-08-05T23:04:15Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central in preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD and that of ordinary least squares.
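A small sketch of the kind of comparison referred to here, contrasting tail-averaged constant-stepsize SGD with the minimum-norm least-squares interpolator in an overparameterized linear regression; the covariance spectrum, step size, and dimensions are arbitrary assumptions, and the sketch does not reproduce the paper's analysis.

```python
# Sketch of the comparison: constant-stepsize SGD with tail averaging vs. the
# minimum-norm least-squares solution in overparameterized linear regression.
import numpy as np

rng = np.random.default_rng(0)
n, d, noise = 100, 400, 0.1                    # d > n: overparameterized
cov = np.diag(1.0 / np.arange(1, d + 1) ** 2)  # fast-decaying covariance spectrum
X = rng.normal(size=(n, d)) @ np.sqrt(cov)
w_star = rng.normal(size=d)
y = X @ w_star + noise * rng.normal(size=n)
X_te = rng.normal(size=(2000, d)) @ np.sqrt(cov)
y_te = X_te @ w_star

def test_mse(w):
    return float(np.mean((X_te @ w - y_te) ** 2))

# Minimum-norm ordinary least squares (interpolates the training data).
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Constant-stepsize SGD over single samples, with tail averaging.
w = np.zeros(d)
step, epochs = 0.1, 100
tail, tail_count = np.zeros(d), 0
for epoch in range(epochs):
    for i in rng.permutation(n):
        w -= step * (X[i] @ w - y[i]) * X[i]
        if epoch >= epochs // 2:               # average the second half of iterates
            tail += w
            tail_count += 1
w_sgd = tail / tail_count

print("test MSE, min-norm OLS     :", round(test_mse(w_ols), 4))
print("test MSE, tail-averaged SGD:", round(test_mse(w_sgd), 4))
```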
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - Implicit Regularization in ReLU Networks with the Square Loss [56.70360094597169]
We show that it is impossible to characterize the implicit regularization with the square loss by any explicit function of the model parameters.
Our results suggest that a more general framework may be needed to understand implicit regularization for nonlinear predictors.
arXiv Detail & Related papers (2020-12-09T16:48:03Z) - Entropic gradient descent algorithms and wide flat minima [6.485776570966397]
We show analytically that there exist Bayes optimal pointwise estimators which correspond to minimizers belonging to wide flat regions.
We extend the analysis to the deep learning scenario through extensive numerical validation.
An easy to compute flatness measure shows a clear correlation with test accuracy.
arXiv Detail & Related papers (2020-06-14T13:22:19Z) - Relative Flatness and Generalization [31.307340632319583]
Flatness of the loss curve is conjectured to be connected to the generalization ability of machine learning models.
It is still an open theoretical problem why and under which circumstances flatness is connected to generalization.
arXiv Detail & Related papers (2020-01-03T11:39:03Z)