FAM: Relative Flatness Aware Minimization
- URL: http://arxiv.org/abs/2307.02337v1
- Date: Wed, 5 Jul 2023 14:48:24 GMT
- Title: FAM: Relative Flatness Aware Minimization
- Authors: Linara Adilova, Amr Abourayya, Jianning Li, Amin Dada, Henning Petzka,
Jan Egger, Jens Kleesiek, Michael Kamp
- Abstract summary: Optimizing for flatness has been proposed as early as 1994 by Hochreiter and Schmidhuber.
Recent theoretical work suggests that a particular relative flatness measure can be connected to generalization.
We derive a regularizer based on this relative flatness that is easy to compute, fast, efficient, and works with arbitrary loss functions.
- Score: 5.132856559837775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Flatness of the loss curve around a model at hand has been shown to
empirically correlate with its generalization ability. Optimizing for flatness
has been proposed as early as 1994 by Hochreiter and Schmidhuber, and was
followed by more recent successful sharpness-aware optimization techniques.
Their widespread adoption in practice, though, is questionable because of the
lack of a theoretically grounded connection between flatness and
generalization, in particular in light of the reparameterization curse: certain
reparameterizations of a neural network change most flatness measures but do
not change generalization. Recent theoretical work suggests that a particular
relative flatness measure can be connected to generalization and solves the
reparameterization curse. In this paper, we derive a regularizer based on this
relative flatness that is easy to compute, fast, efficient, and works with
arbitrary loss functions. It requires computing the Hessian only of a single
layer of the network, which makes it applicable to large neural networks and
thereby avoids an expensive mapping of the loss surface in the vicinity of the
model. In an extensive empirical evaluation we show that this relative flatness
aware minimization (FAM) improves generalization in a multitude of applications
and models, both in finetuning and standard training. We make the code
available on GitHub.
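To make the single-layer Hessian idea concrete, the following sketch shows one way such a relative-flatness penalty could be implemented in PyTorch. It is not the authors' released FAM code: the function name relative_flatness_penalty, the Hutchinson trace estimator, the simplified ||W||^2 * tr(H) form of the measure, and the choice of which layer to regularize are all illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def relative_flatness_penalty(loss, weight, n_probes=1):
    # Illustrative sketch, not the authors' FAM implementation: estimates
    # ||W||^2 * tr(H), where H is the Hessian of `loss` w.r.t. `weight` only.
    # Gradient w.r.t. the chosen layer, kept in the graph for double backward.
    grad = torch.autograd.grad(loss, weight, create_graph=True)[0]
    trace_est = loss.new_zeros(())
    for _ in range(n_probes):
        # Rademacher probe (+1/-1 entries) for the Hutchinson trace estimator.
        v = torch.randint_like(weight, high=2).mul_(2.0).sub_(1.0)
        # Hessian-vector product: differentiate (grad * v).sum() w.r.t. weight.
        hv = torch.autograd.grad(grad, weight, grad_outputs=v,
                                 create_graph=True)[0]
        trace_est = trace_est + (v * hv).sum()
    trace_est = trace_est / n_probes
    # Scaling by the squared weight norm is what makes the measure "relative":
    # it compensates for layer-wise rescalings that change curvature alone.
    return weight.pow(2).sum() * trace_est

# Hypothetical usage: regularize the last linear layer during standard training.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
loss = F.cross_entropy(model(x), y)
penalty = relative_flatness_penalty(loss, model[2].weight)
(loss + 0.01 * penalty).backward()

Restricting the Hessian to a single layer and estimating its trace with Hessian-vector products keeps the overhead to a few extra backward passes per step, which is consistent with the abstract's point that an expensive mapping of the loss surface around the model is unnecessary.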
Related papers
- Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To
Achieve Better Generalization [29.90109733192208]
Existing theory shows that common architectures prefer flatter minimizers of the training loss.
This work critically examines this explanation.
Our results suggest that the relationship between sharpness and generalization subtly depends on the data.
arXiv Detail & Related papers (2023-07-20T16:34:58Z)
- The Inductive Bias of Flatness Regularization for Deep Matrix Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of the minimum trace of the Hessian solutions in deep linear networks.
We show that for all depths greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
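In symbols, assuming notation not given in this summary (a depth-$L$ linear network with layer matrices $W_1,\dots,W_L$, parameters $\theta$, training loss $\mathcal{L}$, and the minimum taken over interpolating solutions), the claimed equivalence reads roughly

  $\min_{\theta:\,\mathcal{L}(\theta)=0} \operatorname{tr}\!\big(\nabla_\theta^2 \mathcal{L}(\theta)\big) \;\approx\; \min_{\theta:\,\mathcal{L}(\theta)=0} \big\lVert W_L W_{L-1} \cdots W_1 \big\rVert_{S_1},$

where $\lVert\cdot\rVert_{S_1}$ denotes the Schatten 1-norm (nuclear norm) of the end-to-end matrix.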
arXiv Detail & Related papers (2023-06-22T23:14:57Z)
- A Modern Look at the Relationship between Sharpness and Generalization [64.03012884804458]
Sharpness of minima is a promising quantity that can correlate with generalization in deep networks.
Sharpness is not invariant under reparametrizations of neural networks.
We show that sharpness does not correlate well with generalization.
arXiv Detail & Related papers (2023-02-14T12:38:12Z)
- Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- Optimizing Mode Connectivity via Neuron Alignment [84.26606622400423]
Empirically, the local minima of loss functions can be connected by a learned curve in model space along which the loss remains nearly constant.
We propose a more general framework to investigate the effect of symmetry on landscape connectivity by accounting for the weight permutations of the networks being connected.
arXiv Detail & Related papers (2020-09-05T02:25:23Z)
- Flatness is a False Friend [0.7614628596146599]
Hessian-based measures of flatness have been argued, used, and shown to relate to generalisation.
We show that for feedforward neural networks under the cross-entropy loss, we would expect low-loss solutions with large weights to have small Hessian-based measures of flatness.
arXiv Detail & Related papers (2020-06-16T11:55:24Z)
- Overparameterization and generalization error: weighted trigonometric interpolation [4.631723879329972]
We study a random Fourier series model, where the task is to estimate the unknown Fourier coefficients from equidistant samples.
We show precisely how a bias towards smooth interpolants, in the form of weighted trigonometric interpolation, can lead to smaller generalization error.
arXiv Detail & Related papers (2020-06-15T15:53:22Z)
- Entropic gradient descent algorithms and wide flat minima [6.485776570966397]
We show analytically that there exist Bayes optimal pointwise estimators which correspond to minimizers belonging to wide flat regions.
We extend the analysis to the deep learning scenario by extensive numerical validations.
An easy-to-compute flatness measure shows a clear correlation with test accuracy.
arXiv Detail & Related papers (2020-06-14T13:22:19Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.