FAM: Relative Flatness Aware Minimization
- URL: http://arxiv.org/abs/2307.02337v1
- Date: Wed, 5 Jul 2023 14:48:24 GMT
- Title: FAM: Relative Flatness Aware Minimization
- Authors: Linara Adilova, Amr Abourayya, Jianning Li, Amin Dada, Henning Petzka,
Jan Egger, Jens Kleesiek, Michael Kamp
- Abstract summary: Optimizing for flatness has been proposed as early as 1994 by Hochreiter and Schmidhuber.
Recent theoretical work suggests that a particular relative flatness measure can be connected to generalization.
We derive a regularizer based on this relative flatness that is easy to compute, fast, efficient, and works with arbitrary loss functions.
- Score: 5.132856559837775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Flatness of the loss curve around a model at hand has been shown to
empirically correlate with its generalization ability. Optimizing for flatness
has been proposed as early as 1994 by Hochreiter and Schmidhuber, and was
followed by more recent successful sharpness-aware optimization techniques.
Their widespread adoption in practice, though, is questionable because of the
lack of a theoretically grounded connection between flatness and
generalization, in particular in light of the reparameterization curse: certain
reparameterizations of a neural network change most flatness measures but do
not change generalization. Recent theoretical work suggests that a particular
relative flatness measure can be connected to generalization and solves the
reparameterization curse. In this paper, we derive a regularizer based on this
relative flatness that is easy to compute, fast, efficient, and works with
arbitrary loss functions. It requires computing the Hessian only of a single
layer of the network, which makes it applicable to large neural networks and
thereby avoids an expensive mapping of the loss surface in the vicinity of the
model. In an extensive empirical evaluation we show that this relative flatness
aware minimization (FAM) improves generalization in a multitude of applications
and models, both in finetuning and standard training. We make the code
available on GitHub.
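To make the single-layer Hessian idea concrete, the following sketch shows one way such a relative-flatness penalty could be implemented in PyTorch. It is not the authors' released FAM code: the function name relative_flatness_penalty, the Hutchinson trace estimator, the simplified ||W||^2 * tr(H) form of the measure, and the choice of which layer to regularize are all illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def relative_flatness_penalty(loss, weight, n_probes=1):
    # Illustrative sketch, not the authors' FAM implementation: estimates
    # ||W||^2 * tr(H), where H is the Hessian of `loss` w.r.t. `weight` only.
    # Gradient w.r.t. the chosen layer, kept in the graph for double backward.
    grad = torch.autograd.grad(loss, weight, create_graph=True)[0]
    trace_est = loss.new_zeros(())
    for _ in range(n_probes):
        # Rademacher probe (+1/-1 entries) for the Hutchinson trace estimator.
        v = torch.randint_like(weight, high=2).mul_(2.0).sub_(1.0)
        # Hessian-vector product: differentiate (grad * v).sum() w.r.t. weight.
        hv = torch.autograd.grad(grad, weight, grad_outputs=v,
                                 create_graph=True)[0]
        trace_est = trace_est + (v * hv).sum()
    trace_est = trace_est / n_probes
    # Scaling by the squared weight norm is what makes the measure "relative":
    # it compensates for layer-wise rescalings that change curvature alone.
    return weight.pow(2).sum() * trace_est

# Hypothetical usage: regularize the last linear layer during standard training.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
loss = F.cross_entropy(model(x), y)
penalty = relative_flatness_penalty(loss, model[2].weight)
(loss + 0.01 * penalty).backward()

Restricting the Hessian to a single layer and estimating its trace with Hessian-vector products keeps the overhead to a few extra backward passes per step, which is consistent with the abstract's point that an expensive mapping of the loss surface around the model is unnecessary.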
Related papers
- Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To
Achieve Better Generalization [29.90109733192208]
Existing theory shows that common architectures prefer flatter minimizers of the training loss.
This work critically examines this explanation.
Our results suggest that the relationship between sharpness and generalization subtly depends on the data.
arXiv Detail & Related papers (2023-07-20T16:34:58Z)
- The Inductive Bias of Flatness Regularization for Deep Matrix Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of the minimum trace of the Hessian solutions in deep linear networks.
We show that for all depths greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
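In symbols, assuming notation not given in this summary (a depth-$L$ linear network with layer matrices $W_1,\dots,W_L$, parameters $\theta$, training loss $\mathcal{L}$, and the minimum taken over interpolating solutions), the claimed equivalence reads roughly

  $\min_{\theta:\,\mathcal{L}(\theta)=0} \operatorname{tr}\!\big(\nabla_\theta^2 \mathcal{L}(\theta)\big) \;\approx\; \min_{\theta:\,\mathcal{L}(\theta)=0} \big\lVert W_L W_{L-1} \cdots W_1 \big\rVert_{S_1},$

where $\lVert\cdot\rVert_{S_1}$ denotes the Schatten 1-norm (nuclear norm) of the end-to-end matrix.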
arXiv Detail & Related papers (2023-06-22T23:14:57Z)
- A Modern Look at the Relationship between Sharpness and Generalization [64.03012884804458]
Sharpness of minima is a promising quantity that can correlate with generalization in deep networks.
Sharpness is not invariant under reparametrizations of neural networks.
We show that sharpness does not correlate well with generalization.
arXiv Detail & Related papers (2023-02-14T12:38:12Z)
- Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- Optimizing Mode Connectivity via Neuron Alignment [84.26606622400423]
Empirically, the local minima of loss functions can be connected by a learned curve in model space along which the loss remains nearly constant.
We propose a more general framework to investigate the effect of symmetry on landscape connectivity by accounting for the weight permutations of the networks being connected.
arXiv Detail & Related papers (2020-09-05T02:25:23Z)
- Flatness is a False Friend [0.7614628596146599]
Hessian-based measures of flatness have been argued, used, and shown to relate to generalisation.
We show that for feedforward neural networks under the cross-entropy loss, we would expect low-loss solutions with large weights to have small Hessian-based measures of flatness.
arXiv Detail & Related papers (2020-06-16T11:55:24Z)
- Overparameterization and generalization error: weighted trigonometric interpolation [4.631723879329972]
We study a random Fourier series model, where the task is to estimate the unknown Fourier coefficients from equidistant samples.
We show precisely how a bias towards smooth interpolants, in the form of weighted trigonometric interpolation, can lead to smaller generalization error.
arXiv Detail & Related papers (2020-06-15T15:53:22Z)
- Entropic gradient descent algorithms and wide flat minima [6.485776570966397]
We show analytically that there exist Bayes optimal pointwise estimators which correspond to minimizers belonging to wide flat regions.
We extend the analysis to the deep learning scenario by extensive numerical validations.
An easy-to-compute flatness measure shows a clear correlation with test accuracy.
arXiv Detail & Related papers (2020-06-14T13:22:19Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.