Flatness is a False Friend
- URL: http://arxiv.org/abs/2006.09091v1
- Date: Tue, 16 Jun 2020 11:55:24 GMT
- Title: Flatness is a False Friend
- Authors: Diego Granziol
- Abstract summary: Hessian-based measures of flatness have been argued for, used, and shown to relate to generalisation.
We show that for feed-forward neural networks under the cross-entropy loss, low-loss solutions with large weights are expected to have small Hessian-based measures of flatness.
- Score: 0.7614628596146599
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hessian-based measures of flatness, such as the trace, Frobenius and spectral
norms, have been argued, used and shown to relate to generalisation. In this
paper we demonstrate that for feed-forward neural networks under the
cross-entropy loss, we would expect low-loss solutions with large weights to have
small Hessian-based measures of flatness. This implies that solutions obtained
using $L_2$ regularisation should in principle be sharper than those without,
despite generalising better. We show this to be true for logistic regression,
multi-layer perceptrons, simple convolutional, pre-activated and wide residual
networks on the MNIST and CIFAR-$100$ datasets. Furthermore, we show that
adaptive optimisation algorithms using iterate averaging, on the VGG-$16$
network and CIFAR-$100$ dataset, achieve superior generalisation to SGD but are
$30\times$ sharper. This theoretical finding, along with experimental results,
raises serious questions about the validity of Hessian-based sharpness measures
in the discussion of generalisation. We further show that the Hessian rank can
be bounded by a constant times the number of neurons multiplied by the number
of classes, which in practice is often a small fraction of the network
parameters. This explains the curious observation, reported in the literature,
that many Hessian eigenvalues are either zero or very near zero.
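The logistic regression case already illustrates the central claim, and a few lines of NumPy make it concrete. The sketch below is illustrative rather than the paper's code: the synthetic separable data, the function names, and the choice of $\alpha = 10$ are assumptions. For binary cross-entropy the Hessian is $H = \frac{1}{n} X^\top \mathrm{diag}\left(p_i(1-p_i)\right) X$, so rescaling a low-loss solution $w \mapsto \alpha w$ with $\alpha > 1$ lowers the loss on separable data while every factor $p_i(1-p_i)$, and with it the trace and spectral norm of $H$, collapses towards zero.
```python
# Minimal NumPy sketch (illustrative, not the paper's code) of the logistic regression
# case: a low-loss solution scaled up by alpha > 1 attains even lower cross-entropy
# loss while its Hessian-based sharpness (trace, largest eigenvalue) collapses,
# because H = (1/n) X^T diag(p_i (1 - p_i)) X and p_i (1 - p_i) -> 0 as logits grow.
import numpy as np

rng = np.random.default_rng(0)

def make_separable_data(n=200, d=5):
    """Two Gaussian blobs, well separated along the first coordinate (assumed setup)."""
    X = rng.normal(size=(n, d))
    y = (rng.random(n) < 0.5).astype(float)
    X[:, 0] += np.where(y == 1.0, 4.0, -4.0)
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, X, y):
    p = sigmoid(X @ w)
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))

def hessian(w, X):
    """Hessian of the mean cross-entropy: (1/n) X^T diag(p(1-p)) X."""
    p = sigmoid(X @ w)
    d = p * (1.0 - p)
    return (X * d[:, None]).T @ X / X.shape[0]

X, y = make_separable_data()
w = np.zeros(X.shape[1])
w[0] = 1.0                                   # a low-loss separating direction

for alpha in (1.0, 10.0):
    H = hessian(alpha * w, X)
    eigs = np.linalg.eigvalsh(H)
    print(f"alpha={alpha:4.1f}  loss={cross_entropy(alpha * w, X, y):.4f}  "
          f"trace(H)={np.trace(H):.3e}  lambda_max(H)={eigs.max():.3e}")

# On separable draws like this one, the larger-alpha solution has lower loss but a
# far smaller trace and spectral norm, i.e. it looks "flatter" even though the two
# classifiers make identical hard predictions.
```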
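The rank claim at the end of the abstract can also be checked numerically on a toy model. The PyTorch sketch below is again an assumption-laden illustration, not the paper's experiment: it builds a bias-free one-hidden-layer ReLU classifier, evaluates the full Hessian of the cross-entropy loss with respect to a flattened parameter vector, and counts the eigenvalues that are numerically non-zero. Under the paper's bound this count should be at most a constant times the number of neurons multiplied by the number of classes, which in this setting is far smaller than the parameter count.
```python
# Hedged PyTorch sketch (illustrative setup, not the paper's experiments) of the rank
# observation: the Hessian of the cross-entropy loss of a small bias-free one-hidden-
# layer ReLU classifier has far fewer numerically non-zero eigenvalues than parameters.
import torch

torch.manual_seed(0)
d_in, d_hidden, n_classes, n_samples = 50, 10, 3, 20
X = torch.randn(n_samples, d_in, dtype=torch.float64)
y = torch.randint(0, n_classes, (n_samples,))

shapes = [(d_hidden, d_in), (n_classes, d_hidden)]     # W1, W2
numels = [r * c for r, c in shapes]
n_params = sum(numels)                                 # 500 + 30 = 530 parameters

def loss_from_flat(theta):
    """Cross-entropy loss of the two-layer ReLU net as a function of flat parameters."""
    W1 = theta[: numels[0]].reshape(shapes[0])
    W2 = theta[numels[0]:].reshape(shapes[1])
    logits = torch.relu(X @ W1.T) @ W2.T
    return torch.nn.functional.cross_entropy(logits, y)

theta = 0.5 * torch.randn(n_params, dtype=torch.float64)

H = torch.autograd.functional.hessian(loss_from_flat, theta)   # (n_params, n_params)
eigs = torch.linalg.eigvalsh(H)
nonzero = int((eigs.abs() > 1e-8 * eigs.abs().max()).sum())

print(f"parameters                : {n_params}")
print(f"hidden neurons x classes  : {d_hidden * n_classes}   (reference scale only)")
print(f"numerically non-zero eigs : {nonzero}")
# A run in this setting typically reports far fewer non-zero eigenvalues than
# parameters, consistent with the bulk of zero / near-zero Hessian eigenvalues
# discussed in the abstract; the exact count depends on the data and the threshold.
```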
Related papers
- Error Feedback under $(L_0,L_1)$-Smoothness: Normalization and Momentum [56.37522020675243]
We provide the first proof of convergence for normalized error feedback algorithms across a wide range of machine learning problems.
We show that due to their larger allowable stepsizes, our new normalized error feedback algorithms outperform their non-normalized counterparts on various tasks.
arXiv Detail & Related papers (2024-10-22T10:19:27Z) - FAM: Relative Flatness Aware Minimization [5.132856559837775]
Optimizing for flatness was proposed as early as 1994 by Hochreiter and Schmidhuber.
Recent theoretical work suggests that a particular relative flatness measure can be connected to generalization.
We derive a regularizer based on this relative flatness that is easy to compute, fast, efficient, and works with arbitrary loss functions.
arXiv Detail & Related papers (2023-07-05T14:48:24Z) - The Inductive Bias of Flatness Regularization for Deep Matrix
Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of the minimum trace of the Hessian solutions in deep linear networks.
We show that, for all depths greater than one and with the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
arXiv Detail & Related papers (2023-06-22T23:14:57Z) - Loss Minimization Yields Multicalibration for Large Neural Networks [16.047146428254592]
Multicalibration is a notion of fairness for predictors that requires them to provide calibrated predictions across a large set of protected groups.
We show that minimizing the squared loss over all neural nets of size $n$ implies multicalibration for all but a bounded number of unlucky values of $n$.
arXiv Detail & Related papers (2023-04-19T05:16:20Z) - Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - BN-invariant sharpness regularizes the training model to better
generalization [72.97766238317081]
We propose a measure of sharpness, BN-Sharpness, which gives a consistent value for equivalent networks under BN.
We use the BN-sharpness to regularize the training and design an algorithm to minimize the new regularized objective.
arXiv Detail & Related papers (2021-01-08T10:23:24Z) - Generalized Quantile Loss for Deep Neural Networks [0.8594140167290096]
This note presents a simple way to add a count (or quantile) constraint to a regression neural net, such that given $n$ samples in the training set it guarantees that the prediction of $m < n$ samples will be larger than the actual value (the label).
Unlike standard quantile regression networks, the presented method can be applied to any loss function and not necessarily to the standard quantile regression loss, which minimizes the mean absolute differences.
arXiv Detail & Related papers (2020-12-28T16:37:02Z) - Analytic Characterization of the Hessian in Shallow ReLU Models: A Tale
of Symmetry [9.695960412426672]
We analytically characterize the Hessian at various families of spurious minima.
In particular, we prove that for $d \ge k$ standard Gaussian inputs: (a) of the $dk$ eigenvalues of the Hessian, $dk - O(d)$ concentrate near zero, (b) $\Omega(d)$ of the eigenvalues grow linearly with $k$.
arXiv Detail & Related papers (2020-08-04T20:08:35Z) - Generalization error in high-dimensional perceptrons: Approaching Bayes
error with convex optimization [37.57922952189396]
We study the generalization performance of standard classifiers in the high-dimensional regime.
We design an optimal loss and regularizer that provably leads to Bayes-optimal generalization error.
arXiv Detail & Related papers (2020-06-11T16:14:51Z)