Flatness is a False Friend
- URL: http://arxiv.org/abs/2006.09091v1
- Date: Tue, 16 Jun 2020 11:55:24 GMT
- Title: Flatness is a False Friend
- Authors: Diego Granziol
- Abstract summary: Hessian-based measures of flatness have been argued for, used, and shown to relate to generalisation.
We show that for feed-forward neural networks under the cross-entropy loss, low-loss solutions with large weights are expected to have small Hessian-based measures of flatness.
- Score: 0.7614628596146599
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hessian-based measures of flatness, such as the trace, Frobenius and spectral
norms, have been argued, used and shown to relate to generalisation. In this
paper we demonstrate that for feed-forward neural networks under the
cross-entropy loss, we would expect low-loss solutions with large weights to have
small Hessian-based measures of flatness. This implies that solutions obtained
using $L_2$ regularisation should in principle be sharper than those without,
despite generalising better. We show this to be true for logistic regression,
multi-layer perceptrons, simple convolutional, pre-activated and wide residual
networks on the MNIST and CIFAR-$100$ datasets. Furthermore, we show that
adaptive optimisation algorithms using iterate averaging, on the VGG-$16$
network and CIFAR-$100$ dataset, achieve superior generalisation to SGD but are
$30\times$ sharper. This theoretical finding, along with experimental results,
raises serious questions about the validity of Hessian-based sharpness measures
in the discussion of generalisation. We further show that the Hessian rank can
be bounded by a constant times the number of neurons multiplied by the number
of classes, which in practice is often a small fraction of the network
parameters. This explains the curious observation, reported in the literature,
that many Hessian eigenvalues are either zero or very near zero.
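The logistic regression case already illustrates the central claim, and a few lines of NumPy make it concrete. The sketch below is illustrative rather than the paper's code: the synthetic separable data, the function names, and the choice of $\alpha = 10$ are assumptions. For binary cross-entropy the Hessian is $H = \frac{1}{n} X^\top \mathrm{diag}\left(p_i(1-p_i)\right) X$, so rescaling a low-loss solution $w \mapsto \alpha w$ with $\alpha > 1$ lowers the loss on separable data while every factor $p_i(1-p_i)$, and with it the trace and spectral norm of $H$, collapses towards zero.
```python
# Minimal NumPy sketch (illustrative, not the paper's code) of the logistic regression
# case: a low-loss solution scaled up by alpha > 1 attains even lower cross-entropy
# loss while its Hessian-based sharpness (trace, largest eigenvalue) collapses,
# because H = (1/n) X^T diag(p_i (1 - p_i)) X and p_i (1 - p_i) -> 0 as logits grow.
import numpy as np

rng = np.random.default_rng(0)

def make_separable_data(n=200, d=5):
    """Two Gaussian blobs, well separated along the first coordinate (assumed setup)."""
    X = rng.normal(size=(n, d))
    y = (rng.random(n) < 0.5).astype(float)
    X[:, 0] += np.where(y == 1.0, 4.0, -4.0)
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, X, y):
    p = sigmoid(X @ w)
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))

def hessian(w, X):
    """Hessian of the mean cross-entropy: (1/n) X^T diag(p(1-p)) X."""
    p = sigmoid(X @ w)
    d = p * (1.0 - p)
    return (X * d[:, None]).T @ X / X.shape[0]

X, y = make_separable_data()
w = np.zeros(X.shape[1])
w[0] = 1.0                                   # a low-loss separating direction

for alpha in (1.0, 10.0):
    H = hessian(alpha * w, X)
    eigs = np.linalg.eigvalsh(H)
    print(f"alpha={alpha:4.1f}  loss={cross_entropy(alpha * w, X, y):.4f}  "
          f"trace(H)={np.trace(H):.3e}  lambda_max(H)={eigs.max():.3e}")

# On separable draws like this one, the larger-alpha solution has lower loss but a
# far smaller trace and spectral norm, i.e. it looks "flatter" even though the two
# classifiers make identical hard predictions.
```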
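The rank claim at the end of the abstract can also be checked numerically on a toy model. The PyTorch sketch below is again an assumption-laden illustration, not the paper's experiment: it builds a bias-free one-hidden-layer ReLU classifier, evaluates the full Hessian of the cross-entropy loss with respect to a flattened parameter vector, and counts the eigenvalues that are numerically non-zero. Under the paper's bound this count should be at most a constant times the number of neurons multiplied by the number of classes, which in this setting is far smaller than the parameter count.
```python
# Hedged PyTorch sketch (illustrative setup, not the paper's experiments) of the rank
# observation: the Hessian of the cross-entropy loss of a small bias-free one-hidden-
# layer ReLU classifier has far fewer numerically non-zero eigenvalues than parameters.
import torch

torch.manual_seed(0)
d_in, d_hidden, n_classes, n_samples = 50, 10, 3, 20
X = torch.randn(n_samples, d_in, dtype=torch.float64)
y = torch.randint(0, n_classes, (n_samples,))

shapes = [(d_hidden, d_in), (n_classes, d_hidden)]     # W1, W2
numels = [r * c for r, c in shapes]
n_params = sum(numels)                                 # 500 + 30 = 530 parameters

def loss_from_flat(theta):
    """Cross-entropy loss of the two-layer ReLU net as a function of flat parameters."""
    W1 = theta[: numels[0]].reshape(shapes[0])
    W2 = theta[numels[0]:].reshape(shapes[1])
    logits = torch.relu(X @ W1.T) @ W2.T
    return torch.nn.functional.cross_entropy(logits, y)

theta = 0.5 * torch.randn(n_params, dtype=torch.float64)

H = torch.autograd.functional.hessian(loss_from_flat, theta)   # (n_params, n_params)
eigs = torch.linalg.eigvalsh(H)
nonzero = int((eigs.abs() > 1e-8 * eigs.abs().max()).sum())

print(f"parameters                : {n_params}")
print(f"hidden neurons x classes  : {d_hidden * n_classes}   (reference scale only)")
print(f"numerically non-zero eigs : {nonzero}")
# A run in this setting typically reports far fewer non-zero eigenvalues than
# parameters, consistent with the bulk of zero / near-zero Hessian eigenvalues
# discussed in the abstract; the exact count depends on the data and the threshold.
```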
Related papers
- Error Feedback under $(L_0,L_1)$-Smoothness: Normalization and Momentum [56.37522020675243]
We provide the first proof of convergence for normalized error feedback algorithms across a wide range of machine learning problems.
We show that due to their larger allowable stepsizes, our new normalized error feedback algorithms outperform their non-normalized counterparts on various tasks.
arXiv Detail & Related papers (2024-10-22T10:19:27Z) - FAM: Relative Flatness Aware Minimization [5.132856559837775]
Optimizing for flatness was proposed as early as 1994 by Hochreiter and Schmidhuber.
Recent theoretical work suggests that a particular relative flatness measure can be connected to generalization.
We derive a regularizer based on this relative flatness that is easy to compute, fast, efficient, and works with arbitrary loss functions.
arXiv Detail & Related papers (2023-07-05T14:48:24Z) - The Inductive Bias of Flatness Regularization for Deep Matrix
Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of the minimum trace of the Hessian solutions in deep linear networks.
We show that, for all depths greater than one and with the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
arXiv Detail & Related papers (2023-06-22T23:14:57Z) - Loss Minimization Yields Multicalibration for Large Neural Networks [16.047146428254592]
Multicalibration is a notion of fairness for predictors that requires them to provide calibrated predictions across a large set of protected groups.
We show that minimizing the squared loss over all neural nets of size $n$ implies multicalibration for all but a bounded number of unlucky values of $n$.
arXiv Detail & Related papers (2023-04-19T05:16:20Z) - Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - BN-invariant sharpness regularizes the training model to better
generalization [72.97766238317081]
We propose a measure of sharpness, BN-Sharpness, which gives a consistent value for equivalent networks under BN.
We use the BN-sharpness to regularize the training and design an algorithm to minimize the new regularized objective.
arXiv Detail & Related papers (2021-01-08T10:23:24Z) - Generalized Quantile Loss for Deep Neural Networks [0.8594140167290096]
This note presents a simple way to add a count (or quantile) constraint to a regression neural net, such that given $n$ samples in the training set it guarantees that the prediction of $m < n$ samples will be larger than the actual value (the label).
Unlike standard quantile regression networks, the presented method can be applied to any loss function and not necessarily to the standard quantile regression loss, which minimizes the mean absolute differences.
arXiv Detail & Related papers (2020-12-28T16:37:02Z) - Analytic Characterization of the Hessian in Shallow ReLU Models: A Tale
of Symmetry [9.695960412426672]
We analytically characterize the Hessian at various families of spurious minima.
In particular, we prove that for $d \ge k$ standard Gaussian inputs: (a) of the $dk$ eigenvalues of the Hessian, $dk - O(d)$ concentrate near zero, (b) $\Omega(d)$ of the eigenvalues grow linearly with $k$.
arXiv Detail & Related papers (2020-08-04T20:08:35Z) - Generalization error in high-dimensional perceptrons: Approaching Bayes
error with convex optimization [37.57922952189396]
We study the generalization performance of standard classifiers in the high-dimensional regime.
We design an optimal loss and regularizer that provably leads to Bayes-optimal generalization error.
arXiv Detail & Related papers (2020-06-11T16:14:51Z)