On progressive sharpening, flat minima and generalisation
- URL: http://arxiv.org/abs/2305.14683v4
- Date: Tue, 26 Sep 2023 18:42:17 GMT
- Title: On progressive sharpening, flat minima and generalisation
- Authors: Lachlan Ewen MacDonald and Jack Valmadre and Simon Lucey
- Abstract summary: We ground an ansatz tying together the loss Hessian and the input-output Jacobian over training samples.
We then prove a series of theoretical results which quantify the degree to which the input-output Jacobian of a model approximates its Lipschitz norm.
We use our ansatz, together with our theoretical results, to give a new account of the recently observed progressive sharpening phenomenon.
- Score: 39.91683439206866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a new approach to understanding the relationship between loss
curvature and input-output model behaviour in deep learning. Specifically, we
use existing empirical analyses of the spectrum of deep network loss Hessians
to ground an ansatz tying together the loss Hessian and the input-output
Jacobian over training samples during the training of deep neural networks. We
then prove a series of theoretical results which quantify the degree to which
the input-output Jacobian of a model approximates its Lipschitz norm over a
data distribution, and deduce a novel generalisation bound in terms of the
empirical Jacobian. We use our ansatz, together with our theoretical results,
to give a new account of the recently observed progressive sharpening
phenomenon, as well as the generalisation properties of flat minima.
Experimental evidence is provided to validate our claims.
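As a companion to the abstract, the following is a minimal sketch (not the authors' code or experimental setup) of how one might monitor the two quantities the ansatz ties together during training: the sharpness, i.e. the top eigenvalue of the loss Hessian, estimated by power iteration on Hessian-vector products, and the average Frobenius norm of the input-output Jacobian over the training samples. The toy model, data and hyperparameters are illustrative assumptions.
```python
# Minimal sketch (not the paper's code): track loss sharpness and the mean
# input-output Jacobian norm during training of a toy regression model.
import torch
import torch.nn as nn
from torch.func import jacrev, vmap

torch.manual_seed(0)
X, y = torch.randn(256, 10), torch.randn(256, 1)   # toy data (assumed)
model = nn.Sequential(nn.Linear(10, 64), nn.Tanh(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()

def sharpness(model, X, y, iters=20):
    """Top eigenvalue of the loss Hessian w.r.t. parameters,
    estimated by power iteration on Hessian-vector products."""
    params = list(model.parameters())
    loss = loss_fn(model(X), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = torch.tensor(0.0)
    for _ in range(iters):
        norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
        v = [vi / norm for vi in v]
        hv = torch.autograd.grad(sum((g * vi).sum() for g, vi in zip(grads, v)),
                                 params, retain_graph=True)
        eig = sum((h * vi).sum() for h, vi in zip(hv, v))   # Rayleigh quotient
        v = [h.detach() for h in hv]
    return eig.item()

def mean_jacobian_norm(model, X):
    """Average Frobenius norm of the input-output Jacobian over the samples."""
    jac = vmap(jacrev(model))(X)          # shape (N, out_dim, in_dim)
    return jac.flatten(1).norm(dim=1).mean().item()

opt = torch.optim.SGD(model.parameters(), lr=0.05)
for step in range(501):
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()
    if step % 100 == 0:
        print(step, sharpness(model, X, y), mean_jacobian_norm(model, X))
```
Under the ansatz, one would expect these two measurements to track each other as training progresses; the sketch only shows how the quantities can be computed, not the paper's experiments.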
Related papers
- Flatness After All? [6.698677477097004]
We argue that generalization could be assessed by measuring flatness using a soft rank measure of the Hessian.
For non-calibrated models, we connect our flatness measure to the well-known Takeuchi Information Criterion and show that it still provides reliable estimates of generalization gaps for models that are not overly confident.
arXiv Detail & Related papers (2025-06-21T20:33:36Z)
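To make the entry above concrete: one common "soft rank" of a positive semi-definite matrix is sum_i lambda_i / (lambda_i + eps), a smoothed count of large eigenvalues. Whether this is the paper's exact flatness measure is an assumption here; the sketch is only meant to illustrate a spectrum-based flatness score.
```python
# Hypothetical "soft rank" flatness score (lower = flatter minimum).
# The paper's exact definition may differ; this is only an illustration.
import torch

def soft_rank(eigenvalues: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    lam = eigenvalues.clamp(min=0.0)        # treat the Hessian as PSD
    return (lam / (lam + eps)).sum()        # smoothed count of large eigenvalues

sharp_spectrum = torch.tensor([50.0, 10.0, 5.0, 1.0, 0.5])
flat_spectrum = torch.tensor([0.5, 0.1, 0.01, 0.001, 0.0])
print(soft_rank(sharp_spectrum).item(), soft_rank(flat_spectrum).item())
```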
- Toward Understanding In-context vs. In-weight Learning [50.24035812301655]
We identify simplified distributional properties that give rise to the emergence and disappearance of in-context learning.
We then extend the study to a full large language model, showing how fine-tuning on various collections of natural language prompts can elicit similar in-context and in-weight learning behaviour.
arXiv Detail & Related papers (2024-10-30T14:09:00Z)
- Gaussian Universality in Neural Network Dynamics with Generalized Structured Input Distributions [2.3020018305241337]
We analyze the behavior of a deep learning system trained on inputs modeled as Gaussian mixtures to better simulate more general structured inputs.
Under certain standardization schemes, the deep learning model converges toward the behaviour of the Gaussian setting, even when the input data follow more complex or real-world distributions.
arXiv Detail & Related papers (2024-05-01T17:10:55Z)
- On the Generalization Ability of Unsupervised Pretraining [53.06175754026037]
Recent advances in unsupervised learning have shown that unsupervised pre-training, followed by fine-tuning, can improve model generalization.
This paper introduces a novel theoretical framework that illuminates the critical factor influencing the transferability of knowledge acquired during unsupervised pre-training to the subsequent fine-tuning phase.
Our results contribute to a better understanding of the unsupervised pre-training and fine-tuning paradigm, and can shed light on the design of more effective pre-training algorithms.
arXiv Detail & Related papers (2024-03-11T16:23:42Z)
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss also makes it possible to analyze more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
- A PAC-Bayesian Perspective on the Interpolating Information Criterion [54.548058449535155]
We show how a PAC-Bayes bound is obtained for a general class of models, characterizing factors which influence performance in the interpolating regime.
We quantify how the test error for overparameterized models achieving effectively zero training error depends on the quality of the implicit regularization imposed by, e.g., the combination of model and parameter-initialization scheme.
arXiv Detail & Related papers (2023-11-13T01:48:08Z)
- Sparsity-aware generalization theory for deep neural networks [12.525959293825318]
We present a new approach to analyzing generalization for deep feed-forward ReLU networks.
We show fundamental trade-offs between sparsity and generalization.
arXiv Detail & Related papers (2023-07-01T20:59:05Z)
- Fine-grained analysis of non-parametric estimation for pairwise learning [9.676007573960383]
We are concerned with the generalization performance of non-parametric estimation for pairwise learning.
Our results can be used to handle a wide range of pairwise learning problems, including ranking, AUC maximization, pairwise regression, and metric and similarity learning.
arXiv Detail & Related papers (2023-05-31T08:13:14Z)
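For the entry above, the sketch below illustrates the pairwise-learning setting with an AUC-style hinge surrogate evaluated on all (positive, negative) pairs; it is an illustration of the problem class, not the paper's estimator, and the scores and labels are toy assumptions.
```python
# Illustrative pairwise objective (not the paper's estimator): a hinge
# surrogate over all (positive, negative) pairs, as in ranking / AUC learning.
import torch

def pairwise_hinge_loss(scores, labels, margin=1.0):
    pos = scores[labels == 1]                      # scores of positive examples
    neg = scores[labels == 0]                      # scores of negative examples
    diffs = pos.unsqueeze(1) - neg.unsqueeze(0)    # all pairwise score gaps
    return torch.clamp(margin - diffs, min=0).mean()

scores = torch.tensor([2.0, 0.3, 1.1, -0.4])       # toy scores (assumed)
labels = torch.tensor([1, 0, 1, 0])                # toy binary labels
print(pairwise_hinge_loss(scores, labels).item())
```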
- With Greater Distance Comes Worse Performance: On the Perspective of Layer Utilization and Model Generalization [3.6321778403619285]
Generalization of deep neural networks remains one of the main open problems in machine learning.
Early layers generally learn representations relevant to performance on both training data and testing data.
Deeper layers only minimize training risks and fail to generalize well on testing or mislabeled data.
arXiv Detail & Related papers (2022-01-28T05:26:32Z)
- The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks [51.1848572349154]
Neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z)
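The entry above studies interpolating two-layer linear networks trained with gradient flow on the squared loss. The sketch below reproduces that setting only loosely, using small-step gradient descent as a stand-in for gradient flow and measuring the excess risk empirically; the teacher, noise level and width are illustrative assumptions, not the paper's bounds.
```python
# Sketch of the benign-overfitting setting: a two-layer *linear* network is
# trained to (near-)interpolate noisy labels with small-step gradient descent
# (a stand-in for gradient flow), and excess risk is measured on clean data.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n = 50, 40                                   # overparameterised: n < d
w_star = torch.randn(d) / d ** 0.5              # ground-truth linear teacher
X = torch.randn(n, d)
y = X @ w_star + 0.3 * torch.randn(n)           # noisy training labels
X_test = torch.randn(2000, d)
y_test = X_test @ w_star                        # clean test labels

net = nn.Sequential(nn.Linear(d, 64, bias=False), nn.Linear(64, 1, bias=False))
opt = torch.optim.SGD(net.parameters(), lr=1e-2)

for _ in range(10000):
    opt.zero_grad()
    train_loss = ((net(X).squeeze(-1) - y) ** 2).mean()
    train_loss.backward()
    opt.step()

with torch.no_grad():
    excess_risk = ((net(X_test).squeeze(-1) - y_test) ** 2).mean()
print(f"train loss {train_loss.item():.2e}, excess risk {excess_risk.item():.3f}")
```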
- Applicability of Random Matrix Theory in Deep Learning [0.966840768820136]
We investigate the local spectral statistics of the loss surface Hessians of artificial neural networks.
Our results shed new light on the applicability of Random Matrix Theory to modelling neural networks.
We propose a novel model for the true loss surfaces of neural networks.
arXiv Detail & Related papers (2021-02-12T19:49:19Z)
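For the last entry, the sketch below computes one standard local spectral statistic, the ratio of consecutive eigenvalue spacings, for the full parameter-space Hessian of a very small network; the model, data and choice of statistic are illustrative assumptions rather than the paper's procedure.
```python
# Sketch (illustrative, not the paper's experiments): local spectral statistics
# of a small network's loss Hessian via consecutive-spacing ratios
# r_i = min(s_i, s_{i+1}) / max(s_i, s_{i+1}); the mean of r is roughly 0.54
# for GOE-like (Wigner) statistics and roughly 0.39 for Poisson statistics.
import torch
import torch.nn as nn
from torch.func import functional_call, hessian

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1))
X, y = torch.randn(128, 4), torch.randn(128, 1)

names = [n for n, _ in model.named_parameters()]
shapes = [p.shape for p in model.parameters()]
sizes = [p.numel() for p in model.parameters()]
theta0 = torch.cat([p.detach().flatten() for p in model.parameters()])

def loss_flat(theta):
    # Rebuild the parameter dict from a flat vector so we can take a full Hessian.
    chunks = torch.split(theta, sizes)
    params = {n: c.reshape(s) for n, c, s in zip(names, chunks, shapes)}
    return nn.functional.mse_loss(functional_call(model, params, (X,)), y)

H = hessian(loss_flat)(theta0)          # full (P, P) Hessian; P is small here
evals = torch.linalg.eigvalsh(H)        # eigenvalues in ascending order
s = torch.diff(evals)                   # level spacings
r = torch.minimum(s[:-1], s[1:]) / torch.maximum(s[:-1], s[1:])
print(r.mean().item())
```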