Model, sample, and epoch-wise descents: exact solution of gradient flow
in the random feature model
- URL: http://arxiv.org/abs/2110.11805v1
- Date: Fri, 22 Oct 2021 14:25:54 GMT
- Title: Model, sample, and epoch-wise descents: exact solution of gradient flow
in the random feature model
- Authors: Antoine Bodin and Nicolas Macris
- Abstract summary: We analyze the whole temporal behavior of the generalization and training errors under gradient flow.
We show that in the limit of large system size the full time-evolution path of both errors can be calculated analytically.
Our techniques are based on Cauchy complex integral representations of the errors together with recent random matrix methods based on linear pencils.
- Score: 16.067228939231047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent evidence has shown the existence of a so-called double-descent and
even triple-descent behavior for the generalization error of deep-learning
models. This important phenomenon commonly appears in implemented neural
network architectures, and also seems to emerge in epoch-wise curves during the
training process. A recent line of research has highlighted that random matrix
tools can be used to obtain precise analytical asymptotics of the
generalization (and training) errors of the random feature model. In this
contribution, we analyze the whole temporal behavior of the generalization and
training errors under gradient flow for the random feature model. We show that
in the asymptotic limit of large system size the full time-evolution path of
both errors can be calculated analytically. This allows us to observe how the
double and triple descents develop over time, if and when early stopping is an
option, and also observe time-wise descent structures. Our techniques are based
on Cauchy complex integral representations of the errors together with recent
random matrix methods based on linear pencils.
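The setting of the paper can be illustrated with a small numerical sketch (not the paper's analytical solution): a random feature model whose readout weights are trained by gradient descent on squared loss, a discretization of gradient flow, with the training and test errors tracked over training time. The dimensions, the tanh nonlinearity, and all hyperparameters below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions (small for a quick illustration; the paper's theory is asymptotic)
d, p, n = 50, 100, 80           # input dim, number of random features, samples
noise = 0.3                     # label noise level
lr, steps = 0.05, 2000          # gradient-descent step size and iterations

# Teacher: noisy linear target y = <beta, x>/sqrt(d) + noise
beta = rng.standard_normal(d)
X_train = rng.standard_normal((n, d))
X_test = rng.standard_normal((2000, d))
y_train = X_train @ beta / np.sqrt(d) + noise * rng.standard_normal(n)
y_test = X_test @ beta / np.sqrt(d)

# Random feature map phi(x) = tanh(F x / sqrt(d)); F is drawn once and frozen
F = rng.standard_normal((p, d))
Phi_train = np.tanh(X_train @ F.T / np.sqrt(d))
Phi_test = np.tanh(X_test @ F.T / np.sqrt(d))

# Gradient descent on the linear readout a (only the second layer is trained)
a = np.zeros(p)
train_err, test_err = [], []
for t in range(steps):
    resid = Phi_train @ a - y_train
    a -= lr * Phi_train.T @ resid / n      # gradient of the mean squared error
    train_err.append(np.mean(resid ** 2))
    test_err.append(np.mean((Phi_test @ a - y_test) ** 2))

print(f"final train error: {train_err[-1]:.4f}")
print(f"final test error:  {test_err[-1]:.4f}")
```

Plotting `test_err` against `t` for several ratios p/n is how the epoch-wise descent structures discussed in the abstract would be observed; here we only record the two error curves.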
Related papers
- Grokking at the Edge of Linear Separability [1.024113475677323]
We analyze the long-time dynamics of logistic classification on a random feature model with a constant label.
We find that Grokking is amplified when classification is applied to training sets which are on the verge of linear separability.
arXiv Detail & Related papers (2024-10-06T14:08:42Z) - Scaling and renormalization in high-dimensional regression [72.59731158970894]
This paper presents a succinct derivation of the training and generalization performance of a variety of high-dimensional ridge regression models.
We provide an introduction and review of recent results on these topics, aimed at readers with backgrounds in physics and deep learning.
arXiv Detail & Related papers (2024-05-01T15:59:00Z) - A U-turn on Double Descent: Rethinking Parameter Counting in Statistical
Learning [68.76846801719095]
We show when and where double descent occurs, and that its location is not inherently tied to the interpolation threshold p=n.
This provides a resolution to tensions between double descent and statistical intuition.
arXiv Detail & Related papers (2023-10-29T12:05:39Z) - Gradient flow in the gaussian covariate model: exact solution of
learning curves and multiple descent structures [14.578025146641806]
We provide a full and unified analysis of the whole time-evolution of the generalization curve.
We show that our theoretical predictions adequately match the learning curves obtained by gradient descent over realistic datasets.
arXiv Detail & Related papers (2022-12-13T17:39:18Z) - Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z) - Optimization Variance: Exploring Generalization Properties of DNNs [83.78477167211315]
The test error of a deep neural network (DNN) often demonstrates double descent.
We propose a novel metric, optimization variance (OV), to measure the diversity of model updates.
arXiv Detail & Related papers (2021-06-03T09:34:17Z) - Asymptotics of Ridge Regression in Convolutional Models [26.910291664252973]
We derive exact formulae for estimation error of ridge estimators that hold in a certain high-dimensional regime.
We show the double descent phenomenon in our experiments for convolutional models and show that our theoretical results match the experiments.
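The double descent phenomenon mentioned above can be reproduced in a much simpler, non-convolutional setting (so this is only loosely related to that paper's model): a min-norm least-squares fit on random features, whose test error peaks near the interpolation threshold p = n. All dimensions and scalings below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, noise, trials = 40, 60, 0.5, 20   # input dim, samples, label noise, averaging

beta = rng.standard_normal(d) / np.sqrt(d)   # teacher vector, unit-scale signal

def avg_test_error(p):
    """Average test error of the min-norm least-squares fit with p random features."""
    errs = []
    for _ in range(trials):
        F = rng.standard_normal((p, d)) / np.sqrt(d)   # fixed random projection
        X = rng.standard_normal((n, d))
        y = X @ beta + noise * rng.standard_normal(n)
        Phi = np.tanh(X @ F.T)                         # nonlinear random features
        a = np.linalg.pinv(Phi) @ y                    # min-norm least-squares readout
        X_test = rng.standard_normal((1000, d))
        errs.append(np.mean((np.tanh(X_test @ F.T) @ a - X_test @ beta) ** 2))
    return float(np.mean(errs))

errors = {p: avg_test_error(p) for p in (10, 30, 60, 120, 240)}
for p, e in errors.items():
    print(f"p = {p:4d}  test error = {e:10.3f}")
```

Near p = n the smallest singular values of the feature matrix become tiny, so the pseudoinverse amplifies the label noise and the averaged test error spikes; it descends again as p grows past n.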
arXiv Detail & Related papers (2021-03-08T05:56:43Z) - A Bayesian Perspective on Training Speed and Model Selection [51.15664724311443]
We show that a measure of a model's training speed can be used to estimate its marginal likelihood.
We verify our results in model selection tasks for linear models and for the infinite-width limit of deep neural networks.
Our results suggest a promising new direction towards explaining why neural networks trained with gradient descent are biased towards functions that generalize well.
arXiv Detail & Related papers (2020-10-27T17:56:14Z) - The Neural Tangent Kernel in High Dimensions: Triple Descent and a
Multi-Scale Theory of Generalization [34.235007566913396]
Modern deep learning models employ considerably more parameters than required to fit the training data. Whereas conventional statistical wisdom suggests such models should drastically overfit, in practice these models generalize remarkably well.
An emerging paradigm for describing this unexpected behavior is in terms of a "double descent" curve.
We provide a precise high-dimensional analysis of generalization with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks with gradient descent.
arXiv Detail & Related papers (2020-08-15T20:55:40Z) - Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z) - Generalization Error of Generalized Linear Models in High Dimensions [25.635225717360466]
We provide a framework to characterize neural networks with arbitrary non-linearities.
We analyze the effect of regularization on learning, with regularized logistic regression as an example.
Our model also captures mismatch between the training and test distributions as special cases.
arXiv Detail & Related papers (2020-05-01T02:17:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.