Model, sample, and epoch-wise descents: exact solution of gradient flow
in the random feature model
- URL: http://arxiv.org/abs/2110.11805v1
- Date: Fri, 22 Oct 2021 14:25:54 GMT
- Title: Model, sample, and epoch-wise descents: exact solution of gradient flow
in the random feature model
- Authors: Antoine Bodin and Nicolas Macris
- Abstract summary: We analyze the whole temporal behavior of the generalization and training errors under gradient flow.
We show that in the limit of large system size the full time-evolution path of both errors can be calculated analytically.
Our techniques are based on Cauchy complex integral representations of the errors together with recent random matrix methods based on linear pencils.
- Score: 16.067228939231047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent evidence has shown the existence of a so-called double-descent and
even triple-descent behavior for the generalization error of deep-learning
models. This important phenomenon commonly appears in implemented neural
network architectures, and also seems to emerge in epoch-wise curves during the
training process. A recent line of research has highlighted that random matrix
tools can be used to obtain precise analytical asymptotics of the
generalization (and training) errors of the random feature model. In this
contribution, we analyze the whole temporal behavior of the generalization and
training errors under gradient flow for the random feature model. We show that
in the asymptotic limit of large system size the full time-evolution path of
both errors can be calculated analytically. This allows us to observe how the
double and triple descents develop over time, if and when early stopping is an
option, and also observe time-wise descent structures. Our techniques are based
on Cauchy complex integral representations of the errors together with recent
random matrix methods based on linear pencils.
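The setting of the paper can be illustrated with a small numerical sketch (not the paper's analytical solution): a random feature model whose readout weights are trained by gradient descent on squared loss, a discretization of gradient flow, with the training and test errors tracked over training time. The dimensions, the tanh nonlinearity, and all hyperparameters below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions (small for a quick illustration; the paper's theory is asymptotic)
d, p, n = 50, 100, 80           # input dim, number of random features, samples
noise = 0.3                     # label noise level
lr, steps = 0.05, 2000          # gradient-descent step size and iterations

# Teacher: noisy linear target y = <beta, x>/sqrt(d) + noise
beta = rng.standard_normal(d)
X_train = rng.standard_normal((n, d))
X_test = rng.standard_normal((2000, d))
y_train = X_train @ beta / np.sqrt(d) + noise * rng.standard_normal(n)
y_test = X_test @ beta / np.sqrt(d)

# Random feature map phi(x) = tanh(F x / sqrt(d)); F is drawn once and frozen
F = rng.standard_normal((p, d))
Phi_train = np.tanh(X_train @ F.T / np.sqrt(d))
Phi_test = np.tanh(X_test @ F.T / np.sqrt(d))

# Gradient descent on the linear readout a (only the second layer is trained)
a = np.zeros(p)
train_err, test_err = [], []
for t in range(steps):
    resid = Phi_train @ a - y_train
    a -= lr * Phi_train.T @ resid / n      # gradient of the mean squared error
    train_err.append(np.mean(resid ** 2))
    test_err.append(np.mean((Phi_test @ a - y_test) ** 2))

print(f"final train error: {train_err[-1]:.4f}")
print(f"final test error:  {test_err[-1]:.4f}")
```

Plotting `test_err` against `t` for several ratios p/n is how the epoch-wise descent structures discussed in the abstract would be observed; here we only record the two error curves.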
Related papers
- Grokking at the Edge of Linear Separability [1.024113475677323]
We analyze the long-time dynamics of logistic classification on a random feature model with a constant label.
We find that Grokking is amplified when classification is applied to training sets which are on the verge of linear separability.
arXiv Detail & Related papers (2024-10-06T14:08:42Z) - Scaling and renormalization in high-dimensional regression [72.59731158970894]
This paper presents a succinct derivation of the training and generalization performance of a variety of high-dimensional ridge regression models.
We provide an introduction and review of recent results on these topics, aimed at readers with backgrounds in physics and deep learning.
arXiv Detail & Related papers (2024-05-01T15:59:00Z) - A U-turn on Double Descent: Rethinking Parameter Counting in Statistical
Learning [68.76846801719095]
We show when and where double descent occurs, and that its location is not inherently tied to the interpolation threshold p=n.
This provides a resolution to tensions between double descent and statistical intuition.
arXiv Detail & Related papers (2023-10-29T12:05:39Z) - Gradient flow in the gaussian covariate model: exact solution of
learning curves and multiple descent structures [14.578025146641806]
We provide a full and unified analysis of the whole time-evolution of the generalization curve.
We show that our theoretical predictions adequately match the learning curves obtained by gradient descent over realistic datasets.
arXiv Detail & Related papers (2022-12-13T17:39:18Z) - Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z) - Optimization Variance: Exploring Generalization Properties of DNNs [83.78477167211315]
The test error of a deep neural network (DNN) often demonstrates double descent.
We propose a novel metric, optimization variance (OV), to measure the diversity of model updates.
arXiv Detail & Related papers (2021-06-03T09:34:17Z) - Asymptotics of Ridge Regression in Convolutional Models [26.910291664252973]
We derive exact formulae for estimation error of ridge estimators that hold in a certain high-dimensional regime.
We show the double descent phenomenon in our experiments for convolutional models and show that our theoretical results match the experiments.
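The double descent phenomenon mentioned above can be reproduced in a much simpler, non-convolutional setting (so this is only loosely related to that paper's model): a min-norm least-squares fit on random features, whose test error peaks near the interpolation threshold p = n. All dimensions and scalings below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, noise, trials = 40, 60, 0.5, 20   # input dim, samples, label noise, averaging

beta = rng.standard_normal(d) / np.sqrt(d)   # teacher vector, unit-scale signal

def avg_test_error(p):
    """Average test error of the min-norm least-squares fit with p random features."""
    errs = []
    for _ in range(trials):
        F = rng.standard_normal((p, d)) / np.sqrt(d)   # fixed random projection
        X = rng.standard_normal((n, d))
        y = X @ beta + noise * rng.standard_normal(n)
        Phi = np.tanh(X @ F.T)                         # nonlinear random features
        a = np.linalg.pinv(Phi) @ y                    # min-norm least-squares readout
        X_test = rng.standard_normal((1000, d))
        errs.append(np.mean((np.tanh(X_test @ F.T) @ a - X_test @ beta) ** 2))
    return float(np.mean(errs))

errors = {p: avg_test_error(p) for p in (10, 30, 60, 120, 240)}
for p, e in errors.items():
    print(f"p = {p:4d}  test error = {e:10.3f}")
```

Near p = n the smallest singular values of the feature matrix become tiny, so the pseudoinverse amplifies the label noise and the averaged test error spikes; it descends again as p grows past n.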
arXiv Detail & Related papers (2021-03-08T05:56:43Z) - A Bayesian Perspective on Training Speed and Model Selection [51.15664724311443]
We show that a measure of a model's training speed can be used to estimate its marginal likelihood.
We verify our results in model selection tasks for linear models and for the infinite-width limit of deep neural networks.
Our results suggest a promising new direction towards explaining why neural networks trained with gradient descent are biased towards functions that generalize well.
arXiv Detail & Related papers (2020-10-27T17:56:14Z) - The Neural Tangent Kernel in High Dimensions: Triple Descent and a
Multi-Scale Theory of Generalization [34.235007566913396]
Modern deep learning models employ considerably more parameters than required to fit the training data. Whereas conventional statistical wisdom suggests such models should drastically overfit, in practice these models generalize remarkably well.
An emerging paradigm for describing this unexpected behavior is in terms of a "double descent" curve.
We provide a precise high-dimensional analysis of generalization with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks with gradient descent.
arXiv Detail & Related papers (2020-08-15T20:55:40Z) - Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z) - Generalization Error of Generalized Linear Models in High Dimensions [25.635225717360466]
We provide a framework to characterize neural networks with arbitrary non-linearities.
We analyze the effect of regularization on learning, with regularized logistic regression as an example.
Our model also captures mismatch between the training and test distributions as special cases.
arXiv Detail & Related papers (2020-05-01T02:17:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.