Gradient flow in the gaussian covariate model: exact solution of
learning curves and multiple descent structures
- URL: http://arxiv.org/abs/2212.06757v1
- Date: Tue, 13 Dec 2022 17:39:18 GMT
- Title: Gradient flow in the gaussian covariate model: exact solution of
learning curves and multiple descent structures
- Authors: Antoine Bodin, Nicolas Macris
- Abstract summary: We provide a full and unified analysis of the whole time-evolution of the generalization curve.
We show that our theoretical predictions adequately match the learning curves obtained by gradient descent over realistic datasets.
- Score: 14.578025146641806
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A recent line of work has shown remarkable behaviors of the generalization
error curves in simple learning models. Even the least-squares regression has
shown atypical features such as the model-wise double descent, and further
works have observed triple or multiple descents. Other important
characteristics are the epoch-wise descent structures which emerge during
training. The observations of model-wise and epoch-wise descents have been
analytically derived in limited theoretical settings (such as the random
feature model) and are otherwise experimental. In this work, we provide a full
and unified analysis of the whole time-evolution of the generalization curve,
in the asymptotic large-dimensional regime and under gradient-flow, within a
wider theoretical setting stemming from a gaussian covariate model. In
particular, we cover most cases already disparately observed in the literature,
and also provide examples of the existence of multiple descent structures as a
function of a model parameter or time. Furthermore, we show that our
theoretical predictions adequately match the learning curves obtained by
gradient descent over realistic datasets. Technically we compute averages of
rational expressions involving random matrices using recent developments in
random matrix theory based on "linear pencils". Another contribution, which is
also of independent interest in random matrix theory, is a new derivation of
related fixed point equations (and an extension thereof) using Dyson Brownian
motions.
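As an illustration of the epoch-wise and model-wise curves discussed above, here is a minimal numerical sketch on plain least squares with isotropic Gaussian data (not the paper's gaussian covariate model; the function name `gradient_flow_test_error` and all parameter choices are ours). Gradient flow on the squared loss has a closed-form solution through the SVD of the data matrix, so the whole time evolution of the test error can be traced directly:
```python
# Minimal sketch: closed-form gradient flow for least squares on isotropic
# Gaussian data. Illustrative only -- not the paper's gaussian covariate model.
import numpy as np

def gradient_flow_test_error(n, p, t_grid, noise=0.5, seed=0):
    """Population test error of gradient flow on L(w) = ||Xw - y||^2 / (2n), w(0) = 0."""
    rng = np.random.default_rng(seed)
    w_star = rng.normal(size=p) / np.sqrt(p)          # teacher vector
    X = rng.normal(size=(n, p))                       # isotropic Gaussian inputs
    y = X @ w_star + noise * rng.normal(size=n)

    # In the SVD basis X / sqrt(n) = U diag(s) V^T, each coordinate of w(t) obeys
    # dc/dt = -s^2 c + b, hence c(t) = b (1 - exp(-s^2 t)) / s^2.
    _, s, Vt = np.linalg.svd(X / np.sqrt(n), full_matrices=False)
    b = Vt @ (X.T @ y / n)
    errs = []
    for t in t_grid:
        gain = np.where(s > 1e-12,
                        (1.0 - np.exp(-(s ** 2) * t)) / np.maximum(s ** 2, 1e-24),
                        0.0)
        w_t = Vt.T @ (gain * b)
        errs.append(np.sum((w_t - w_star) ** 2) + noise ** 2)  # isotropic population risk
    return np.array(errs)

t_grid = np.logspace(-1, 4, 60)
epoch_curve = gradient_flow_test_error(n=300, p=600, t_grid=t_grid)   # error vs. time
model_curve = [gradient_flow_test_error(300, p, [1e4])[0]             # error vs. model size
               for p in range(50, 1200, 50)]
```
In this simplified setting `model_curve` typically peaks near p = n (the classical model-wise double descent), while `epoch_curve` can be non-monotone in time depending on the noise level and the ratio p/n; the paper above derives such curves exactly, and in a much richer covariate model.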
Related papers
- Scaling and renormalization in high-dimensional regression [72.59731158970894]
This paper presents a succinct derivation of the training and generalization performance of a variety of high-dimensional ridge regression models.
We provide an introduction and review of recent results on these topics, aimed at readers with backgrounds in physics and deep learning.
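For orientation on what such closed-form performance results look like, the standard isotropic, ridgeless special case from the broader literature (stated here as background, not as a result of the paper above) gives the asymptotic excess test risk of minimum-norm least squares with $n$ samples, $p$ i.i.d. Gaussian features, signal norm $\lVert\beta\rVert^2$, noise variance $\sigma^2$ and overparameterization ratio $\gamma = p/n$ as
$$
R_{\mathrm{excess}}(\gamma) \;=\;
\begin{cases}
\sigma^{2}\,\dfrac{\gamma}{1-\gamma}, & \gamma < 1,\\[2mm]
\lVert\beta\rVert^{2}\left(1-\dfrac{1}{\gamma}\right) + \dfrac{\sigma^{2}}{\gamma-1}, & \gamma > 1,
\end{cases}
$$
whose divergence at the interpolation threshold $\gamma = 1$ is the model-wise double-descent peak that ridge regularization smooths out.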
arXiv Detail & Related papers (2024-05-01T15:59:00Z) - A U-turn on Double Descent: Rethinking Parameter Counting in Statistical
Learning [68.76846801719095]
We show that the second descent appears exactly when and where a transition between distinct underlying complexity axes occurs, and that its location is therefore not inherently tied to the interpolation threshold p=n.
This provides a resolution to tensions between double descent and statistical intuition.
arXiv Detail & Related papers (2023-10-29T12:05:39Z) - Probabilistic Unrolling: Scalable, Inverse-Free Maximum Likelihood
Estimation for Latent Gaussian Models [69.22568644711113]
We introduce probabilistic unrolling, a method that combines Monte Carlo sampling with iterative linear solvers to circumvent matrix inversions.
Our theoretical analyses reveal that unrolling and backpropagation through the iterations of the solver can accelerate gradient estimation for maximum likelihood estimation.
In experiments on simulated and real data, we demonstrate that probabilistic unrolling learns latent Gaussian models up to an order of magnitude faster than gradient EM, with minimal losses in model performance.
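The core ingredient named in this summary, Monte Carlo sampling combined with an iterative linear solver in place of explicit matrix inversion, can be sketched generically as follows (a stand-in using Hutchinson trace probes and conjugate gradients for the gradient of a Gaussian log-likelihood; it is not the authors' probabilistic-unrolling algorithm, and the helper names are ours):
```python
# Generic sketch: Monte Carlo probes + an iterative solver replace explicit
# inversions in the gradient of a Gaussian log-likelihood. Not the authors'
# probabilistic-unrolling algorithm, just a familiar stand-in for the same idea.
import numpy as np
from scipy.sparse.linalg import cg

def gaussian_loglik_grad(K, dK, y, n_probes=32, seed=0):
    """Estimate d/dtheta of log N(y | 0, K(theta)), with dK = dK/dtheta,
    using Hutchinson trace probes and conjugate-gradient solves (no inverses)."""
    rng = np.random.default_rng(seed)
    n = len(y)

    def solve(b):                              # K^{-1} b via CG; K assumed SPD
        x, info = cg(K, b)
        if info != 0:
            raise RuntimeError("CG did not converge")
        return x

    # tr(K^{-1} dK) ~ mean over Rademacher probes z of z^T K^{-1} (dK z)
    trace_est = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=n)
        trace_est += z @ solve(dK @ z)
    trace_est /= n_probes

    alpha = solve(y)                           # K^{-1} y
    # d/dtheta of [ -0.5 y^T K^{-1} y - 0.5 log det K ]
    return 0.5 * (alpha @ (dK @ alpha) - trace_est)
```
For a covariance family such as $K(\theta) = \theta K_0 + \sigma^2 I$ one would pass $K_0$ as `dK`; per the summary, the paper additionally unrolls and backpropagates through the solver iterations themselves, which this plain estimator does not do.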
arXiv Detail & Related papers (2023-06-05T21:08:34Z) - Model, sample, and epoch-wise descents: exact solution of gradient flow
in the random feature model [16.067228939231047]
We analyze the whole temporal behavior of the generalization and training errors under gradient flow.
We show that in the limit of large system size the full time-evolution path of both errors can be calculated analytically.
Our techniques are based on Cauchy complex integral representations of the errors together with recent random matrix methods based on linear pencils.
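The Cauchy-integral device mentioned here is, in generic form (standard complex analysis, not the paper's specific construction), the contour representation of a matrix function: for a contour $\Gamma$ enclosing the spectrum of a symmetric matrix $H$,
$$
f(H) \;=\; \frac{1}{2\pi i}\oint_{\Gamma} f(z)\,(zI - H)^{-1}\,dz,
$$
so averages of spectral quantities such as $\mathbb{E}\,\operatorname{tr} f(H)$ reduce to contour integrals of the averaged resolvent $\mathbb{E}\,\operatorname{tr}(zI - H)^{-1}$, which is the kind of object that linear-pencil random matrix methods are built to evaluate.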
arXiv Detail & Related papers (2021-10-22T14:25:54Z) - On the Role of Optimization in Double Descent: A Least Squares Study [30.44215064390409]
We show an excess risk bound for the gradient descent solution of the least squares objective.
We find that in the case of noiseless regression, double descent is explained solely by optimization-related quantities.
We empirically explore if our predictions hold for neural networks.
arXiv Detail & Related papers (2021-07-27T09:13:11Z) - Asymptotics of Ridge Regression in Convolutional Models [26.910291664252973]
We derive exact formulae for estimation error of ridge estimators that hold in a certain high-dimensional regime.
We show the double descent phenomenon in our experiments for convolutional models and show that our theoretical results match the experiments.
arXiv Detail & Related papers (2021-03-08T05:56:43Z) - Hessian Eigenspectra of More Realistic Nonlinear Models [73.31363313577941]
We give a precise characterization of the Hessian eigenspectra for a broad family of nonlinear models.
Our analysis takes a step forward to identify the origin of many striking features observed in more complex machine learning models.
arXiv Detail & Related papers (2021-03-02T06:59:52Z) - Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, but the precise role of its stochasticity in that success remains unclear.
We show that multiplicative noise, as it commonly arises due to variance in local rates of convergence, leads to heavy-tailed behaviour of the parameters.
A detailed analysis describes how key factors, including step size and data properties, shape this behaviour, with similar results exhibited on state-of-the-art neural network models.
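The mechanism this summary points to, a multiplicative (state-dependent) noise factor producing heavy-tailed parameter fluctuations, can be seen in a toy simulation (our own one-dimensional construction, not the paper's analysis): single-sample SGD on linear regression is a random linear recurrence, and its stationary distribution can be heavy-tailed even with Gaussian inputs and noise.
```python
# Toy sketch (our own construction, not the paper's setting): one-dimensional
# SGD on linear regression is the random linear recurrence
#   w_{k+1} = (1 - eta * x_k^2) * w_k + eta * x_k * noise_k.
# The multiplicative factor (1 - eta * x_k^2) occasionally exceeds 1 in magnitude,
# which can make the stationary distribution of w heavy-tailed.
import numpy as np

rng = np.random.default_rng(0)
eta, steps, burn_in = 0.6, 200_000, 10_000
x = rng.normal(size=steps)
noise = rng.normal(size=steps)

w, samples = 0.0, []
for k in range(steps):
    w = (1.0 - eta * x[k] ** 2) * w + eta * x[k] * noise[k]   # one SGD step
    if k >= burn_in:
        samples.append(abs(w))

samples = np.sort(np.array(samples))
k_tail = len(samples) // 100                  # largest 1% of samples
tail = samples[-k_tail:]
# Rough Hill estimate of the tail index (smaller index = heavier tails):
hill_alpha = 1.0 / np.mean(np.log(tail / tail[0]))
print(f"estimated tail index ~ {hill_alpha:.2f}")
```
The printed tail-index estimate quantifies the heavy tail generated purely by the multiplicative term; with a smaller step size the tails lighten, mirroring the role of step size highlighted in the summary.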
arXiv Detail & Related papers (2020-06-11T09:58:01Z) - Dimension Independent Generalization Error by Stochastic Gradient
Descent [12.474236773219067]
We present a theory on the generalization error of stochastic gradient descent (SGD) solutions for both convex and locally convex loss functions.
We show that the generalization error either does not depend on the ambient dimension $p$ at all, or depends on it only through a logarithmic factor of a low effective dimension.
arXiv Detail & Related papers (2020-03-25T03:08:41Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)