The Interpolation Phase Transition in Neural Networks: Memorization and
Generalization under Lazy Training
- URL: http://arxiv.org/abs/2007.12826v3
- Date: Thu, 9 Jun 2022 01:25:38 GMT
- Title: The Interpolation Phase Transition in Neural Networks: Memorization and
Generalization under Lazy Training
- Authors: Andrea Montanari and Yiqiao Zhong
- Abstract summary: We study memorization and generalization in the context of two-layer neural networks in the neural tangent (NT) regime.
We prove that as soon as $Nd \gg n$, the test error is well approximated by that of kernel ridge regression with respect to the infinite-width kernel.
The latter is in turn well approximated by the error of polynomial ridge regression, whereby the regularization parameter is increased by a `self-induced' term related to the high-degree components of the activation function.
- Score: 10.72393527290646
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern neural networks are often operated in a strongly overparametrized
regime: they comprise so many parameters that they can interpolate the training
set, even if actual labels are replaced by purely random ones. Despite this,
they achieve good prediction error on unseen data: interpolating the training
set does not lead to a large generalization error. Further, overparametrization
appears to be beneficial in that it simplifies the optimization landscape. Here
we study these phenomena in the context of two-layer neural networks in the
neural tangent (NT) regime. We consider a simple data model, with isotropic
covariate vectors in $d$ dimensions, and $N$ hidden neurons. We assume that
both the sample size $n$ and the dimension $d$ are large, and they are
polynomially related. Our first main result is a characterization of the
eigenstructure of the empirical NT kernel in the overparametrized regime $Nd\gg
n$. This characterization implies as a corollary that the minimum eigenvalue of
the empirical NT kernel is bounded away from zero as soon as $Nd\gg n$, and
therefore the network can exactly interpolate arbitrary labels in the same
regime.
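For intuition, here is a minimal numerical sketch of the objects in this first result: the empirical NT (first-layer) feature map of a two-layer ReLU network at random initialization, the resulting empirical NT kernel, its minimum eigenvalue, and exact interpolation of random labels once that eigenvalue is bounded away from zero. The data model, normalizations, and all variable names are assumptions made for this illustration; this is not the paper's code.

```python
# Illustrative sketch (not the paper's code): empirical NT kernel of a
# two-layer ReLU network at random initialization, in the regime Nd >> n.
import numpy as np

rng = np.random.default_rng(0)
n, d, N = 200, 30, 1000                        # samples, input dim, hidden neurons
X = rng.standard_normal((n, d)) / np.sqrt(d)   # isotropic covariates
y = rng.choice([-1.0, 1.0], size=n)            # arbitrary (random) labels

# First-layer weights at initialization; the NT linearization considered here
# only moves these weights.
W = rng.standard_normal((N, d))

# NT feature map: phi(x) = N^{-1/2} [ 1{<w_i, x> > 0} * x ]_{i <= N} in R^{Nd}
# (ReLU derivative times the input, one block per neuron).
act = (X @ W.T > 0).astype(float)                          # n x N
Phi = (act[:, :, None] * X[:, None, :]).reshape(n, N * d) / np.sqrt(N)

# Empirical NT kernel (Gram matrix) and its smallest eigenvalue.
K = Phi @ Phi.T                                            # n x n
print("lambda_min(K) =", np.linalg.eigvalsh(K)[0])         # stays away from 0

# With lambda_min(K) > 0, the linearized model interpolates the labels
# exactly: solve K a = y and check the residual.
a = np.linalg.solve(K, y)
print("max |K a - y| =", np.abs(K @ a - y).max())          # ~ numerical zero
```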
Our second main result is a characterization of the generalization error of
NT ridge regression including, as a special case, min-$\ell_2$ norm
interpolation. We prove that, as soon as $Nd\gg n$, the test error is well
approximated by the one of kernel ridge regression with respect to the
infinite-width kernel. The latter is in turn well approximated by the error of
polynomial ridge regression, whereby the regularization parameter is increased
by a `self-induced' term related to the high-degree components of the
activation function. The polynomial degree depends on the sample size and the
dimension (in particular on $\log n/\log d$).
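Schematically, and with notation assumed here for illustration rather than taken from the paper, the second result says that the test error of NT ridge regression is tracked by kernel ridge regression (KRR) with the infinite-width NT kernel, which is in turn tracked by polynomial ridge regression with an enlarged, `self-induced' regularization:

```latex
% Schematic only: R_test, K_infty, ell, gamma_{>ell}, lambda_eff are
% illustrative notation, not the paper's.
\[
  R_{\mathrm{test}}\bigl(\text{NT ridge},\ \lambda\bigr)
  \;\approx\;
  R_{\mathrm{test}}\bigl(\text{KRR with } K_\infty,\ \lambda\bigr)
  \;\approx\;
  R_{\mathrm{test}}\bigl(\text{degree-}\ell\text{ polynomial ridge},\ \lambda_{\mathrm{eff}}\bigr),
  \qquad
  \lambda_{\mathrm{eff}} \;=\; \lambda + \gamma_{>\ell} .
\]
```

Here $\gamma_{>\ell}$ stands for the contribution of the degree-$>\ell$ components of the activation function, the degree $\ell$ is set by $\log n/\log d$, and letting $\lambda \to 0$ covers min-$\ell_2$ norm interpolation as a special case.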
Related papers
- Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space for modeling functions learned by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - Effective Minkowski Dimension of Deep Nonparametric Regression: Function
Approximation and Statistical Theories [70.90012822736988]
Existing theories on deep nonparametric regression have shown that when the input data lie on a low-dimensional manifold, deep neural networks can adapt to intrinsic data structures.
This paper introduces a relaxed assumption that the input data are concentrated around a subset of $\mathbb{R}^d$ denoted by $\mathcal{S}$, and the intrinsic dimension of $\mathcal{S}$ can be characterized by a new complexity notion -- the effective Minkowski dimension.
arXiv Detail & Related papers (2023-06-26T17:13:31Z) - Generalization and Stability of Interpolating Neural Networks with
Minimal Width [37.908159361149835]
We investigate the generalization and optimization of shallow neural networks trained by gradient descent in the interpolating regime.
We prove that the training loss is minimized with $m=\Omega(\log^4(n))$ neurons and $T\approx n$ iterations.
With $m=\Omega(\log^4(n))$ neurons and $T\approx n$, we bound the test loss by $\tilde{O}(1/\dots)$.
arXiv Detail & Related papers (2023-02-18T05:06:15Z) - The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich
Regimes [75.59720049837459]
We study the transition from infinite-width behavior to this variance-limited regime as a function of sample size $P$ and network width $N$.
We find that finite-size effects can become relevant for very small datasets, on the order of $P^* \sim \sqrt{N}$, for regression with ReLU networks.
arXiv Detail & Related papers (2022-12-23T04:48:04Z) - Bounding the Width of Neural Networks via Coupled Initialization -- A
Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z) - Locality defeats the curse of dimensionality in convolutional
teacher-student scenarios [69.2027612631023]
We show that locality is key in determining the learning curve exponent $\beta$.
We conclude by proving, using a natural assumption, that performing kernel regression with a ridge that decreases with the size of the training set leads to similar learning curve exponents to those we obtain in the ridgeless case.
arXiv Detail & Related papers (2021-06-16T08:27:31Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Fundamental tradeoffs between memorization and robustness in random
features and neural tangent regimes [15.76663241036412]
We prove, for a large class of activation functions, that if the model memorizes even a fraction of the training set, then its Sobolev seminorm is lower-bounded.
Experiments reveal for the first time a multiple-descent phenomenon in the robustness of the min-norm interpolator.
arXiv Detail & Related papers (2021-06-04T17:52:50Z) - On the Generalization Power of Overfitted Two-Layer Neural Tangent
Kernel Models [42.72822331030195]
We study min-$\ell_2$-norm overfitting solutions for the neural tangent kernel (NTK) model of a two-layer neural network.
We show that, depending on the ground-truth function, the test error of overfitted NTK models exhibits characteristics different from the "double-descent" phenomenon.
For functions outside of this class, we provide a lower bound on the generalization error that does not diminish to zero even when $n$ and $p$ are both large.
arXiv Detail & Related papers (2021-03-09T06:24:59Z) - Large-time asymptotics in deep learning [0.0]
We consider the impact of the final time $T$ (which may indicate the depth of a corresponding ResNet) in training.
For the classical $L^2$-regularized empirical risk minimization problem, we show that the training error is at most of the order $\mathcal{O}\left(\frac{1}{T}\right)$.
In the setting of $\ell^p$-distance losses, we prove that both the training error and the optimal parameters are at most of the order $\mathcal{O}\left(e^{-\mu \cdots}\right)$
arXiv Detail & Related papers (2020-08-06T07:33:17Z)