The Neural Tangent Kernel in High Dimensions: Triple Descent and a
Multi-Scale Theory of Generalization
- URL: http://arxiv.org/abs/2008.06786v1
- Date: Sat, 15 Aug 2020 20:55:40 GMT
- Title: The Neural Tangent Kernel in High Dimensions: Triple Descent and a
Multi-Scale Theory of Generalization
- Authors: Ben Adlam and Jeffrey Pennington
- Abstract summary: Modern deep learning models employ considerably more parameters than required to fit the training data. Whereas conventional statistical wisdom suggests such models should drastically overfit, in practice these models generalize remarkably well.
An emerging paradigm for describing this unexpected behavior is in terms of a \emph{double descent} curve.
We provide a precise high-dimensional analysis of generalization with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks with gradient descent.
- Score: 34.235007566913396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern deep learning models employ considerably more parameters than required
to fit the training data. Whereas conventional statistical wisdom suggests such
models should drastically overfit, in practice these models generalize
remarkably well. An emerging paradigm for describing this unexpected behavior
is in terms of a \emph{double descent} curve, in which increasing a model's
capacity causes its test error to first decrease, then increase to a maximum
near the interpolation threshold, and then decrease again in the
overparameterized regime. Recent efforts to explain this phenomenon
theoretically have focused on simple settings, such as linear regression or
kernel regression with unstructured random features, which we argue are too
coarse to reveal important nuances of actual neural networks. We provide a
precise high-dimensional asymptotic analysis of generalization under kernel
regression with the Neural Tangent Kernel, which characterizes the behavior of
wide neural networks optimized with gradient descent. Our results reveal that
the test error has non-monotonic behavior deep in the overparameterized regime
and can even exhibit additional peaks and descents when the number of
parameters scales quadratically with the dataset size.
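The analysis above concerns kernel regression with the Neural Tangent Kernel in a high-dimensional asymptotic limit. As a concrete finite-size illustration, rather than the authors' code, the sketch below computes the empirical NTK of a small two-layer ReLU network from parameter gradients and uses it for kernel ridge regression; the width, ridge strength, and toy data are arbitrary choices made for this example.

```python
# A minimal sketch (not the paper's code): empirical NTK of a two-layer ReLU
# network computed from parameter gradients, then used for kernel ridge
# regression. Width, ridge strength, and the toy data are illustrative choices.
import jax
import jax.numpy as jnp

def init_params(key, d, width):
    kw, ka = jax.random.split(key)
    return {
        "W": jax.random.normal(kw, (width, d)) / jnp.sqrt(d),    # first layer
        "a": jax.random.normal(ka, (width,)) / jnp.sqrt(width),  # readout layer
    }

def f(params, x):
    # Scalar output of a two-layer ReLU network.
    return params["a"] @ jax.nn.relu(params["W"] @ x)

def empirical_ntk(params, X):
    # K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)> at the current parameters.
    per_example_grads = jax.vmap(lambda x: jax.grad(f)(params, x))(X)
    flat = jnp.concatenate(
        [g.reshape(X.shape[0], -1) for g in jax.tree_util.tree_leaves(per_example_grads)],
        axis=1,
    )
    return flat @ flat.T

key = jax.random.PRNGKey(0)
k_params, k_x, k_noise = jax.random.split(key, 3)
d, width, n = 10, 512, 100
params = init_params(k_params, d, width)

X = jax.random.normal(k_x, (n, d)) / jnp.sqrt(d)           # toy inputs
y = X[:, 0] + 0.1 * jax.random.normal(k_noise, (n,))       # noisy linear targets

K = empirical_ntk(params, X)
ridge = 1e-3
alpha = jnp.linalg.solve(K + ridge * jnp.eye(n), y)        # kernel ridge coefficients
# For a test point x_star, the prediction is k(x_star, X) @ alpha, where
# k(x_star, x_i) = <grad_theta f(x_star), grad_theta f(x_i)>.
```

A finite simulation like this only gestures at the setting of the paper, whose results are asymptotic, with the dataset size, input dimension, and network width taken large together.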
Related papers
- Scaling and renormalization in high-dimensional regression [72.59731158970894]
This paper presents a succinct derivation of the training and generalization performance of a variety of high-dimensional ridge regression models.
We provide an introduction and review of recent results on these topics, aimed at readers with backgrounds in physics and deep learning. (The basic ridge-regression setup these results concern is written out after this list.)
arXiv Detail & Related papers (2024-05-01T15:59:00Z)
- On the Asymptotic Learning Curves of Kernel Ridge Regression under Power-law Decay [17.306230523610864]
We show that the 'benign overfitting phenomenon' exists in very wide neural networks only when the noise level is small.
arXiv Detail & Related papers (2023-09-23T11:18:13Z)
- Learning Low Dimensional State Spaces with Overparameterized Recurrent Neural Nets [57.06026574261203]
We provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory.
Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
arXiv Detail & Related papers (2022-10-25T14:45:15Z)
- Second-order regression models exhibit progressive sharpening to the edge of stability [30.92413051155244]
We show that for quadratic objectives in two dimensions, a second-order regression model exhibits progressive sharpening towards a value that differs slightly from the edge of stability.
In higher dimensions, the model generically shows similar behavior, even without the specific structure of a neural network.
arXiv Detail & Related papers (2022-10-10T17:21:20Z)
- The Asymmetric Maximum Margin Bias of Quasi-Homogeneous Neural Networks [26.58848653965855]
We introduce the class of quasi-homogeneous models, which is expressive enough to describe nearly all neural networks with homogeneous activations.
We find that gradient flow implicitly favors a subset of the parameters, unlike in the case of a homogeneous model where all parameters are treated equally.
arXiv Detail & Related papers (2022-10-07T21:14:09Z)
- Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z)
- Nonasymptotic theory for two-layer neural networks: Beyond the bias-variance trade-off [10.182922771556742]
We present a nonasymptotic generalization theory for two-layer neural networks with ReLU activation function.
We show that overparametrized random feature models suffer from the curse of dimensionality and thus are suboptimal.
arXiv Detail & Related papers (2021-06-09T03:52:18Z)
- Provable Benefits of Overparameterization in Model Compression: From Double Descent to Pruning Neural Networks [38.153825455980645]
Recent empirical evidence indicates that the practice of overparameterization not only benefits training large models, but also assists, perhaps counterintuitively, in building lightweight models.
This paper sheds light on these empirical findings by theoretically characterizing the high-dimensional asymptotics of model pruning.
We analytically identify regimes in which, even if the location of the most informative features is known, we are better off fitting a large model and then pruning.
arXiv Detail & Related papers (2020-12-16T05:13:30Z)
- A Bayesian Perspective on Training Speed and Model Selection [51.15664724311443]
We show that a measure of a model's training speed can be used to estimate its marginal likelihood (a decomposition illustrating this connection is sketched after this list).
We verify our results in model selection tasks for linear models and for the infinite-width limit of deep neural networks.
Our results suggest a promising new direction towards explaining why neural networks trained with gradient descent are biased towards functions that generalize well.
arXiv Detail & Related papers (2020-10-27T17:56:14Z)
- Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
- A Generalized Neural Tangent Kernel Analysis for Two-layer Neural Networks [87.23360438947114]
We show that noisy gradient descent with weight decay can still exhibit a "kernel-like" behavior.
This implies that the training loss converges linearly up to a certain accuracy.
We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay.
arXiv Detail & Related papers (2020-02-10T18:56:15Z)
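As referenced in the ridge-regression entry above ("Scaling and renormalization in high-dimensional regression"), a minimal statement of the standard high-dimensional ridge setup, written here for unstructured features rather than the paper's more general models: with data matrix $X \in \mathbb{R}^{n \times d}$, targets $y \in \mathbb{R}^{n}$, and ridge parameter $\lambda > 0$,
\[
\hat{\beta}_\lambda = (X^\top X + \lambda I_d)^{-1} X^\top y,
\qquad
R(\lambda) = \mathbb{E}_{x, y}\big[(y - x^\top \hat{\beta}_\lambda)^2\big],
\]
and the training and generalization performance are characterized as $n, d \to \infty$ with $d/n$ held fixed.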
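For the training-speed entry above ("A Bayesian Perspective on Training Speed and Model Selection"), one decomposition that illustrates the connection is the chain rule for the marginal likelihood, stated here as intuition rather than the paper's exact estimator:
\[
\log p(y_1, \dots, y_n \mid X) = \sum_{i=1}^{n} \log p(y_i \mid y_1, \dots, y_{i-1}, X),
\]
i.e., the log marginal likelihood is a sum of log posterior-predictive terms, each resembling the loss incurred on a new example after conditioning on the previous ones, so a model that fits new data quickly accumulates a larger sum.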
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.