Limitations of the NTK for Understanding Generalization in Deep Learning
- URL: http://arxiv.org/abs/2206.10012v1
- Date: Mon, 20 Jun 2022 21:23:28 GMT
- Title: Limitations of the NTK for Understanding Generalization in Deep Learning
- Authors: Nikhil Vyas, Yamini Bansal, Preetum Nakkiran
- Abstract summary: We study NTKs through the lens of scaling laws, and demonstrate that they fall short of explaining important aspects of neural network generalization.
We show that even if the empirical NTK is allowed to be pre-trained on a constant number of samples, the kernel scaling does not catch up to the neural network scaling.
- Score: 13.44676002603497
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The "Neural Tangent Kernel" (NTK) (Jacot et al., 2018) and its empirical
variants have been proposed as a proxy to capture certain behaviors of real
neural networks. In this work, we study NTKs through the lens of scaling laws,
and demonstrate that they fall short of explaining important aspects of neural
network generalization. In particular, we demonstrate realistic settings where
finite-width neural networks have significantly better data scaling exponents
as compared to their corresponding empirical and infinite NTKs at
initialization. This reveals a more fundamental difference between the real
networks and NTKs, beyond just a few percentage points of test accuracy.
Further, we show that even if the empirical NTK is allowed to be pre-trained on
a constant number of samples, the kernel scaling does not catch up to the
neural network scaling. Finally, we show that the empirical NTK continues to
evolve throughout most of the training, in contrast with prior work which
suggests that it stabilizes after a few epochs of training. Altogether, our
work establishes concrete limitations of the NTK approach in understanding
generalization of real networks on natural datasets.
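To make the quantities in the abstract concrete, here is a minimal, purely illustrative NumPy sketch (not code from the paper): it computes the empirical NTK Gram matrix K_ij = <grad_theta f(x_i), grad_theta f(x_j)> of a toy one-hidden-layer ReLU network at initialization, and estimates a data-scaling exponent by fitting test_error(n) ~ c * n^(-alpha) on a log-log scale. The network, inputs, and error values are placeholders, not results from the paper.
```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer ReLU network: f(x) = v^T relu(W x) / sqrt(m).
d, m, n = 10, 256, 50            # input dim, hidden width, number of samples
W = rng.normal(size=(m, d))      # first-layer weights at initialization
v = rng.normal(size=m)           # second-layer weights at initialization
X = rng.normal(size=(n, d))      # placeholder inputs

def param_gradient(x):
    """Gradient of f(x) with respect to all parameters (W, v), flattened."""
    pre = W @ x
    relu = np.maximum(pre, 0.0)
    dW = np.outer(v * (pre > 0), x) / np.sqrt(m)   # d f / d W
    dv = relu / np.sqrt(m)                         # d f / d v
    return np.concatenate([dW.ravel(), dv])

# Empirical NTK Gram matrix at initialization: K_ij = <grad f(x_i), grad f(x_j)>.
J = np.stack([param_gradient(x) for x in X])       # (n, num_params)
K = J @ J.T
print("empirical NTK Gram matrix:", K.shape)

# Data-scaling exponent: fit test_error(n) ~ c * n**(-alpha) on a log-log scale.
# The error values below are made up solely to show the fitting procedure.
ns = np.array([1e3, 2e3, 4e3, 8e3, 16e3])          # training-set sizes
errs = np.array([0.30, 0.24, 0.19, 0.15, 0.12])    # placeholder test errors
slope, _ = np.polyfit(np.log(ns), np.log(errs), 1)
print("estimated scaling exponent alpha:", -slope)
```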
Related papers
- Analyzing the Neural Tangent Kernel of Periodically Activated Coordinate
Networks [30.92757082348805]
We provide a theoretical understanding of periodically activated networks through an analysis of their Neural Tangent Kernel (NTK)
Our findings indicate that periodically activated networks are notably better behaved, from the NTK perspective, than ReLU-activated networks.
arXiv Detail & Related papers (2024-02-07T12:06:52Z)
- Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z)
- Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that these neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z)
- Neural networks trained with SGD learn distributions of increasing complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics.
Higher-order statistics are exploited only later in training.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
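As a rough illustration of what "lower-order" versus "higher-order" input statistics mean here, the hedged sketch below (not the paper's experiment; the data and setup are invented) classifies two synthetic Gaussian classes first using only class means (first-order statistics), then using means plus covariances (second-order statistics).
```python
import numpy as np

rng = np.random.default_rng(3)

# Two synthetic classes that differ in both mean (first order) and
# covariance (second order).
n, d = 2000, 5
A = rng.normal(size=(d, d))
X0 = rng.normal(size=(n, d)) + 1.0    # class 0: shifted mean, identity covariance
X1 = rng.normal(size=(n, d)) @ A      # class 1: zero mean, covariance A^T A
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(n), np.ones(n)])

def gaussian_score(X_ref, X_query, use_cov):
    """Log-density (up to constants) of a Gaussian fit to X_ref.
    With use_cov=False, the covariance is forced to the identity,
    so only the class mean (first-order statistic) is used."""
    mu = X_ref.mean(0)
    cov = np.cov(X_ref.T) if use_cov else np.eye(X_ref.shape[1])
    diff = X_query - mu
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (np.einsum('ij,jk,ik->i', diff, inv, diff) + logdet)

def accuracy(use_cov):
    s0 = gaussian_score(X0, X, use_cov)
    s1 = gaussian_score(X1, X, use_cov)
    return np.mean((s1 > s0) == (y == 1))

print("means only        :", accuracy(use_cov=False))
print("means + covariance:", accuracy(use_cov=True))
```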
arXiv Detail & Related papers (2022-11-21T15:27:22Z)
- Extrapolation and Spectral Bias of Neural Nets with Hadamard Product: a Polynomial Net Study [55.12108376616355]
Study of the NTK has been devoted to typical neural network architectures, but it is incomplete for neural networks with Hadamard products (NNs-Hp).
In this work, we derive the finite-width NTK formulation for a special class of NNs-Hp, i.e., polynomial neural networks.
We prove their equivalence to the kernel regression predictor with the associated NTK, which expands the application scope of NTK.
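For reference, a kernel regression predictor of the kind referred to here has the closed form f(x) = K(x, X)(K(X, X) + lambda*I)^{-1} y. The minimal sketch below is not from the paper: it uses a generic RBF kernel as a stand-in for an NTK Gram matrix and invented toy data, purely to show the predictor.
```python
import numpy as np

def kernel_regression_predict(K_train, K_test_train, y_train, ridge=1e-6):
    """Kernel (ridge) regression predictor:
    f(x) = K(x, X) (K(X, X) + ridge * I)^{-1} y."""
    n = K_train.shape[0]
    alpha = np.linalg.solve(K_train + ridge * np.eye(n), y_train)
    return K_test_train @ alpha

# Toy usage with an RBF kernel standing in for an NTK Gram matrix.
rng = np.random.default_rng(1)
X_train, X_test = rng.normal(size=(40, 5)), rng.normal(size=(10, 5))
y_train = np.sin(X_train[:, 0])

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

preds = kernel_regression_predict(rbf(X_train, X_train),
                                  rbf(X_test, X_train), y_train)
print(preds.shape)  # (10,)
```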
arXiv Detail & Related papers (2022-09-16T06:36:06Z)
- Neural Tangent Kernel Analysis of Deep Narrow Neural Networks [11.623483126242478]
We present the first trainability guarantee of infinitely deep but narrow neural networks.
We then extend the analysis to an infinitely deep convolutional neural network (CNN) and perform brief experiments.
arXiv Detail & Related papers (2022-02-07T07:27:02Z)
- What can linearized neural networks actually say about generalization? [67.83999394554621]
In certain infinitely-wide neural networks, the neural tangent kernel (NTK) theory fully characterizes generalization.
We show that the linear approximations can indeed rank the learning complexity of certain tasks for neural networks.
Our work provides concrete examples of novel deep learning phenomena which can inspire future theoretical research.
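The "linearized network" studied in this line of work is the first-order Taylor expansion of the network around its initialization, f_lin(x; theta) = f(x; theta_0) + <grad_theta f(x; theta_0), theta - theta_0>. Below is a hedged NumPy sketch for a toy one-hidden-layer tanh network; the architecture and dimensions are made up for illustration.
```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 8, 64
W0 = rng.normal(size=(m, d))    # first-layer weights at initialization
v0 = rng.normal(size=m)         # second-layer weights at initialization

def f(W, v, x):
    """Toy one-hidden-layer network with tanh activation."""
    return v @ np.tanh(W @ x) / np.sqrt(m)

def grad_params(W, v, x):
    """Gradient of f with respect to (W, v), returned as a pair."""
    pre = W @ x
    act = np.tanh(pre)
    dv = act / np.sqrt(m)
    dW = np.outer(v * (1.0 - act ** 2), x) / np.sqrt(m)
    return dW, dv

def f_linearized(W, v, x):
    """First-order Taylor expansion of f around (W0, v0):
    f_lin(x) = f0(x) + <grad f0(x), (W, v) - (W0, v0)>."""
    dW, dv = grad_params(W0, v0, x)
    return f(W0, v0, x) + np.sum(dW * (W - W0)) + dv @ (v - v0)

# For a small weight perturbation the linearized model tracks the network closely.
x = rng.normal(size=d)
W1 = W0 + 1e-3 * rng.normal(size=(m, d))
v1 = v0 + 1e-3 * rng.normal(size=m)
print(f(W1, v1, x), f_linearized(W1, v1, x))
```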
arXiv Detail & Related papers (2021-06-12T13:05:11Z)
- Analyzing Finite Neural Networks: Can We Trust Neural Tangent Kernel Theory? [2.0711789781518752]
Neural Tangent Kernel (NTK) theory is widely used to study the dynamics of infinitely-wide deep neural networks (DNNs) under gradient descent.
We study empirically when NTK theory is valid in practice for fully-connected ReLU and sigmoid DNNs.
In particular, NTK theory does not explain the behavior of sufficiently deep networks whose gradients explode as they propagate through the network's layers.
arXiv Detail & Related papers (2020-12-08T15:19:45Z)
- Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)