Connecting NTK and NNGP: A Unified Theoretical Framework for Neural
Network Learning Dynamics in the Kernel Regime
- URL: http://arxiv.org/abs/2309.04522v1
- Date: Fri, 8 Sep 2023 18:00:01 GMT
- Title: Connecting NTK and NNGP: A Unified Theoretical Framework for Neural
Network Learning Dynamics in the Kernel Regime
- Authors: Yehonatan Avidan, Qianyi Li, Haim Sompolinsky
- Abstract summary: We provide a comprehensive framework for understanding the learning process of deep neural networks in the infinite width limit.
We identify two learning phases characterized by different time scales: gradient-driven and diffusive learning.
- Score: 7.136205674624813
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Artificial neural networks have revolutionized machine learning in recent
years, but a complete theoretical framework for their learning process is still
lacking. Substantial progress has been made for infinitely wide networks. In
this regime, two disparate theoretical frameworks have been used, in which the
network's output is described using kernels: one framework is based on the
Neural Tangent Kernel (NTK) which assumes linearized gradient descent dynamics,
while the Neural Network Gaussian Process (NNGP) kernel assumes a Bayesian
framework. However, the relation between these two frameworks has remained
elusive. This work unifies these two distinct theories using a Markov proximal
learning model for learning dynamics in an ensemble of randomly initialized
infinitely wide deep networks. We derive an exact analytical expression for the
network input-output function during and after learning, and introduce a new
time-dependent Neural Dynamical Kernel (NDK) from which both NTK and NNGP
kernels can be derived. We identify two learning phases characterized by
different time scales: gradient-driven and diffusive learning. In the initial
gradient-driven learning phase, the dynamics is dominated by deterministic
gradient descent, and is described by the NTK theory. This phase is followed by
the diffusive learning stage, during which the network parameters sample the
solution space, ultimately approaching the equilibrium distribution
corresponding to NNGP. Combined with numerical evaluations on synthetic and
benchmark datasets, we provide novel insights into the different roles of
initialization, regularization, and network depth, as well as phenomena such as
early stopping and representational drift. This work closes the gap between the
NTK and NNGP theories, providing a comprehensive framework for understanding
the learning process of deep neural networks in the infinite width limit.
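The abstract contrasts two limiting kernel descriptions: NTK-linearized gradient-descent dynamics early in training and the NNGP Bayesian posterior at equilibrium. Below is a minimal NumPy sketch of those two limits only, not the paper's NDK derivation; the kernel callables `k_ntk` and `k_nngp` are assumed to be supplied externally (e.g., analytic infinite-width kernels), and all names here are illustrative.

```python
# Sketch of the two limiting kernel-regime predictors discussed in the abstract.
# Assumes hypothetical kernel callables k_ntk(x1, x2) and k_nngp(x1, x2) that
# return Gram matrices; this is NOT the paper's Neural Dynamical Kernel (NDK).
import numpy as np

def ntk_mean_at_time(k_ntk, x_train, y_train, x_test, eta, t):
    """Ensemble-mean prediction of NTK-linearized full-batch gradient descent
    on MSE loss after training time t (zero-mean initialization ensemble)."""
    K_tt = k_ntk(x_train, x_train)                  # (n, n) train-train NTK
    K_st = k_ntk(x_test, x_train)                   # (m, n) test-train NTK
    y = np.asarray(y_train).reshape(len(K_tt), -1)  # (n, k) targets
    evals, evecs = np.linalg.eigh(K_tt)             # symmetric PSD eigendecomposition
    # (I - exp(-eta * t * K_tt)) K_tt^{-1} y, evaluated in the NTK eigenbasis
    coef = (1.0 - np.exp(-eta * t * evals)) / np.clip(evals, 1e-12, None)
    return K_st @ (evecs @ (coef[:, None] * (evecs.T @ y)))

def nngp_posterior_mean(k_nngp, x_train, y_train, x_test, noise_var):
    """Bayesian posterior mean of the NNGP -- the equilibrium that the
    abstract's diffusive phase approaches."""
    K_tt = k_nngp(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_st = k_nngp(x_test, x_train)
    return K_st @ np.linalg.solve(K_tt, y_train)
```

As t grows, the NTK expression reduces to the standard NTK kernel-regression predictor; under the full noisy dynamics described in the paper, the time-dependent NDK is what governs the crossover from this gradient-driven behavior to the NNGP posterior.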
Related papers
- Infinite Width Limits of Self Supervised Neural Networks [6.178817969919849]
We bridge the gap between the NTK and self-supervised learning, focusing on two-layer neural networks trained under the Barlow Twins loss.
We prove that the NTK of Barlow Twins indeed becomes constant as the width of the network approaches infinity.
arXiv Detail & Related papers (2024-11-17T21:13:57Z)
- Novel Kernel Models and Exact Representor Theory for Neural Networks Beyond the Over-Parameterized Regime [52.00917519626559]
This paper presents two models of neural-networks and their training applicable to neural networks of arbitrary width, depth and topology.
We also present a novel, exact representor theory for layer-wise neural network training with unregularized gradient descent in terms of a local-extrinsic neural kernel (LeNK).
This representor theory gives insight into the role of higher-order statistics in neural network training and the effect of kernel evolution in neural-network kernel models.
arXiv Detail & Related papers (2024-05-24T06:30:36Z)
- Demystifying Lazy Training of Neural Networks from a Macroscopic Viewpoint [5.9954962391837885]
We study the gradient descent dynamics of neural networks through the lens of macroscopic limits.
Our study reveals that gradient descent can rapidly drive deep neural networks to zero training loss.
Our approach draws inspiration from the Neural Tangent Kernel (NTK) paradigm.
arXiv Detail & Related papers (2024-04-07T08:07:02Z)
- A Unified Kernel for Neural Network Learning [4.0759204898334715]
We present the Unified Neural Kernel (UNK), which characterizes the learning dynamics of neural networks trained with gradient descent.
UNK maintains the limiting properties of both NNGP and NTK, exhibiting behaviors akin to NTK with a finite learning step.
We also theoretically characterize the uniform tightness and learning convergence of the UNK kernel.
arXiv Detail & Related papers (2024-03-26T07:55:45Z)
- Speed Limits for Deep Learning [67.69149326107103]
Recent advances in thermodynamics allow bounding the speed at which one can go from the initial weight distribution to the final distribution of the fully trained network.
We provide analytical expressions for these speed limits for linear and linearizable neural networks.
Remarkably, given some plausible scaling assumptions on the NTK spectrum and the spectral decomposition of the labels, learning is optimal in a scaling sense.
arXiv Detail & Related papers (2023-07-27T06:59:46Z)
- Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z)
- What can linearized neural networks actually say about generalization? [67.83999394554621]
In certain infinitely-wide neural networks, the neural tangent kernel (NTK) theory fully characterizes generalization.
We show that the linear approximations can indeed rank the learning complexity of certain tasks for neural networks.
Our work provides concrete examples of novel deep learning phenomena which can inspire future theoretical research.
arXiv Detail & Related papers (2021-06-12T13:05:11Z)
- Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
- On the Empirical Neural Tangent Kernel of Standard Finite-Width Convolutional Neural Network Architectures [3.4698840925433765]
It remains an open question how well NTK theory models standard neural network architectures of widths common in practice.
We study this question empirically for two well-known convolutional neural network architectures, namely AlexNet and LeNet.
For wider versions of these networks, where the number of channels and widths of fully-connected layers are increased, the deviation decreases.
arXiv Detail & Related papers (2020-06-24T11:40:36Z)
- A Generalized Neural Tangent Kernel Analysis for Two-layer Neural Networks [87.23360438947114]
We show that noisy gradient descent with weight decay can still exhibit a "kernel-like" behavior.
This implies that the training loss converges linearly up to a certain accuracy.
We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay.
arXiv Detail & Related papers (2020-02-10T18:56:15Z)
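The last entry above, like the main paper's diffusive phase, concerns noisy gradient descent with weight decay (Langevin-type updates), under which the parameters sample the solution space rather than simply minimize the loss. The sketch below shows only the generic form of such an update on a toy two-layer network; the widths, step size, temperature, and weight decay are illustrative assumptions, not values from any of the papers listed here.

```python
# Hedged sketch of noisy gradient descent with weight decay (Langevin-type
# updates) on a toy two-layer tanh network; all hyperparameters are placeholders.
import numpy as np

rng = np.random.default_rng(0)
n, d, width = 64, 5, 512
lr, weight_decay, temperature = 1e-2, 1e-3, 1e-4

X = rng.standard_normal((n, d))
y = np.sin(X[:, 0:1])                          # toy regression targets
W1 = rng.standard_normal((d, width)) / np.sqrt(d)
W2 = rng.standard_normal((width, 1)) / np.sqrt(width)

for step in range(5000):
    h = np.tanh(X @ W1)                        # hidden activations
    err = h @ W2 - y                           # residuals for MSE loss
    grad_W2 = h.T @ err / n
    grad_W1 = X.T @ (err @ W2.T * (1 - h**2)) / n
    # Langevin update: gradient + weight decay (L2 regularization) + Gaussian noise
    for W, g in ((W1, grad_W1), (W2, grad_W2)):
        W -= lr * (g + weight_decay * W)
        W += np.sqrt(2 * lr * temperature) * rng.standard_normal(W.shape)
```

With the noise term removed this reduces to plain regularized gradient descent (the NTK-style regime); with the noise retained, long runs sample an equilibrium distribution, which is the mechanism the diffusive phase and the NNGP correspondence rely on.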