Fast Finite Width Neural Tangent Kernel
- URL: http://arxiv.org/abs/2206.08720v1
- Date: Fri, 17 Jun 2022 12:18:22 GMT
- Title: Fast Finite Width Neural Tangent Kernel
- Authors: Roman Novak, Jascha Sohl-Dickstein, Samuel S. Schoenholz
- Abstract summary: The Neural Tangent Kernel (NTK), built from the neural network Jacobian, has emerged as a central object of study in deep learning.
The finite width NTK is notoriously expensive to compute.
We propose two novel algorithms that change the exponent of the compute and memory requirements of the finite width NTK.
- Score: 47.57136433797996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Neural Tangent Kernel (NTK), defined as $\Theta_\theta^f(x_1, x_2) =
\left[\partial f(\theta, x_1)\big/\partial \theta\right] \left[\partial
f(\theta, x_2)\big/\partial \theta\right]^T$ where $\left[\partial f(\theta,
\cdot)\big/\partial \theta\right]$ is a neural network (NN) Jacobian, has
emerged as a central object of study in deep learning. In the infinite width
limit, the NTK can sometimes be computed analytically and is useful for
understanding training and generalization of NN architectures. At finite
widths, the NTK is also used to better initialize NNs, compare the conditioning
across models, perform architecture search, and do meta-learning.
Unfortunately, the finite width NTK is notoriously expensive to compute, which
severely limits its practical utility. We perform the first in-depth analysis
of the compute and memory requirements for NTK computation in finite width
networks. Leveraging the structure of neural networks, we further propose two
novel algorithms that change the exponent of the compute and memory
requirements of the finite width NTK, dramatically improving efficiency. Our
algorithms can be applied in a black box fashion to any differentiable
function, including those implementing neural networks. We open-source our
implementations within the Neural Tangents package (arXiv:1912.02803) at
https://github.com/google/neural-tangents.
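For reference, the defining contraction above is straightforward (if expensive) to evaluate directly in JAX. The sketch below is a naive baseline, not the paper's fast algorithms: it computes $\Theta = J(x_1) J(x_2)^T$ by explicit Jacobian contraction for an assumed toy MLP; in practice one would use the open-sourced routines in Neural Tangents instead.
```python
# Minimal sketch (assumed toy MLP): the NTK evaluated directly from its
# definition, Theta(x1, x2) = J(x1) J(x2)^T with J = d f(params, x) / d params.
# This is the naive reference computation, not the paper's fast algorithms.
import jax
import jax.numpy as jnp

def apply_fn(params, x):
    """Toy two-layer MLP: x -> relu(x W1) W2, outputs shape (N, O)."""
    w1, w2 = params
    return jnp.maximum(x @ w1, 0.0) @ w2

def ntk_direct(params, x1, x2):
    """Contract per-example Jacobians over every parameter tensor."""
    j1 = jax.jacobian(apply_fn)(params, x1)  # leaves: (N1, O, *param.shape)
    j2 = jax.jacobian(apply_fn)(params, x2)  # leaves: (N2, O, *param.shape)
    def contract(a, b):
        a = a.reshape(a.shape[0], a.shape[1], -1)          # (N1, O, P)
        b = b.reshape(b.shape[0], b.shape[1], -1)          # (N2, O, P)
        return jnp.einsum('iap,jbp->ijab', a, b)           # (N1, N2, O, O)
    return sum(contract(a, b) for a, b in
               zip(jax.tree_util.tree_leaves(j1), jax.tree_util.tree_leaves(j2)))

key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
params = (jax.random.normal(k1, (8, 16)), jax.random.normal(k2, (16, 3)))
x1, x2 = jax.random.normal(k3, (5, 8)), jax.random.normal(k4, (7, 8))
print(ntk_direct(params, x1, x2).shape)  # (5, 7, 3, 3)
```
Both the compute and the memory of this direct approach scale poorly with the output dimension and parameter count, which is exactly the cost the paper's two algorithms reduce.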
Related papers
- LinSATNet: The Positive Linear Satisfiability Neural Networks [116.65291739666303]
This paper studies how to introduce the popular positive linear satisfiability to neural networks.
We propose the first differentiable satisfiability layer based on an extension of the classic Sinkhorn algorithm for jointly encoding multiple sets of marginal distributions.
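The Sinkhorn building block mentioned above is simple to sketch. The snippet below shows only the classic single-matrix iteration (alternating row and column rescaling toward target marginals); LinSATNet's jointly-encoded multi-set extension is more involved, and all names and sizes here are illustrative assumptions rather than that paper's layer.
```python
# Illustrative classic Sinkhorn normalization (not LinSATNet's multi-set
# extension): alternately rescale rows and columns of a positive matrix so
# its marginals approach the targets r and c. Differentiable end to end.
import jax.numpy as jnp

def sinkhorn(scores, r, c, n_iters=50, eps=1e-8):
    p = jnp.exp(scores)
    for _ in range(n_iters):
        p = p * (r / (p.sum(axis=1) + eps))[:, None]   # match row sums
        p = p * (c / (p.sum(axis=0) + eps))[None, :]   # match column sums
    return p

scores = jnp.array([[1.0, 0.2], [0.3, 2.0]])
r = c = jnp.array([1.0, 1.0])                          # target marginals
p = sinkhorn(scores, r, c)
print(p.sum(axis=0), p.sum(axis=1))                    # both ~[1., 1.]
```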
arXiv Detail & Related papers (2024-07-18T22:05:21Z)
- The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich Regimes [75.59720049837459]
We study the transition from infinite-width behavior to this variance limited regime as a function of sample size $P$ and network width $N$.
We find that finite-size effects can become relevant for very small datasets on the order of $P^* \sim \sqrt{N}$ for regression with ReLU networks.
arXiv Detail & Related papers (2022-12-23T04:48:04Z)
- Robust Training and Verification of Implicit Neural Networks: A Non-Euclidean Contractive Approach [64.23331120621118]
This paper proposes a theoretical and computational framework for training and robustness verification of implicit neural networks.
We introduce a related embedded network and show that the embedded network can be used to provide an $\ell_\infty$-norm box over-approximation of the reachable sets of the original network.
We apply our algorithms to train implicit neural networks on the MNIST dataset and compare the robustness of our models with the models trained via existing approaches in the literature.
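As a rough illustration of what an $\ell_\infty$-norm box over-approximation of a reachable set looks like (a generic interval-arithmetic bound, not the paper's embedded-network construction), the sketch below propagates an input box through a single affine-plus-ReLU layer; the weights and radii are assumptions.
```python
# Generic interval-arithmetic sketch of an l_infty box over-approximation of
# one layer's reachable set (not the embedded-network construction above).
import jax.numpy as jnp

def affine_relu_box(lo, hi, w, b):
    """Elementwise bounds on relu(x @ w + b) for all x with lo <= x <= hi."""
    w_pos, w_neg = jnp.maximum(w, 0.0), jnp.minimum(w, 0.0)
    out_lo = lo @ w_pos + hi @ w_neg + b
    out_hi = hi @ w_pos + lo @ w_neg + b
    return jnp.maximum(out_lo, 0.0), jnp.maximum(out_hi, 0.0)

x = jnp.array([0.5, -0.2])            # nominal input
lo, hi = x - 0.1, x + 0.1             # l_infty ball of radius 0.1 as a box
w = jnp.array([[1.0, -2.0], [0.5, 0.3]])
b = jnp.array([0.0, 0.1])
print(affine_relu_box(lo, hi, w, b))  # box containing every reachable output
```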
arXiv Detail & Related papers (2022-08-08T03:13:24Z)
- A Fast, Well-Founded Approximation to the Empirical Neural Tangent Kernel [6.372625755672473]
Empirical neural kernels (eNTKs) can provide a good understanding of a given network's representation.
For networks with O output units, the eNTK on N inputs is of size $NO \times NO$, taking $O((NO)^2)$ memory and up to $O((NO)^3)$ computation to use.
Most existing applications have used one of a handful of approximations yielding $N times N$ kernel matrices.
We prove that one such approximation, which we call "sum of logits", converges to the true eNTK.
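To make those sizes concrete, the sketch below contrasts the full eNTK (an $NO \times NO$ object) with the $N \times N$ kernel obtained by differentiating the summed logits, which is one natural reading of the "sum of logits" approximation; the toy network and helper names are assumptions, not code from that paper.
```python
# Sketch: full eNTK of shape (N1, O, N2, O) versus an (N1, N2) kernel built
# from the gradient of the summed logits (one reading of "sum of logits").
# The toy MLP and helper are illustrative assumptions.
import jax
import jax.numpy as jnp

def apply_fn(params, x):
    w1, w2 = params
    return jnp.maximum(x @ w1, 0.0) @ w2                  # logits, shape (N, O)

def jac_kernel(f, params, x1, x2, out_dims):
    """Theta = J(x1) J(x2)^T with all parameter axes flattened away."""
    def flat_jac(x):
        leaves = jax.tree_util.tree_leaves(jax.jacobian(f)(params, x))
        return jnp.concatenate(
            [l.reshape(l.shape[:1 + out_dims] + (-1,)) for l in leaves], axis=-1)
    j1, j2 = flat_jac(x1), flat_jac(x2)
    return jnp.tensordot(j1, j2, axes=((j1.ndim - 1,), (j2.ndim - 1,)))

params = (0.1 * jnp.ones((8, 16)), 0.1 * jnp.ones((16, 3)))    # O = 3 logits
x1, x2 = jnp.ones((5, 8)), jnp.ones((7, 8))

print(jac_kernel(apply_fn, params, x1, x2, out_dims=1).shape)  # (5, 3, 7, 3)

sum_logits = lambda p, x: apply_fn(p, x).sum(axis=-1)          # scalar output
print(jac_kernel(sum_logits, params, x1, x2, out_dims=0).shape)  # (5, 7)
```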
arXiv Detail & Related papers (2022-06-25T03:02:35Z)
- Memorization and Optimization in Deep Neural Networks with Minimum Over-parameterization [14.186776881154127]
The Neural Tangent Kernel (NTK) has emerged as a powerful tool to provide memorization, optimization and generalization guarantees in deep neural networks.
We show that the NTK is well conditioned in a challenging sub-linear setup.
Our key technical contribution is a lower bound on the smallest NTK eigenvalue for deep networks.
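Purely as an illustrative counterpart to the eigenvalue statement above (the paper's contribution is a theoretical lower bound, not code), one can inspect the spectrum of a finite NTK Gram matrix directly; the random Jacobian stand-in below is an assumption.
```python
# Illustrative conditioning check: eigen-decompose an N x N NTK Gram matrix.
# A random per-example gradient matrix stands in for a real network Jacobian.
import jax
import jax.numpy as jnp

j = jax.random.normal(jax.random.PRNGKey(0), (32, 500))  # (N, num_params) stand-in
ntk = j @ j.T                                            # empirical NTK Gram, (N, N)
eigs = jnp.linalg.eigvalsh(ntk)                          # ascending eigenvalues
print(eigs[0], eigs[-1], eigs[-1] / eigs[0])             # lambda_min, lambda_max, condition number
```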
arXiv Detail & Related papers (2022-05-20T14:50:24Z)
- Scaling Neural Tangent Kernels via Sketching and Random Features [53.57615759435126]
Recent works report that NTK regression can outperform finitely-wide neural networks trained on small-scale datasets.
We design a near input-sparsity time approximation algorithm for NTK, by sketching the expansions of arc-cosine kernels.
We show that a linear regressor trained on our CNTK features matches the accuracy of the exact CNTK on the CIFAR-10 dataset while achieving a 150x speedup.
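The classical building block behind that construction is the random-features view of arc-cosine kernels: ReLU features of Gaussian projections approximate the order-1 arc-cosine kernel in expectation. The sketch below illustrates only this textbook fact, not the paper's sketching algorithm; sizes are assumptions.
```python
# Illustrative random features for the order-1 arc-cosine kernel:
# E_w[relu(w.x) relu(w.y)], w ~ N(0, I), equals
# (|x||y| / (2 pi)) * (sin t + (pi - t) cos t), with t the angle between x, y.
# This is the textbook building block, not the paper's sketching algorithm.
import jax
import jax.numpy as jnp

def arccos1(x, y):
    nx, ny = jnp.linalg.norm(x), jnp.linalg.norm(y)
    t = jnp.arccos(jnp.clip(x @ y / (nx * ny), -1.0, 1.0))
    return nx * ny / (2 * jnp.pi) * (jnp.sin(t) + (jnp.pi - t) * jnp.cos(t))

def relu_features(x, w):
    return jnp.maximum(w @ x, 0.0) / jnp.sqrt(w.shape[0])  # phi(x), m features

x = jnp.array([1.0, 0.5, -0.3])
y = jnp.array([0.2, -0.1, 0.8])
w = jax.random.normal(jax.random.PRNGKey(0), (20000, 3))   # m = 20000 projections
print(arccos1(x, y), relu_features(x, w) @ relu_features(y, w))  # close values
```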
arXiv Detail & Related papers (2021-06-15T04:44:52Z)
- Feature Learning in Infinite-Width Neural Networks [17.309380337367536]
We show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features.
We propose simple modifications to the standard parametrization to allow for feature learning in the limit.
arXiv Detail & Related papers (2020-11-30T03:21:05Z)
- Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
- On the Empirical Neural Tangent Kernel of Standard Finite-Width Convolutional Neural Network Architectures [3.4698840925433765]
It remains an open question how well NTK theory models standard neural network architectures of widths common in practice.
We study this question empirically for two well-known convolutional neural network architectures, namely AlexNet and LeNet.
For wider versions of these networks, where the number of channels and the widths of fully-connected layers are increased, the deviation from the NTK-theory predictions decreases.
arXiv Detail & Related papers (2020-06-24T11:40:36Z)