Fast Neural Kernel Embeddings for General Activations
- URL: http://arxiv.org/abs/2209.04121v1
- Date: Fri, 9 Sep 2022 05:10:39 GMT
- Title: Fast Neural Kernel Embeddings for General Activations
- Authors: Insu Han, Amir Zandieh, Jaehoon Lee, Roman Novak, Lechao Xiao, Amin
Karbasi
- Abstract summary: We propose a fast sketching method that approximates any multi-layered Neural Network Gaussian Process (NNGP) kernel and Neural Tangent Kernel (NTK) matrices for a wide range of activation functions.
Our method achieves a $106\times$ speedup for the approximate CNTK of a 5-layer Myrtle network on the CIFAR-10 dataset.
- Score: 44.665680851537005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The infinite-width limit has shed light on generalization and optimization
aspects of deep learning by establishing connections between neural networks
and kernel methods. Despite their importance, the utility of these kernel
methods was limited in large-scale learning settings due to their
(super-)quadratic runtime and memory complexities. Moreover, most prior works
on neural kernels have focused on the ReLU activation, mainly due to its
popularity but also due to the difficulty of computing such kernels for general
activations. In this work, we overcome such difficulties by providing methods
to work with general activations. First, we compile and expand the list of
activation functions admitting exact dual activation expressions to compute
neural kernels. When the exact computation is unknown, we present methods to
effectively approximate them. We propose a fast sketching method that
approximates any multi-layered Neural Network Gaussian Process (NNGP) kernel
and Neural Tangent Kernel (NTK) matrices for a wide range of activation
functions, going beyond the commonly analyzed ReLU activation. This is done by
showing how to approximate the neural kernels using the truncated Hermite
expansion of any desired activation function. While most prior works require
data points on the unit sphere, our methods do not suffer from such limitations
and are applicable to any dataset of points in $\mathbb{R}^d$. Furthermore, we
provide a subspace embedding for NNGP and NTK matrices with near input-sparsity
runtime and near-optimal target dimension which applies to any
\emph{homogeneous} dual activation function with a rapidly convergent Taylor
expansion. Empirically, with respect to exact convolutional NTK (CNTK)
computation, our method achieves a $106\times$ speedup for the approximate CNTK of a 5-layer Myrtle network on the CIFAR-10 dataset.
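To make the truncated-Hermite idea concrete, here is a minimal NumPy sketch (not the authors' implementation): it estimates the Hermite coefficients of an activation by Gauss-Hermite quadrature and approximates the dual activation $\mathbb{E}[\sigma(u)\sigma(v)]$ by the truncated series $\sum_k a_k^2 \rho^k$ for unit-variance inputs with correlation $\rho$. The GELU example, the truncation level, and the helper names are illustrative choices, not part of the paper.

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermegauss, hermeval
from scipy.special import erf

# Gauss-Hermite quadrature for the standard Gaussian measure:
# hermegauss integrates f(x) * exp(-x^2 / 2) dx, so divide by sqrt(2*pi).
nodes, weights = hermegauss(100)
weights = weights / np.sqrt(2.0 * np.pi)

def hermite_coefficients(activation, truncation):
    """a_k = E[activation(g) * h_k(g)], g ~ N(0, 1), where h_k = He_k / sqrt(k!)
    are the normalized probabilists' Hermite polynomials."""
    coeffs = []
    for k in range(truncation + 1):
        basis = np.zeros(k + 1)
        basis[k] = 1.0
        h_k = hermeval(nodes, basis) / np.sqrt(factorial(k))
        coeffs.append(np.sum(weights * activation(nodes) * h_k))
    return np.array(coeffs)

def truncated_dual_activation(coeffs, rho):
    """E[activation(u) * activation(v)] ~= sum_k a_k^2 * rho^k for (u, v)
    jointly standard Gaussian with correlation rho (unit-variance inputs)."""
    return np.sum(coeffs ** 2 * rho ** np.arange(len(coeffs)))

# Example: GELU, an activation without the simple closed form that ReLU enjoys.
gelu = lambda x: 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

a = hermite_coefficients(gelu, truncation=25)
rho = 0.4
approx = truncated_dual_activation(a, rho)

# Monte Carlo reference for E[gelu(u) * gelu(v)] with corr(u, v) = rho.
rng = np.random.default_rng(0)
g1, g2 = rng.standard_normal((2, 1_000_000))
u, v = g1, rho * g1 + np.sqrt(1.0 - rho**2) * g2
print(approx, np.mean(gelu(u) * gelu(v)))  # the two numbers should be close
```

Stacking such truncated dual activations layer by layer is what yields multi-layer NNGP/NTK approximations; the sketch above only shows the single-layer building block.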
Related papers
- Mean-field Analysis on Two-layer Neural Networks from a Kernel Perspective [40.69646918673903]
We show that two-layer neural networks can learn a union of multiple reproducing kernel Hilbert spaces more efficiently than any kernel methods.
We also develop a label-noise procedure that converges to the global optimum and show that the degrees of freedom appear as an implicit regularization.
arXiv Detail & Related papers (2024-03-22T02:41:57Z)
- Controlling the Inductive Bias of Wide Neural Networks by Modifying the Kernel's Spectrum [18.10812063219831]
We introduce Modified Spectrum Kernels (MSKs) to approximate kernels with desired eigenvalues.
We propose a preconditioned gradient descent method, which alters the trajectory of gradient descent.
Our method is both computationally efficient and simple to implement.
arXiv Detail & Related papers (2023-07-26T22:39:47Z)
- Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z)
- Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
arXiv Detail & Related papers (2022-10-21T15:56:13Z)
- Scaling Neural Tangent Kernels via Sketching and Random Features [53.57615759435126]
Recent works report that NTK regression can outperform finitely-wide neural networks trained on small-scale datasets.
We design a near input-sparsity time approximation algorithm for NTK, by sketching the expansions of arc-cosine kernels.
We show that a linear regressor trained on our CNTK features matches the accuracy of the exact CNTK on the CIFAR-10 dataset while achieving a 150x speedup.
arXiv Detail & Related papers (2021-06-15T04:44:52Z)
- The Fast Kernel Transform [21.001203328543006]
We propose the Fast Kernel Transform (FKT), a general algorithm to compute matrix-vector multiplications for datasets in moderate dimensions with quasilinear complexity.
The FKT is easily applied to a broad class of kernels, including Gaussian, Matérn, and Rational Quadratic covariance functions and physically motivated Green's functions.
We illustrate the efficacy and versatility of the FKT by providing timing and accuracy benchmarks and by applying it to scale stochastic neighbor embedding (t-SNE) and Gaussian processes to large real-world data sets.
arXiv Detail & Related papers (2021-06-08T16:15:47Z)
- Random Features for the Neural Tangent Kernel [57.132634274795066]
We propose an efficient feature map construction for the Neural Tangent Kernel (NTK) of a fully-connected ReLU network.
We show that the dimension of the resulting features is much smaller than that of other baseline feature map constructions while achieving comparable error bounds both in theory and practice.
arXiv Detail & Related papers (2021-04-03T09:08:12Z)
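As a generic illustration of NTK random features (a plain Monte Carlo construction under a standard NTK parameterization with no biases, not necessarily the feature map proposed in the entry above), the sketch below approximates the one-hidden-layer ReLU NTK Theta(x, y) = k1(x, y) + <x, y> * k0(x, y), where k0 and k1 are the degree-0 and degree-1 arc-cosine kernels. All names and the feature count m are illustrative.

```python
import numpy as np

def relu_ntk_exact(x, y):
    """NTK of a one-hidden-layer ReLU network (no biases, both layers trained):
    Theta(x, y) = k1(x, y) + <x, y> * k0(x, y), with arc-cosine kernels
    k0 = (pi - theta) / pi and
    k1 = ||x|| * ||y|| * (sin(theta) + (pi - theta) * cos(theta)) / pi."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    cos = np.clip(x @ y / (nx * ny), -1.0, 1.0)
    theta = np.arccos(cos)
    k0 = (np.pi - theta) / np.pi
    k1 = nx * ny * (np.sin(theta) + (np.pi - theta) * cos) / np.pi
    return k1 + (x @ y) * k0

def relu_ntk_features(X, m, seed=0):
    """Random features whose inner products approximate Theta.
    First block ~ k1 (ReLU of random projections); second block ~ <x, y> * k0
    (ReLU derivative, i.e. an indicator, times an independent random projection)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, m))   # directions for the activation part
    V = rng.standard_normal((d, m))   # independent directions for the gradient part
    pre = X @ W
    phi1 = np.sqrt(2.0 / m) * np.maximum(pre, 0.0)
    phi0 = np.sqrt(2.0 / m) * (pre > 0.0) * (X @ V)
    return np.concatenate([phi1, phi0], axis=1)

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 16))
Phi = relu_ntk_features(X, m=200_000)
print(np.round(Phi @ Phi.T, 3))                                    # approximate NTK
print(np.round([[relu_ntk_exact(a, b) for b in X] for a in X], 3)) # exact NTK
```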
- Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
- Avoiding Kernel Fixed Points: Computing with ELU and GELU Infinite Networks [12.692279981822011]
We derive the covariance functions of multi-layer perceptrons with exponential linear units (ELU) and Gaussian error linear units (GELU).
We analyse the fixed-point dynamics of iterated kernels corresponding to a broad range of activation functions.
We find that unlike some previously studied neural network kernels, these new kernels exhibit non-trivial fixed-point dynamics.
arXiv Detail & Related papers (2020-02-20T01:25:39Z)
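To visualize the fixed-point dynamics discussed in the last entry, here is a small Monte Carlo sketch (illustrative only, not the paper's derivation) that iterates the layer-to-layer correlation map of an infinitely wide MLP, assuming the pre-activation variance is renormalized to one at every layer; GELU is used as the example activation.

```python
import numpy as np
from scipy.special import erf

def correlation_map(activation, rho, n_samples=2_000_000, seed=0):
    """One NNGP layer step for correlations: rho -> E[f(u) f(v)] / E[f(u)^2],
    with (u, v) jointly standard Gaussian and corr(u, v) = rho.  Dividing by
    E[f(u)^2] keeps the iterate a correlation, i.e. it assumes the variance is
    rescaled to one at each layer (an assumption made here for illustration)."""
    rng = np.random.default_rng(seed)
    g1, g2 = rng.standard_normal((2, n_samples))
    u, v = g1, rho * g1 + np.sqrt(1.0 - rho**2) * g2
    return np.mean(activation(u) * activation(v)) / np.mean(activation(u) ** 2)

gelu = lambda x: 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

rho = 0.2
for depth in range(1, 11):
    rho = correlation_map(gelu, rho)
    print(f"depth {depth:2d}: correlation {rho:.4f}")
# For many activations the iterate drifts toward a fixed point as depth grows,
# which is the kind of degeneracy the entry above analyses for ELU and GELU.
```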