Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural
Networks
- URL: http://arxiv.org/abs/2002.02561v7
- Date: Thu, 25 Feb 2021 18:40:10 GMT
- Title: Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural
Networks
- Authors: Blake Bordelon, Abdulkadir Canatar, Cengiz Pehlevan
- Abstract summary: We derive analytical expressions for the generalization performance of kernel regression as a function of the number of training samples.
Our expressions apply to wide neural networks due to an equivalence between training them and kernel regression with the Neural Tangent Kernel (NTK).
We verify our theory with simulations on synthetic data and the MNIST dataset.
- Score: 17.188280334580195
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We derive analytical expressions for the generalization performance of kernel
regression as a function of the number of training samples using theoretical
methods from Gaussian processes and statistical physics. Our expressions apply
to wide neural networks due to an equivalence between training them and kernel
regression with the Neural Tangent Kernel (NTK). By computing the decomposition
of the total generalization error due to different spectral components of the
kernel, we identify a new spectral principle: as the size of the training set
grows, kernel machines and neural networks fit successively higher spectral
modes of the target function. When data are sampled from a uniform distribution
on a high-dimensional hypersphere, dot product kernels, including NTK, exhibit
learning stages where different frequency modes of the target function are
learned. We verify our theory with simulations on synthetic data and the MNIST
dataset.
Related papers
- Neural Tangent Kernels Motivate Graph Neural Networks with
Cross-Covariance Graphs [94.44374472696272]
We investigate NTKs and alignment in the context of graph neural networks (GNNs).
Our results establish theoretical guarantees on the optimality of the alignment for a two-layer GNN.
These guarantees are characterized by the graph shift operator being a function of the cross-covariance between the input and the output data.
arXiv Detail & Related papers (2023-10-16T19:54:21Z) - A theory of data variability in Neural Network Bayesian inference [0.70224924046445]
We provide a field-theoretic formalism which covers the generalization properties of infinitely wide networks.
We derive the generalization properties from the statistical properties of the input.
We show that data variability leads to a non-Gaussian action reminiscent of a $(\varphi^3 + \varphi^4)$-theory.
arXiv Detail & Related papers (2023-07-31T14:11:32Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - Spectral Complexity-scaled Generalization Bound of Complex-valued Neural
Networks [78.64167379726163]
This paper is the first to prove a generalization bound for complex-valued neural networks.
We conduct experiments by training complex-valued convolutional neural networks on different datasets.
arXiv Detail & Related papers (2021-12-07T03:25:25Z) - Uniform Generalization Bounds for Overparameterized Neural Networks [5.945320097465419]
We prove uniform generalization bounds for overparameterized neural networks in kernel regimes.
Our bounds capture the exact error rates depending on the differentiability of the activation functions.
We show the equivalence between the RKHS corresponding to the NT kernel and its counterpart corresponding to the Matérn family of kernels.
arXiv Detail & Related papers (2021-09-13T16:20:13Z) - Scaling Neural Tangent Kernels via Sketching and Random Features [53.57615759435126]
Recent works report that NTK regression can outperform finite-width neural networks trained on small-scale datasets.
We design a near input-sparsity time approximation algorithm for NTK, by sketching the expansions of arc-cosine kernels.
We show that a linear regressor trained on our CNTK features matches the accuracy of exact CNTK on the CIFAR-10 dataset while achieving a 150x speedup.
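The arc-cosine expansion underlying such approximations can be illustrated with plain random ReLU features. This is a minimal sketch of the classical Cho-Saul construction, not the paper's sketching algorithm; the function names and unit-norm assumption are ours.

```python
import numpy as np

def arccos1_kernel(X, Y):
    # Exact first-order arc-cosine kernel for unit-norm rows (Cho & Saul).
    cos = np.clip(X @ Y.T, -1.0, 1.0)
    theta = np.arccos(cos)
    return (np.sin(theta) + (np.pi - theta) * cos) / np.pi

def relu_features(X, m, seed=0):
    # m random ReLU features; E[phi(x) . phi(y)] equals the kernel above.
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], m))
    return np.sqrt(2.0 / m) * np.maximum(X @ W, 0.0)
```

A linear model on `relu_features(X, m)` then approximates exact kernel regression, with error shrinking like 1/sqrt(m); the paper's contribution is performing this kind of approximation far more efficiently for the NTK and CNTK.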
arXiv Detail & Related papers (2021-06-15T04:44:52Z) - Out-of-Distribution Generalization in Kernel Regression [21.958028127426196]
We study generalization in kernel regression when the training and test distributions are different.
We identify an overlap matrix that quantifies the mismatch between distributions for a given kernel.
We develop procedures for optimizing training and test distributions for a given data budget to find best and worst case generalizations under the shift.
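The train/test mismatch studied here is easy to reproduce in a toy setting. The sketch below is our illustration (the paper's overlap-matrix machinery is not implemented; the kernel, target, and distributions are our choices): it fits kernel ridge regression on one input distribution and evaluates it on a shifted one.

```python
import numpy as np

def rbf(x, y, ls=0.5):
    # Squared-exponential kernel on the real line.
    return np.exp(-np.subtract.outer(x, y) ** 2 / (2.0 * ls**2))

def ood_mse(shift, n_train=100, n_test=500, ridge=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    target = np.sin
    x_tr = rng.uniform(-2.0, 2.0, n_train)         # training distribution
    x_te = rng.uniform(-2.0, 2.0, n_test) + shift  # shifted test distribution
    alpha = np.linalg.solve(rbf(x_tr, x_tr) + ridge * np.eye(n_train),
                            target(x_tr))
    return np.mean((rbf(x_te, x_tr) @ alpha - target(x_te)) ** 2)
```

In-distribution (`shift=0`) the error is tiny; as the test distribution moves outside the training support, the predictor decays toward zero and the error grows, which is the regime the overlap matrix quantifies.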
arXiv Detail & Related papers (2021-06-04T04:54:25Z) - Random Features for the Neural Tangent Kernel [57.132634274795066]
We propose an efficient feature map construction for the Neural Tangent Kernel (NTK) of a fully-connected ReLU network.
We show that the dimension of the resulting features is much smaller than that of other baseline feature map constructions achieving comparable error bounds, both in theory and in practice.
arXiv Detail & Related papers (2021-04-03T09:08:12Z) - Double-descent curves in neural networks: a new perspective using
Gaussian processes [9.153116600213641]
Double-descent curves in neural networks describe the phenomenon that the generalisation error first descends with increasing parameters, rises near the interpolation threshold, and then descends again as the number of parameters grows further.
We use techniques from random matrix theory to characterize the spectral distribution of the empirical feature covariance matrix as a width-dependent perturbation of the spectrum of the neural network Gaussian process kernel.
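One ingredient of this analysis can be sketched directly: at finite width, the eigenvalues of the empirical feature Gram fluctuate around those of the infinite-width Gaussian process kernel, and the fluctuation shrinks as the width grows. The snippet below is our illustration, using a one-hidden-layer ReLU network whose infinite-width kernel is the first-order arc-cosine kernel; it is not the paper's random-matrix calculation.

```python
import numpy as np

def relu_gp_kernel(X):
    # Infinite-width GP kernel of a one-hidden-layer ReLU net, unit-norm rows.
    cos = np.clip(X @ X.T, -1.0, 1.0)
    th = np.arccos(cos)
    return (np.sin(th) + (np.pi - th) * cos) / np.pi

def gram_spectrum(X, width, seed=0):
    # Eigenvalues of the empirical feature Gram at the given hidden width.
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], width))
    Phi = np.sqrt(2.0 / width) * np.maximum(X @ W, 0.0)
    return np.sort(np.linalg.eigvalsh(Phi @ Phi.T))[::-1]
```

As the width grows, `gram_spectrum` converges to the spectrum of `relu_gp_kernel`; the paper characterizes the width-dependent deviation between the two.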
arXiv Detail & Related papers (2021-02-14T20:31:49Z) - Spectral Bias and Task-Model Alignment Explain Generalization in Kernel
Regression and Infinitely Wide Neural Networks [17.188280334580195]
Generalization beyond a training dataset is a main goal of machine learning.
Recent observations in deep neural networks contradict conventional wisdom from classical statistics.
We show that more data may impair generalization when the target is noisy or not expressible by the kernel.
arXiv Detail & Related papers (2020-06-23T17:53:11Z) - Multipole Graph Neural Operator for Parametric Partial Differential
Equations [57.90284928158383]
One of the main challenges in using deep learning-based methods for simulating physical systems is formulating physics-based data in a form suitable for neural networks.
We propose a novel multi-level graph neural network framework that captures interaction at all ranges with only linear complexity.
Experiments confirm our multi-graph network learns discretization-invariant solution operators to PDEs and can be evaluated in linear time.
arXiv Detail & Related papers (2020-06-16T21:56:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.