Spectral Bias and Task-Model Alignment Explain Generalization in Kernel
Regression and Infinitely Wide Neural Networks
- URL: http://arxiv.org/abs/2006.13198v6
- Date: Fri, 4 Feb 2022 21:25:17 GMT
- Title: Spectral Bias and Task-Model Alignment Explain Generalization in Kernel
Regression and Infinitely Wide Neural Networks
- Authors: Abdulkadir Canatar, Blake Bordelon, Cengiz Pehlevan
- Abstract summary: Generalization beyond a training dataset is a main goal of machine learning.
Recent observations in deep neural networks contradict conventional wisdom from classical statistics.
We show that more data may impair generalization when noisy or not expressible by the kernel.
- Score: 17.188280334580195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generalization beyond a training dataset is a main goal of machine learning,
but theoretical understanding of generalization remains an open problem for
many models. The need for a new theory is exacerbated by recent observations in
deep neural networks where overparameterization leads to better performance,
contradicting the conventional wisdom from classical statistics. In this paper,
we investigate generalization error for kernel regression, which, besides being
a popular machine learning method, also includes infinitely overparameterized
neural networks trained with gradient descent. We use techniques from
statistical mechanics to derive an analytical expression for generalization
error applicable to any kernel or data distribution. We present applications of
our theory to real and synthetic datasets, and for many kernels including those
that arise from training deep neural networks in the infinite-width limit. We
elucidate an inductive bias of kernel regression to explain data with "simple
functions", which are identified by solving a kernel eigenfunction problem on
the data distribution. This notion of simplicity allows us to characterize
whether a kernel is compatible with a learning task, facilitating good
generalization performance from a small number of training examples. We show
that more data may impair generalization when noisy or not expressible by the
kernel, leading to non-monotonic learning curves with possibly many peaks. To
further understand these phenomena, we turn to the broad class of rotation
invariant kernels, which is relevant to training deep neural networks in the
infinite-width limit, and present a detailed mathematical analysis of them when
data is drawn from a spherically symmetric distribution and the number of input
dimensions is large.
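To make the abstract's notion of task-model alignment concrete, the following is a minimal numerical sketch (not the paper's analytical replica-theory calculation): the kernel eigenfunctions on the data distribution are approximated by diagonalizing the Gram matrix on a large sample, and a target function is projected onto those eigenmodes to see how much of its power falls on the top (high-eigenvalue) modes. The RBF kernel, the uniform data distribution, and the synthetic target below are illustrative assumptions.
```python
# Minimal numerical sketch (not the paper's analytical replica calculation):
# approximate the kernel eigenfunctions on the data distribution by
# diagonalizing the Gram matrix on a large sample, then measure how a
# target function's power is spread across those eigenmodes.
# The RBF kernel and the synthetic target are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(X, Z, length_scale=0.5):
    """Gaussian (RBF) kernel matrix between rows of X and Z."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-sq / (2 * length_scale**2))

# Sample from the data distribution (here: uniform on [-1, 1]).
M = 2000                                   # samples used to estimate the spectrum
X = rng.uniform(-1, 1, size=(M, 1))

# Diagonalizing K/M approximates the kernel's eigenvalues and eigenfunctions
# under the data distribution (an empirical / Nystrom-style estimate).
K = rbf_kernel(X, X)
eigvals, eigvecs = np.linalg.eigh(K / M)
order = np.argsort(eigvals)[::-1]          # sort modes by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# A target with a "simple" (smooth) part and a "complex" (oscillatory) part.
y = np.sin(2 * np.pi * X[:, 0]) + 0.3 * np.sin(40 * np.pi * X[:, 0])

# Project the target onto the empirical eigenmodes; the cumulative power
# curve shows how much of the task lies in the kernel's top modes --
# the task-model alignment picture from the abstract.
coeffs = eigvecs.T @ y / np.sqrt(M)
power = coeffs**2 / np.sum(coeffs**2)
cumulative = np.cumsum(power)
for k in (1, 5, 20, 100):
    print(f"fraction of target power in top {k:>3} kernel modes: {cumulative[k-1]:.3f}")
```
A target whose power concentrates in the leading modes is "simple" for this kernel and, in the theory, learnable from few samples; the oscillatory component spreads its power over low-eigenvalue modes and requires far more data.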
Related papers
- A theory of data variability in Neural Network Bayesian inference [0.70224924046445]
We provide a field-theoretic formalism which covers the generalization properties of infinitely wide networks.
We derive the generalization properties from the statistical properties of the input.
We show that data variability leads to a non-Gaussian action reminiscent of a $(\varphi^3+\varphi^4)$-theory.
arXiv Detail & Related papers (2023-07-31T14:11:32Z) - A Scalable Walsh-Hadamard Regularizer to Overcome the Low-degree
Spectral Bias of Neural Networks [79.28094304325116]
Despite the capacity of neural nets to learn arbitrary functions, models trained through gradient descent often exhibit a bias towards "simpler" functions.
We show how this spectral bias towards low-degree frequencies can in fact hurt the neural network's generalization on real-world datasets.
We propose a new scalable functional regularization scheme that aids the neural network to learn higher degree frequencies.
arXiv Detail & Related papers (2023-05-16T20:06:01Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - Neural networks trained with SGD learn distributions of increasing
complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics.
They exploit higher-order statistics only later in training.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
arXiv Detail & Related papers (2022-11-21T15:27:22Z) - Neural Networks as Kernel Learners: The Silent Alignment Effect [86.44610122423994]
Neural networks in the lazy training regime converge to kernel machines.
We show that this can indeed happen due to a phenomenon we term silent alignment.
We also demonstrate that non-whitened data can weaken the silent alignment effect.
arXiv Detail & Related papers (2021-10-29T18:22:46Z) - The Separation Capacity of Random Neural Networks [78.25060223808936]
We show that a sufficiently large two-layer ReLU network with standard Gaussian weights and uniformly distributed biases can make two classes of data linearly separable with high probability.
We quantify the relevant structure of the data in terms of a novel notion of mutual complexity.
arXiv Detail & Related papers (2021-07-31T10:25:26Z) - Out-of-Distribution Generalization in Kernel Regression [21.958028127426196]
We study generalization in kernel regression when the training and test distributions are different.
We identify an overlap matrix that quantifies the mismatch between distributions for a given kernel.
We develop procedures for optimizing training and test distributions for a given data budget to find best and worst case generalizations under the shift.
arXiv Detail & Related papers (2021-06-04T04:54:25Z) - The Neural Tangent Kernel in High Dimensions: Triple Descent and a
Multi-Scale Theory of Generalization [34.235007566913396]
Modern deep learning models employ considerably more parameters than required to fit the training data. Whereas conventional statistical wisdom suggests such models should drastically overfit, in practice these models generalize remarkably well.
An emerging paradigm for describing this unexpected behavior is in terms of a double descent curve.
We provide a precise high-dimensional analysis of generalization with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks with gradient descent.
arXiv Detail & Related papers (2020-08-15T20:55:40Z) - Generalization bound of globally optimal non-convex neural network
training: Transportation map estimation by infinite dimensional Langevin
dynamics [50.83356836818667]
We introduce a new theoretical framework to analyze deep learning optimization with connection to its generalization error.
Existing frameworks such as mean field theory and neural tangent kernel theory for neural network optimization analysis typically require taking the infinite-width limit of the network to show its global convergence.
arXiv Detail & Related papers (2020-07-11T18:19:50Z) - Generalization Error of Generalized Linear Models in High Dimensions [25.635225717360466]
We provide a framework to characterize the generalization error of generalized linear models with arbitrary non-linearities.
We analyze the effect of regularized logistic regression on learning.
Our model also captures mismatches between training and test distributions as special cases.
arXiv Detail & Related papers (2020-05-01T02:17:47Z) - Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural
Networks [17.188280334580195]
We derive analytical expressions for the generalization performance of kernel regression as a function of the number of training samples.
Our expressions apply to wide neural networks due to an equivalence between training them and kernel regression with the Neural Tangent Kernel (NTK).
We verify our theory with simulations on synthetic data and MNIST dataset.
arXiv Detail & Related papers (2020-02-07T00:03:40Z)
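As a companion to the learning-curve results in the last entry, here is a small numerical sketch of the kernel-regression side of the NTK equivalence: the closed-form infinite-width NTK of a two-layer ReLU network (a standard expression, up to normalization conventions that vary across references) is used in near-ridgeless kernel regression, and the test error is tracked as the number of training samples grows. The synthetic degree-2 target on the sphere and all parameter choices are illustrative assumptions, not the papers' exact setups.
```python
# Sketch of kernel regression with the analytic infinite-width NTK of a
# two-layer ReLU network (standard closed form; normalization conventions
# and the synthetic task are assumptions, not the papers' exact setups).
import numpy as np

rng = np.random.default_rng(1)

def relu_ntk(X, Z):
    """Infinite-width NTK of a two-layer ReLU net with both layers trained."""
    norms_x = np.linalg.norm(X, axis=1, keepdims=True)
    norms_z = np.linalg.norm(Z, axis=1, keepdims=True)
    dot = X @ Z.T
    u = np.clip(dot / (norms_x * norms_z.T), -1.0, 1.0)
    theta = np.arccos(u)
    # E[relu(w.x) relu(w.z)] and E[relu'(w.x) relu'(w.z)] for w ~ N(0, I)
    k1 = norms_x * norms_z.T * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)
    k0 = (np.pi - theta) / (2 * np.pi)
    return k1 + dot * k0

def kernel_regression(X_tr, y_tr, X_te, ridge=1e-8):
    """Near-ridgeless kernel regression, the infinite-width GD predictor."""
    K = relu_ntk(X_tr, X_tr) + ridge * np.eye(len(X_tr))
    alpha = np.linalg.solve(K, y_tr)
    return relu_ntk(X_te, X_tr) @ alpha

# Synthetic target on the sphere in d dimensions (illustrative assumption).
d = 10
target = lambda X: X[:, 0] * X[:, 1]            # a degree-2 polynomial task
X_te = rng.standard_normal((2000, d))
X_te /= np.linalg.norm(X_te, axis=1, keepdims=True)
y_te = target(X_te)

# Empirical learning curve: test error as a function of training-set size P.
for P in (10, 30, 100, 300, 1000):
    X_tr = rng.standard_normal((P, d))
    X_tr /= np.linalg.norm(X_tr, axis=1, keepdims=True)
    y_tr = target(X_tr)
    err = np.mean((kernel_regression(X_tr, y_tr, X_te) - y_te) ** 2)
    print(f"P = {P:>5}  test MSE = {err:.4f}")
```
Adding label noise to y_tr, or choosing a target outside the kernel's top eigenmodes, is the regime where the main paper predicts non-monotonic learning curves.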