Neural Networks as Kernel Learners: The Silent Alignment Effect
- URL: http://arxiv.org/abs/2111.00034v1
- Date: Fri, 29 Oct 2021 18:22:46 GMT
- Title: Neural Networks as Kernel Learners: The Silent Alignment Effect
- Authors: Alexander Atanasov, Blake Bordelon, Cengiz Pehlevan
- Abstract summary: Neural networks in the lazy training regime converge to kernel machines.
We show that networks in the rich, feature-learning regime can also learn a kernel machine with a data-dependent kernel, through a phenomenon we term silent alignment.
We also demonstrate that non-whitened data can weaken the silent alignment effect.
- Score: 86.44610122423994
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural networks in the lazy training regime converge to kernel machines. Can
neural networks in the rich feature learning regime learn a kernel machine with
a data-dependent kernel? We demonstrate that this can indeed happen due to a
phenomenon we term silent alignment, which requires that the tangent kernel of
a network evolves in eigenstructure while small and before the loss appreciably
decreases, and grows only in overall scale afterwards. We show that such an
effect takes place in homogeneous neural networks with small initialization and
whitened data. We provide an analytical treatment of this effect in the linear
network case. In general, we find that the kernel develops a low-rank
contribution in the early phase of training, and then evolves in overall scale,
yielding a function equivalent to a kernel regression solution with the final
network's tangent kernel. The early spectral learning of the kernel depends on
both depth and on relative learning rates in each layer. We also demonstrate
that non-whitened data can weaken the silent alignment effect.
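To make the mechanism concrete, below is a minimal, self-contained sketch (not the authors' code) of the simplest setting the abstract mentions: a two-layer linear network with small initialization trained on whitened data. It tracks the tangent kernel's alignment with the targets and its overall scale during training, then checks that the trained network's predictions agree with kernel regression under the final tangent kernel. The width, learning rate, and the alignment metric y^T K y / (||K||_F ||y||^2) are illustrative assumptions rather than values taken from the paper.

```python
# Illustrative sketch of the silent alignment effect in a two-layer linear
# network (assumed setting: MSE loss, whitened inputs, small initialization;
# hyperparameter values are ours, not the paper's).
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 64, 16, 128            # samples, input dimension, hidden width
sigma = 1e-3                     # small initialization scale
lr, steps = 0.05, 4000

# Whitened inputs (empirical input covariance = identity) and a linear target.
X = rng.standard_normal((n, d))
X = X @ np.linalg.inv(np.linalg.cholesky(X.T @ X / n)).T
w_star = rng.standard_normal(d)
y = X @ w_star

W1 = sigma * rng.standard_normal((h, d))
w2 = sigma * rng.standard_normal(h)

def ntk(A, B, W1, w2):
    # Tangent kernel of f(x) = w2 . (W1 x):  K(x, x') = ||w2||^2 x.x' + (W1 x).(W1 x')
    return (w2 @ w2) * (A @ B.T) + A @ W1.T @ (W1 @ B.T)

def alignment(K, y):
    # Kernel-target alignment: y^T K y / (||K||_F ||y||^2)
    return (y @ K @ y) / (np.linalg.norm(K) * (y @ y))

for t in range(steps):
    f = X @ W1.T @ w2                              # network outputs
    if t in (0, 10, 30, 100, 300, 1000, 3000):
        K = ntk(X, X, W1, w2)
        print(f"step {t:4d}  loss {0.5 * np.mean((f - y) ** 2):.3e}  "
              f"alignment {alignment(K, y):.3f}  ||K||_F {np.linalg.norm(K):.3e}")
    g = (f - y) / n                                # gradient of 0.5 * mean((f - y)^2) w.r.t. f
    grad_W1 = np.outer(w2, X.T @ g)                # dL/dW1
    grad_w2 = W1 @ (X.T @ g)                       # dL/dw2
    W1 -= lr * grad_W1
    w2 -= lr * grad_w2

# Silent alignment prediction: the trained network matches kernel regression
# with the *final* tangent kernel (small jitter added for numerical stability).
K_train = ntk(X, X, W1, w2)
X_test = rng.standard_normal((8, d))
f_net = X_test @ W1.T @ w2
f_krr = ntk(X_test, X, W1, w2) @ np.linalg.solve(K_train + 1e-6 * np.eye(n), y)
print("max |network - kernel regression|:", np.max(np.abs(f_net - f_krr)))
```

With small initialization, the printed alignment approaches its final value while the loss is still close to its starting value, and the kernel then grows mainly in overall scale; the final line should report close agreement between the network and kernel regression. Dropping the whitening step is a simple way to probe how non-whitened data weakens the effect, as noted in the abstract.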
Related papers
- Mean-field Analysis on Two-layer Neural Networks from a Kernel Perspective [40.69646918673903]
We show that two-layer neural networks can learn a union of multiple reproducing kernel Hilbert spaces more efficiently than any kernel method.
We also develop a label-noise procedure that converges to the global optimum, and show that the degrees of freedom appear as an implicit regularizer.
arXiv Detail & Related papers (2024-03-22T02:41:57Z)
- Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z)
- Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that these neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z)
- Rapid Feature Evolution Accelerates Learning in Neural Networks [2.538209532048867]
We analyze the phenomenon of kernel alignment of the NTK with the target functions during gradient descent.
We show that feature evolution is faster and more dramatic in deeper networks.
We also find that networks with multiple output nodes develop separate, specialized kernels for each output channel (a per-channel alignment measurement is sketched below).
arXiv Detail & Related papers (2021-05-29T13:50:03Z)
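The entry above tracks how well the NTK aligns with the target functions, including separate kernels for each output channel. The following sketch shows one way such a per-channel measurement can be set up for a one-hidden-layer ReLU network with several outputs; the architecture, the closed-form per-channel tangent kernel, and the alignment metric are our own illustrative assumptions, not that paper's experimental protocol.

```python
# Hedged illustration: per-output-channel tangent kernels and their alignment
# with each channel's targets for a one-hidden-layer ReLU network
# f(x) = W2 relu(W1 x). Architecture and metric are assumptions made here.
import numpy as np

rng = np.random.default_rng(1)
n, d, h, c = 128, 10, 256, 3          # samples, input dim, width, output channels
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, c))       # placeholder multi-channel targets
W1 = rng.standard_normal((h, d)) / np.sqrt(d)
W2 = rng.standard_normal((c, h)) / np.sqrt(h)

def per_channel_ntk(X, W1, W2, k):
    """Empirical tangent kernel of output channel k, in closed form."""
    H = X @ W1.T                      # pre-activations, shape (n, h)
    A = np.maximum(H, 0.0)            # ReLU activations
    M = (H > 0).astype(float)         # activation pattern
    # Readout-weight contribution: <relu(W1 x), relu(W1 x')>
    K_readout = A @ A.T
    # Hidden-weight contribution: (x . x') * sum_j W2[k, j]^2 m_j(x) m_j(x')
    K_hidden = (X @ X.T) * ((M * W2[k] ** 2) @ M.T)
    return K_readout + K_hidden

def alignment(K, y):
    # Kernel-target alignment: y^T K y / (||K||_F ||y||^2)
    return (y @ K @ y) / (np.linalg.norm(K) * (y @ y))

for k in range(c):
    K_k = per_channel_ntk(X, W1, W2, k)
    print(f"channel {k}: alignment with its own targets = {alignment(K_k, Y[:, k]):.3f}")
```

The targets here are placeholders, so the printed numbers only exercise the computation; tracking these per-channel alignments over the course of training is the kind of measurement the entry refers to.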
- Kernelized Classification in Deep Networks [49.47339560731506]
We propose a kernelized classification layer for deep networks.
We advocate a nonlinear classification layer, obtained by applying the kernel trick to the softmax cross-entropy loss function during training.
We show the usefulness of the proposed nonlinear classification layer on several datasets and tasks (a generic sketch of such a layer follows below).
arXiv Detail & Related papers (2020-12-08T21:43:19Z)
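As a rough illustration of the kernelized classification layer described above, the sketch below computes class logits as RBF-kernel similarities between penultimate features and learned per-class anchor points, then evaluates a standard softmax cross-entropy. The RBF kernel, the anchor parameterization, and all shapes are assumptions made here for concreteness; the paper itself applies the kernel trick within the softmax cross-entropy loss and may differ in detail.

```python
# Hedged sketch of a kernelized classification head (not the paper's exact method):
# logits are RBF-kernel similarities between penultimate features and learned
# per-class anchors, fed into a standard softmax cross-entropy loss.
import numpy as np

rng = np.random.default_rng(2)
n, p, num_classes, m = 32, 64, 10, 5       # batch, feature dim, classes, anchors per class
gamma = 0.5                                # RBF bandwidth (assumed)

Z = rng.standard_normal((n, p))            # penultimate features from some backbone
labels = rng.integers(0, num_classes, n)   # placeholder labels
anchors = rng.standard_normal((num_classes, m, p))   # learnable anchor points
coef = rng.standard_normal((num_classes, m)) * 0.1   # learnable mixing weights

def rbf(Z, A, gamma):
    # Pairwise RBF kernel between rows of Z (n, p) and rows of A (m, p).
    sq = ((Z[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Logit for class c: sum_j coef[c, j] * k(z, anchors[c, j])
logits = np.stack([rbf(Z, anchors[c], gamma) @ coef[c] for c in range(num_classes)], axis=1)

# Softmax cross-entropy (numerically stabilized).
logits -= logits.max(axis=1, keepdims=True)
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(n), labels].mean()
print("cross-entropy with kernelized logits:", loss)
```

In a real model, the anchors and mixing coefficients would be trained jointly with the backbone by backpropagation; the forward pass above is only meant to make the shape of such a layer concrete.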
- Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
- The Surprising Simplicity of the Early-Time Learning Dynamics of Neural Networks [43.860358308049044]
In this work, we show that these common perceptions can be completely false in the early phase of learning.
We argue that this surprising simplicity can persist in deeper networks and in networks with convolutional architecture.
arXiv Detail & Related papers (2020-06-25T17:42:49Z)
- Spectral Bias and Task-Model Alignment Explain Generalization in Kernel Regression and Infinitely Wide Neural Networks [17.188280334580195]
Generalization beyond a training dataset is a main goal of machine learning.
Recent observations in deep neural networks contradict conventional wisdom from classical statistics.
We show that more data may impair generalization when the targets are noisy or not expressible by the kernel.
arXiv Detail & Related papers (2020-06-23T17:53:11Z)
- A Generalized Neural Tangent Kernel Analysis for Two-layer Neural Networks [87.23360438947114]
We show that noisy gradient descent with weight decay can still exhibit "kernel-like" behavior.
This implies that the training loss converges linearly up to a certain accuracy.
We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay.
arXiv Detail & Related papers (2020-02-10T18:56:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.