Rapid Feature Evolution Accelerates Learning in Neural Networks
- URL: http://arxiv.org/abs/2105.14301v1
- Date: Sat, 29 May 2021 13:50:03 GMT
- Title: Rapid Feature Evolution Accelerates Learning in Neural Networks
- Authors: Haozhe Shan and Blake Bordelon
- Abstract summary: We analyze the phenomenon of kernel alignment of the NTK with the target functions during gradient descent.
We show that feature evolution is faster and more dramatic in deeper networks.
We also found that networks with multiple output nodes develop separate, specialized kernels for each output channel.
- Score: 2.538209532048867
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural network (NN) training and generalization in the infinite-width limit
are well-characterized by kernel methods with a neural tangent kernel (NTK)
that is stationary in time. However, finite-width NNs consistently outperform
corresponding kernel methods, suggesting the importance of feature learning,
which manifests as the time evolution of NTKs. Here, we analyze the phenomenon
of kernel alignment of the NTK with the target functions during gradient
descent. We first provide a mechanistic explanation for why alignment between
task and kernel occurs in deep linear networks. We then show that this behavior
occurs more generally if one optimizes the feature map over time to accelerate
learning while constraining how quickly the features evolve. Empirically,
gradient descent undergoes a feature learning phase, during which top
eigenfunctions of the NTK quickly align with the target function and the loss
decreases faster than power law in time; it then enters a kernel gradient
descent (KGD) phase where the alignment does not improve significantly and the
training loss decreases in power law. We show that feature evolution is faster
and more dramatic in deeper networks. We also found that networks with multiple
output nodes develop separate, specialized kernels for each output channel, a
phenomenon we termed kernel specialization. We show that this class-specific
alignment does not occur in linear networks.
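The sketch below is a minimal illustration of the kernel-target alignment quantity discussed in the abstract; it is not the authors' code. The network sizes, the toy target, and the specific alignment formula A(K, y) = y^T K y / (||K||_F ||y||^2) are illustrative assumptions. It computes the empirical NTK Gram matrix of a small MLP via per-example Jacobians and measures its alignment with a target vector; tracking this number along a training trajectory is one way to observe the feature-learning phase (rapid increase in alignment) followed by the kernel gradient descent phase (roughly constant alignment).

```python
# Minimal sketch (assumed setup, not the authors' implementation):
# empirical NTK of a small MLP and its alignment with a target vector y.
import jax
import jax.numpy as jnp

def init_params(key, widths=(10, 64, 64, 1)):
    params = []
    for d_in, d_out in zip(widths[:-1], widths[1:]):
        key, sub = jax.random.split(key)
        params.append((jax.random.normal(sub, (d_in, d_out)) / jnp.sqrt(d_in),
                       jnp.zeros(d_out)))
    return params

def mlp(params, x):
    h = x
    for W, b in params[:-1]:
        h = jnp.tanh(h @ W + b)
    W, b = params[-1]
    return (h @ W + b).squeeze(-1)          # scalar output per example, shape (n,)

def empirical_ntk(params, x):
    # Jacobian of the outputs with respect to all parameters, flattened per example.
    jac = jax.jacrev(mlp)(params, x)         # pytree of arrays with leading dim n
    leaves = [j.reshape(x.shape[0], -1) for j in jax.tree_util.tree_leaves(jac)]
    J = jnp.concatenate(leaves, axis=1)      # (n, n_params)
    return J @ J.T                           # (n, n) empirical NTK Gram matrix

def alignment(K, y):
    # A(K, y) = <K, y y^T>_F / (||K||_F * ||y y^T||_F), with ||y y^T||_F = ||y||^2.
    return (y @ K @ y) / (jnp.linalg.norm(K) * (y @ y))

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 10))
y = jnp.sin(x[:, 0])                         # toy target function
params = init_params(key)
K = empirical_ntk(params, x)
print("kernel-target alignment:", alignment(K, y))
```

In practice one would recompute the alignment at checkpoints during gradient descent; for networks with several outputs, computing a separate Gram matrix per output channel gives a simple probe of the kernel specialization effect described above.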
Related papers
- Speed Limits for Deep Learning [67.69149326107103]
Recent advancement in thermodynamics allows bounding the speed at which one can go from the initial weight distribution to the final distribution of the fully trained network.
We provide analytical expressions for these speed limits for linear and linearizable neural networks.
Remarkably, given some plausible scaling assumptions on the NTK spectrum and the spectral decomposition of the labels, learning is optimal in a scaling sense.
arXiv Detail & Related papers (2023-07-27T06:59:46Z)
- Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z)
- Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z)
- Understanding Sparse Feature Updates in Deep Networks using Iterative Linearisation [2.33877878310217]
We derive an iterative linearised training method as a novel empirical tool to investigate why larger and deeper networks generalise well.
We show that in a variety of cases, iterative linearised training surprisingly performs on par with standard training.
We also show that feature learning is essential for good performance.
arXiv Detail & Related papers (2022-11-22T15:34:59Z)
- Neural Networks as Kernel Learners: The Silent Alignment Effect [86.44610122423994]
Neural networks in the lazy training regime converge to kernel machines.
We show that this can indeed happen due to a phenomenon we term silent alignment.
We also demonstrate that non-whitened data can weaken the silent alignment effect.
arXiv Detail & Related papers (2021-10-29T18:22:46Z)
- Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping [46.083745557823164]
We identify the main pathologies present in deep networks that prevent them from training fast and generalizing to unseen data.
We show how these can be avoided by carefully controlling the "shape" of the network's kernel function.
arXiv Detail & Related papers (2021-10-05T00:49:36Z)
- Scaling Neural Tangent Kernels via Sketching and Random Features [53.57615759435126]
Recent works report that NTK regression can outperform finitely-wide neural networks trained on small-scale datasets.
We design a near input-sparsity time approximation algorithm for NTK, by sketching the expansions of arc-cosine kernels.
We show that a linear regressor trained on our CNTK features matches the accuracy of exact CNTK on CIFAR-10 dataset while achieving 150x speedup.
arXiv Detail & Related papers (2021-06-15T04:44:52Z)
- Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
- Avoiding Kernel Fixed Points: Computing with ELU and GELU Infinite Networks [12.692279981822011]
We derive the covariance functions of multi-layer perceptrons with exponential linear units (ELU) and Gaussian error linear units (GELU).
We analyse the fixed-point dynamics of iterated kernels corresponding to a broad range of activation functions.
We find that unlike some previously studied neural network kernels, these new kernels exhibit non-trivial fixed-point dynamics.
arXiv Detail & Related papers (2020-02-20T01:25:39Z)