On the linearity of large non-linear models: when and why the tangent
kernel is constant
- URL: http://arxiv.org/abs/2010.01092v3
- Date: Sat, 20 Feb 2021 02:48:39 GMT
- Title: On the linearity of large non-linear models: when and why the tangent
kernel is constant
- Authors: Chaoyue Liu, Libin Zhu, Mikhail Belkin
- Abstract summary: We shed light on the remarkable phenomenon of transition to linearity of certain neural networks as their width approaches infinity.
We show that the transition to linearity of the model and, equivalently, constancy of the (neural) tangent kernel (NTK) result from the scaling properties of the norm of the Hessian matrix of the network.
- Score: 20.44438519046223
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of this work is to shed light on the remarkable phenomenon of
transition to linearity of certain neural networks as their width approaches
infinity. We show that the transition to linearity of the model and,
equivalently, constancy of the (neural) tangent kernel (NTK) result from the
scaling properties of the norm of the Hessian matrix of the network as a
function of the network width. We present a general framework for understanding
the constancy of the tangent kernel via Hessian scaling applicable to the
standard classes of neural networks. Our analysis provides a new perspective on
the phenomenon of constant tangent kernel, which is different from the widely
accepted "lazy training". Furthermore, we show that the transition to linearity
is not a general property of wide neural networks and does not hold when the
last layer of the network is non-linear. It is also not necessary for
successful optimization by gradient descent.
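The Hessian-scaling argument in the abstract can be illustrated on a toy model. The sketch below is not taken from the paper's code. It assumes a two-layer network with scalar input, f(w; x) = (1/sqrt(m)) * sum_i v_i * tanh(w_i * x), with fixed output signs v_i and trainable hidden weights w_i; the function name and parameters are hypothetical. For this model the Hessian with respect to w is diagonal, so its spectral norm is the largest absolute diagonal entry. That norm shrinks like 1/sqrt(m), while the gradient norm stays of order one. Since the change of the gradient inside a ball of radius R is at most R times the largest Hessian norm in that ball, the tangent kernel is nearly constant there when m is large.

```python
import math
import random


def hessian_and_gradient_norms(width, x=1.0, seed=0):
    """Return (Hessian spectral norm, gradient 2-norm) with respect to the
    hidden weights w of the toy model
        f(w; x) = (1 / sqrt(width)) * sum_i v_i * tanh(w_i * x).
    The model, the initialization, and all names are illustrative assumptions.
    """
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(width)]      # hidden weights
    v = [rng.choice((-1.0, 1.0)) for _ in range(width)]  # fixed output signs
    scale = 1.0 / math.sqrt(width)

    grad_sq = 0.0
    hess_max = 0.0
    for wi, vi in zip(w, v):
        t = math.tanh(wi * x)
        d1 = 1.0 - t * t    # tanh'(w_i x)
        d2 = -2.0 * t * d1  # tanh''(w_i x)
        # df/dw_i = scale * v_i * x * tanh'(w_i x)
        grad_sq += (scale * vi * x * d1) ** 2
        # The Hessian is diagonal: d^2 f / dw_i^2 = scale * v_i * x^2 * tanh''(w_i x),
        # so the spectral norm is the largest absolute diagonal entry.
        hess_max = max(hess_max, abs(scale * vi * x * x * d2))
    return hess_max, math.sqrt(grad_sq)


if __name__ == "__main__":
    for m in (10, 100, 1000, 10000):
        h, g = hessian_and_gradient_norms(m)
        print(f"width={m:6d}  hessian_norm={h:.5f}  gradient_norm={g:.5f}")
```

Because |tanh''| never exceeds 4/(3*sqrt(3)), about 0.77, the Hessian norm printed for width m is at most roughly 0.77/sqrt(m) when x = 1. The gradient norm does not decay with m.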
Related papers
- Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z)
- Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z)
- Nonlinear Advantage: Trained Networks Might Not Be As Complex as You Think [0.0]
We investigate how much we can simplify the network function towards linearity before performance collapses.
We find that after training, we are able to linearize a significant number of nonlinear units while maintaining a high performance.
Under sparsity pressure, we find that the remaining nonlinear units organize into distinct structures, forming core-networks of near constant effective depth and width.
arXiv Detail & Related papers (2022-11-30T17:24:14Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Transition to Linearity of General Neural Networks with Directed Acyclic Graph Architecture [20.44438519046223]
We show that feedforward neural networks corresponding to arbitrary directed acyclic graphs undergo transition to linearity as their "width" approaches infinity.
Our results identify the mathematical structure underlying transition to linearity and generalize a number of recent works aimed at characterizing transition to linearity or constancy of the Neural Tangent Kernel for standard architectures.
arXiv Detail & Related papers (2022-05-24T04:57:35Z)
- Exploring Linear Feature Disentanglement For Neural Networks [63.20827189693117]
Non-linear activation functions, e.g., Sigmoid, ReLU, and Tanh, have achieved great success in neural networks (NNs).
Due to the complex non-linear characteristics of samples, the objective of those activation functions is to project samples from their original feature space to a linearly separable feature space.
This phenomenon ignites our interest in exploring whether all features need to be transformed by all non-linear functions in current typical NNs.
arXiv Detail & Related papers (2022-03-22T13:09:17Z)
- Transition to Linearity of Wide Neural Networks is an Emerging Property of Assembling Weak Models [20.44438519046223]
Wide neural networks with linear output layer have been shown to be near-linear, and to have near-constant neural tangent kernel (NTK).
We show that the linearity of wide neural networks is, in fact, an emerging property of assembling a large number of diverse "weak" sub-models, none of which dominate the assembly.
arXiv Detail & Related papers (2022-03-10T01:27:01Z)
- Deep orthogonal linear networks are shallow [9.434391240650266]
We show that training the weights with Riemannian gradient descent is equivalent to training the whole factorization by gradient descent.
This means that there is no effect of overparametrization and implicit bias at all in this setting.
arXiv Detail & Related papers (2020-11-27T16:57:19Z)
- A Unifying View on Implicit Bias in Training Linear Neural Networks [31.65006970108761]
We study the implicit bias of gradient flow (i.e., gradient descent with infinitesimal step size) on linear neural network training.
We propose a tensor formulation of neural networks that includes fully-connected, diagonal, and convolutional networks as special cases.
arXiv Detail & Related papers (2020-10-06T06:08:35Z)
- Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
- A Generalized Neural Tangent Kernel Analysis for Two-layer Neural Networks [87.23360438947114]
We show that noisy gradient descent with weight decay can still exhibit "kernel-like" behavior.
This implies that the training loss converges linearly up to a certain accuracy.
We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay.
arXiv Detail & Related papers (2020-02-10T18:56:15Z)