On Random Kernels of Residual Architectures
- URL: http://arxiv.org/abs/2001.10460v4
- Date: Wed, 17 Jun 2020 14:12:19 GMT
- Title: On Random Kernels of Residual Architectures
- Authors: Etai Littwin, Tomer Galanti, Lior Wolf
- Abstract summary: We derive finite width and depth corrections for the Neural Tangent Kernel (NTK) of ResNets and DenseNets.
Our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity.
In DenseNets, however, convergence of the NTK to its limit as the width tends to infinity is guaranteed.
- Score: 93.94469470368988
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We derive finite width and depth corrections for the Neural Tangent Kernel
(NTK) of ResNets and DenseNets. Our analysis reveals that finite size residual
architectures are initialized much closer to the "kernel regime" than their
vanilla counterparts: in networks that do not use skip connections,
convergence to the NTK requires fixing the depth while increasing the layers'
width. Our findings show that in ResNets, convergence to the NTK may occur
when depth and width simultaneously tend to infinity, provided a proper
initialization is used. In DenseNets, however, convergence of the NTK to its
limit as the width tends to infinity is guaranteed, at a rate that is
independent of both the depth and scale of the weights. Our experiments
validate the theoretical results and demonstrate the advantage of deep ResNets
and DenseNets for kernel regression with random gradient features.
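The "random gradient features" used for kernel regression are, presumably, the gradients of the network output with respect to its parameters at random initialization, whose inner products give the empirical NTK. Below is a minimal JAX sketch (not the authors' code) of this procedure for a small residual MLP: it flattens each example's parameter gradient into a feature vector, builds the empirical NTK Gram matrix, and runs kernel ridge regression on toy data. The architecture sizes, the 1/sqrt(depth) branch scaling, and the jitter term are illustrative assumptions.

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

# Illustrative sizes only; the paper's experiments use different architectures.
IN_DIM, WIDTH, DEPTH = 4, 64, 8

def init_params(key):
    k_in, k_out, *k_blocks = jax.random.split(key, DEPTH + 2)
    return {
        "w_in":   jax.random.normal(k_in, (IN_DIM, WIDTH)) / jnp.sqrt(IN_DIM),
        "blocks": [jax.random.normal(k, (WIDTH, WIDTH)) / jnp.sqrt(WIDTH) for k in k_blocks],
        "w_out":  jax.random.normal(k_out, (WIDTH, 1)) / jnp.sqrt(WIDTH),
    }

def forward(params, x):
    h = x @ params["w_in"]
    for w in params["blocks"]:
        # Residual block; the 1/sqrt(DEPTH) branch scaling is one common choice,
        # not necessarily the scaling analyzed in the paper.
        h = h + jnp.tanh(h @ w) / jnp.sqrt(DEPTH)
    return (h @ params["w_out"])[0]  # scalar output

def grad_feature(params, x):
    # "Random gradient feature": gradient of the output w.r.t. all parameters
    # at random initialization, flattened into a single vector.
    return ravel_pytree(jax.grad(forward)(params, x))[0]

params = init_params(jax.random.PRNGKey(0))

# Toy regression data (a stand-in for the real datasets).
x_train = jax.random.normal(jax.random.PRNGKey(1), (20, IN_DIM))
y_train = jnp.sin(x_train.sum(axis=1))
x_test = jax.random.normal(jax.random.PRNGKey(2), (5, IN_DIM))

feats_train = jnp.stack([grad_feature(params, x) for x in x_train])
feats_test = jnp.stack([grad_feature(params, x) for x in x_test])

K = feats_train @ feats_train.T       # empirical NTK Gram matrix
K_star = feats_test @ feats_train.T   # test/train kernel block

# Kernel ridge regression with a small jitter for numerical stability.
alpha = jnp.linalg.solve(K + 1e-3 * jnp.eye(len(x_train)), y_train)
y_pred = K_star @ alpha
print(y_pred)
```

With wider and deeper residual stacks, the empirical kernel computed this way is expected to concentrate around its infinite-width limit, which is the regime the paper's corrections quantify.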
Related papers
- On the Neural Tangent Kernel of Equilibrium Models [72.29727250679477]
This work studies the neural tangent kernel (NTK) of the deep equilibrium (DEQ) model.
We show that, by contrast, a DEQ model still enjoys a deterministic NTK even as its width and depth go to infinity at the same time, under mild conditions.
arXiv Detail & Related papers (2023-10-21T16:47:18Z) - Feature Learning and Generalization in Deep Networks with Orthogonal Weights [1.7956122940209063]
Deep neural networks with weights drawn from independent Gaussian distributions can be tuned to criticality.
These networks still exhibit fluctuations that grow linearly with the depth of the network.
We show analytically that rectangular networks with tanh activations and weights drawn from the ensemble of orthogonal matrices have corresponding preactivation fluctuations.
arXiv Detail & Related papers (2023-10-11T18:00:02Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that these networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - On the Neural Tangent Kernel Analysis of Randomly Pruned Neural Networks [91.3755431537592]
We study how random pruning of the weights affects a neural network's neural tangent kernel (NTK).
In particular, this work establishes an equivalence of the NTKs between a fully-connected neural network and its randomly pruned version (a minimal numerical sketch of this comparison appears after this list).
arXiv Detail & Related papers (2022-03-27T15:22:19Z) - Neural Tangent Kernel Beyond the Infinite-Width Limit: Effects of Depth
and Initialization [3.2971341821314777]
We study the NTK of fully-connected ReLU networks with depth comparable to width.
We show that the NTK of deep networks may stay constant during training only in the ordered phase.
arXiv Detail & Related papers (2022-02-01T16:52:16Z) - A Convergence Theory Towards Practical Over-parameterized Deep Neural
Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks with widths quadratic in the sample size and linear in their depth, at a time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z) - Tight Bounds on the Smallest Eigenvalue of the Neural Tangent Kernel for
Deep ReLU Networks [21.13299067136635]
We provide tight bounds on the smallest eigenvalue of NTK matrices for deep ReLU networks.
In the finite-width setting, the network architectures we consider are quite general.
arXiv Detail & Related papers (2020-12-21T19:32:17Z) - Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z) - Doubly infinite residual neural networks: a diffusion process approach [8.642603456626393]
We show that deep ResNets do not suffer from undesirable forward-propagation properties.
We focus on doubly infinite fully-connected ResNets, for which we consider i.i.d. parameter initializations.
Our results highlight a limited expressive power of doubly infinite ResNets when the unscaled network's parameters are i.i.d. and the residual blocks are shallow.
arXiv Detail & Related papers (2020-07-07T07:45:34Z)