On the Neural Tangent Kernel of Deep Networks with Orthogonal
Initialization
- URL: http://arxiv.org/abs/2004.05867v4
- Date: Wed, 21 Jul 2021 08:08:05 GMT
- Title: On the Neural Tangent Kernel of Deep Networks with Orthogonal
Initialization
- Authors: Wei Huang and Weitao Du and Richard Yi Da Xu
- Abstract summary: We study the dynamics of ultra-wide networks across a range of architectures, including Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs), with orthogonal initialization via the neural tangent kernel (NTK).
- Score: 18.424756271923524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The prevailing thinking is that orthogonal weights are crucial to enforcing
dynamical isometry and speeding up training. The increase in learning speed
that results from orthogonal initialization in linear networks has been
well-proven. However, while the same is believed to also hold for nonlinear
networks when the dynamical isometry condition is satisfied, the training
dynamics behind this contention have not been thoroughly explored. In this
work, we study the dynamics of ultra-wide networks across a range of
architectures, including Fully Connected Networks (FCNs) and Convolutional
Neural Networks (CNNs) with orthogonal initialization via neural tangent kernel
(NTK). Through a series of propositions and lemmas, we prove that two NTKs, one
corresponding to Gaussian weights and one to orthogonal weights, are equal when
the network width is infinite. Further, during training, the NTK of an
orthogonally-initialized infinite-width network should theoretically remain
constant. This suggests that the orthogonal initialization cannot speed up
training in the NTK (lazy training) regime, contrary to prevailing belief. To
explore under what circumstances orthogonality can accelerate training, we
conduct a thorough empirical investigation outside the NTK regime. We find that
when the hyper-parameters are set so that the nonlinear activations operate in
a linear regime, orthogonal initialization can improve the learning speed with
a large learning rate or large depth.
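The claimed equality of the two NTKs at infinite width can be probed numerically at finite width. Below is a minimal sketch, not taken from the paper: it builds a wide tanh FCN under Gaussian or orthogonal initialization and compares the empirical NTK Gram matrices computed from parameter gradients. The width, depth, and scaling choices are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions, not the paper's code): compare the
# empirical NTK of a wide tanh FCN under Gaussian vs. orthogonal initialization.
# All hidden layers are square so that orthogonal weights (gain 1) and Gaussian
# weights with std 1/sqrt(width) have matching per-unit scale.
import torch
import torch.nn as nn

WIDTH, DEPTH, N_SAMPLES = 1024, 3, 4

def make_fcn(init="gaussian"):
    layers = []
    for _ in range(DEPTH):
        lin = nn.Linear(WIDTH, WIDTH, bias=False)
        if init == "orthogonal":
            nn.init.orthogonal_(lin.weight)
        else:
            nn.init.normal_(lin.weight, std=WIDTH ** -0.5)
        layers += [lin, nn.Tanh()]
    head = nn.Linear(WIDTH, 1, bias=False)
    nn.init.normal_(head.weight, std=WIDTH ** -0.5)
    return nn.Sequential(*layers, head)

def empirical_ntk(model, xs):
    # Theta[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>
    grads = []
    for x in xs:
        model.zero_grad()
        model(x.unsqueeze(0)).squeeze().backward()
        grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]))
    G = torch.stack(grads)
    return G @ G.T

torch.manual_seed(0)
xs = torch.randn(N_SAMPLES, WIDTH)
ntk_gauss = empirical_ntk(make_fcn("gaussian"), xs)
ntk_orth = empirical_ntk(make_fcn("orthogonal"), xs)
# The relative difference should shrink as WIDTH grows, consistent with the
# claim that the two limiting NTKs coincide at infinite width.
print("relative difference:",
      ((ntk_gauss - ntk_orth).norm() / ntk_gauss.norm()).item())
```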
Related papers
- Feature Learning and Generalization in Deep Networks with Orthogonal Weights [1.7956122940209063]
Deep neural networks with weights initialized from independent Gaussian distributions can be tuned to criticality.
These networks still exhibit fluctuations that grow linearly with the depth of the network.
We show analytically that rectangular networks with tanh activations and weights from the ensemble of orthogonal matrices have corresponding preactivation fluctuations.
arXiv Detail & Related papers (2023-10-11T18:00:02Z) - Implicit regularization of deep residual networks towards neural ODEs [8.075122862553359]
We establish an implicit regularization of deep residual networks towards neural ODEs.
We prove that if the network is initialized as a discretization of a neural ODE, then such a discretization holds throughout training.
arXiv Detail & Related papers (2023-09-03T16:35:59Z) - Speed Limits for Deep Learning [67.69149326107103]
Recent advancement in thermodynamics allows bounding the speed at which one can go from the initial weight distribution to the final distribution of the fully trained network.
We provide analytical expressions for these speed limits for linear and linearizable neural networks.
Remarkably, given some plausible scaling assumptions on the NTK spectra and on the spectral decomposition of the labels, learning is optimal in a scaling sense.
arXiv Detail & Related papers (2023-07-27T06:59:46Z) - How neural networks learn to classify chaotic time series [77.34726150561087]
We study the inner workings of neural networks trained to classify regular-versus-chaotic time series.
We find that the relation between input periodicity and activation periodicity is key for the performance of LKCNN models.
arXiv Detail & Related papers (2023-06-04T08:53:27Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - A Convergence Theory Towards Practical Over-parameterized Deep Neural
Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks whose width is quadratic in the sample size and linear in the depth, at a time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z) - The Surprising Simplicity of the Early-Time Learning Dynamics of Neural
Networks [43.860358308049044]
In this work, we show that these common perceptions can be completely false in the early phase of learning.
We argue that this surprising simplicity can persist in networks with more layers and with convolutional architecture.
arXiv Detail & Related papers (2020-06-25T17:42:49Z) - A Generalized Neural Tangent Kernel Analysis for Two-layer Neural
Networks [87.23360438947114]
We show that noisy gradient descent with weight decay can still exhibit a "kernel-like" behavior.
This implies that the training loss converges linearly up to a certain accuracy.
We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay.
arXiv Detail & Related papers (2020-02-10T18:56:15Z) - Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear
Networks [39.856439772974454]
We show that the width needed for efficient convergence to a global minimum is independent of the depth.
Our results suggest an explanation for the recent empirical successes found by initializing very deep non-linear networks according to the principle of dynamical isometry.
arXiv Detail & Related papers (2020-01-16T18:48:34Z)
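To make the last entry concrete, here is a minimal, hedged sketch: it contrasts gradient descent on a deep linear network under orthogonal and Gaussian initialization. The depth, width, learning rate, and synthetic regression target are all illustrative assumptions, not taken from that paper.

```python
# Minimal sketch (illustrative assumptions): full-batch gradient descent on a
# deep linear network, comparing orthogonal vs. Gaussian initialization on a
# synthetic linear-regression target.
import torch
import torch.nn as nn

WIDTH, DEPTH = 64, 10

def deep_linear(init):
    net = nn.Sequential(*[nn.Linear(WIDTH, WIDTH, bias=False) for _ in range(DEPTH)])
    for lin in net:
        if init == "orthogonal":
            nn.init.orthogonal_(lin.weight)
        else:
            nn.init.normal_(lin.weight, std=WIDTH ** -0.5)
    return net

torch.manual_seed(0)
X = torch.randn(256, WIDTH)
W_true = torch.randn(WIDTH, WIDTH) / WIDTH ** 0.5
Y = X @ W_true.T

for init in ("orthogonal", "gaussian"):
    net = deep_linear(init)
    opt = torch.optim.SGD(net.parameters(), lr=1e-2)
    for step in range(500):
        loss = ((net(X) - Y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"{init:>10}: final training loss {loss.item():.4f}")
```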