Dynamics of Finite Width Kernel and Prediction Fluctuations in Mean
Field Neural Networks
- URL: http://arxiv.org/abs/2304.03408v3
- Date: Tue, 7 Nov 2023 12:15:57 GMT
- Title: Dynamics of Finite Width Kernel and Prediction Fluctuations in Mean
Field Neural Networks
- Authors: Blake Bordelon, Cengiz Pehlevan
- Abstract summary: We analyze the dynamics of finite width effects in wide but finite feature learning neural networks.
Our results are non-perturbative in the strength of feature learning.
- Score: 47.73646927060476
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We analyze the dynamics of finite width effects in wide but finite feature
learning neural networks. Starting from a dynamical mean field theory
description of infinite width deep neural network kernel and prediction
dynamics, we provide a characterization of the $O(1/\sqrt{\text{width}})$
fluctuations of the DMFT order parameters over random initializations of the
network weights. Our results, while perturbative in width, unlike prior
analyses, are non-perturbative in the strength of feature learning. In the lazy
limit of network training, all kernels are random but static in time and the
prediction variance has a universal form. However, in the rich, feature
learning regime, the fluctuations of the kernels and predictions are
dynamically coupled with a variance that can be computed self-consistently. In
two layer networks, we show how feature learning can dynamically reduce the
variance of the final tangent kernel and final network predictions. We also
show how initialization variance can slow down online learning in wide but
finite networks. In deeper networks, kernel variance can dramatically
accumulate through subsequent layers at large feature learning strengths, but
feature learning continues to improve the signal-to-noise ratio of the feature
kernels. In discrete time, we demonstrate that large learning rate phenomena
such as edge of stability effects can be well captured by infinite width
dynamics and that initialization variance can decrease dynamically. For CNNs
trained on CIFAR-10, we empirically find significant corrections to both the
bias and variance of network dynamics due to finite width.
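As a rough illustration of the $O(1/\sqrt{\text{width}})$ initialization fluctuations described in the abstract, the NumPy sketch below (not code from the paper; the two-layer tanh network, NTK-style parameterization, widths, and seed counts are illustrative assumptions) estimates the seed-to-seed standard deviation of a single empirical NTK entry at several widths. If the fluctuations scale as $1/\sqrt{N}$, the rescaled column std*sqrt(N) should stay roughly constant across widths.
```python
# Minimal sketch: seed-to-seed fluctuations of one empirical NTK entry at
# initialization for a two-layer tanh network, f(x) = a . tanh(W x) / sqrt(N).
# All choices (activation, parameterization, widths) are illustrative
# assumptions, not the paper's exact DMFT setup.
import numpy as np

def ntk_entry(x1, x2, W, a, N):
    """Empirical NTK K(x1, x2) = <df/dtheta(x1), df/dtheta(x2)>."""
    h1, h2 = W @ x1, W @ x2                        # preactivations, shape (N,)
    phi1, phi2 = np.tanh(h1), np.tanh(h2)
    dphi1, dphi2 = 1.0 - phi1**2, 1.0 - phi2**2    # tanh'(h)
    k_readout = phi1 @ phi2 / N                    # gradient w.r.t. readout a
    k_hidden = (x1 @ x2) * np.sum(a**2 * dphi1 * dphi2) / N  # gradient w.r.t. W
    return k_readout + k_hidden

d = 10
data_rng = np.random.default_rng(0)
x1 = data_rng.normal(size=d) / np.sqrt(d)
x2 = data_rng.normal(size=d) / np.sqrt(d)

n_seeds = 200
for N in (64, 256, 1024, 4096):
    vals = []
    for s in range(n_seeds):
        rng = np.random.default_rng(s)
        W = rng.normal(size=(N, d))                # input weights ~ N(0, 1)
        a = rng.normal(size=N)                     # readout weights ~ N(0, 1)
        vals.append(ntk_entry(x1, x2, W, a, N))
    vals = np.array(vals)
    print(f"N={N:5d}  mean K={vals.mean():.4f}  std K={vals.std():.4f}  "
          f"std*sqrt(N)={vals.std() * np.sqrt(N):.3f}")
```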
Related papers
- Feature Learning and Generalization in Deep Networks with Orthogonal Weights [1.7956122940209063]
Deep neural networks with weights initialized from independent Gaussian distributions can be tuned to criticality.
These networks still exhibit fluctuations that grow linearly with the depth of the network.
We show analytically that rectangular networks with tanh activations and weights initialized from the ensemble of orthogonal matrices have corresponding preactivation fluctuations that are independent of depth.
arXiv Detail & Related papers (2023-10-11T18:00:02Z) - Deep Neural Networks Tend To Extrapolate Predictably [51.303814412294514]
Conventional wisdom suggests that neural network predictions tend to be unpredictable and overconfident when faced with out-of-distribution (OOD) inputs.
We observe that neural network predictions often tend towards a constant value as input data becomes increasingly OOD.
We show how one can leverage our insights in practice to enable risk-sensitive decision-making in the presence of OOD inputs.
arXiv Detail & Related papers (2023-10-02T03:25:32Z) - Speed Limits for Deep Learning [67.69149326107103]
Recent advances in thermodynamics allow bounding the speed at which one can go from the initial weight distribution to the final distribution of the fully trained network.
We provide analytical expressions for these speed limits for linear and linearizable neural networks.
Remarkably, given some plausible scaling assumptions on the NTK spectrum and the spectral decomposition of the labels, learning is optimal in a scaling sense.
arXiv Detail & Related papers (2023-07-27T06:59:46Z) - Feature-Learning Networks Are Consistent Across Widths At Realistic
Scales [72.27228085606147]
We study the effect of width on the dynamics of feature-learning neural networks across a variety of architectures and datasets.
Early in training, wide neural networks trained on online data not only have identical loss curves but also agree in their pointwise test predictions throughout training.
We observe, however, that ensembles of narrower networks perform worse than a single wide network.
arXiv Detail & Related papers (2023-05-28T17:09:32Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - The Influence of Learning Rule on Representation Dynamics in Wide Neural
Networks [18.27510863075184]
We analyze infinite-width deep networks trained with feedback alignment (FA), direct feedback alignment (DFA), and error-modulated Hebbian learning (Hebb).
We show that, for each of these learning rules, the evolution of the output function at infinite width is governed by a time-varying effective neural tangent kernel (eNTK).
In the lazy training limit, this eNTK is static and does not evolve, while in the rich mean-field regime the kernel's evolution can be determined self-consistently with dynamical mean field theory (DMFT); a toy numerical contrast of these two regimes is sketched after this list.
arXiv Detail & Related papers (2022-10-05T11:33:40Z) - Training Integrable Parameterizations of Deep Neural Networks in the
Infinite-Width Limit [0.0]
Large-width dynamics has emerged as a fruitful viewpoint and led to practical insights on real-world deep networks.
For two-layer neural networks, it has been understood that the nature of the trained model radically changes depending on the scale of the initial random weights.
We propose various methods to avoid this trivial behavior and analyze in detail the resulting dynamics.
arXiv Detail & Related papers (2021-10-29T07:53:35Z) - Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
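As referenced in the entry on learning rules above, the following sketch gives a small numerical contrast between lazy (static kernel) and rich (feature learning) regimes. It is not code from any of the papers listed: it uses a two-layer tanh network and a Chizat-Bach-style output scale alpha with learning rate eta/alpha^2 as a stand-in for the feature-learning-strength parameter. Large alpha should leave the empirical NTK nearly static during training, while small alpha lets it move appreciably.
```python
# Minimal sketch: how much the empirical NTK of a centered two-layer tanh
# network moves during gradient descent, as a function of an output scale
# alpha (large alpha ~ lazy regime, small alpha ~ rich regime). Illustrative
# stand-in, not the exact parameterization of the papers above.
import numpy as np

def forward(W, a, X):
    # f(x) = a . tanh(W x) / sqrt(N)
    return np.tanh(X @ W.T) @ a / np.sqrt(W.shape[0])

def ntk_matrix(W, a, X):
    # Empirical NTK on the training inputs X (shape P x d).
    N = W.shape[0]
    Phi = np.tanh(X @ W.T)                         # (P, N) hidden activations
    D = 1.0 - Phi**2                               # tanh'(preactivations)
    K_readout = Phi @ Phi.T / N
    K_hidden = (X @ X.T) * ((D * a) @ (D * a).T) / N
    return K_readout + K_hidden

def relative_kernel_change(alpha, N=512, steps=600, eta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    d, P = 5, 8
    X = rng.normal(size=(P, d)) / np.sqrt(d)
    y = np.sin(X @ rng.normal(size=d))             # arbitrary smooth target
    W = rng.normal(size=(N, d))
    a = rng.normal(size=N)
    f0 = forward(W, a, X)                          # frozen init output (centering)
    K_init = ntk_matrix(W, a, X)
    for _ in range(steps):
        err = alpha * (forward(W, a, X) - f0) - y
        Phi = np.tanh(X @ W.T)
        D = 1.0 - Phi**2
        # gradients of 0.5 * ||alpha * (f - f0) - y||^2
        grad_a = alpha * Phi.T @ err / np.sqrt(N)
        grad_W = alpha * ((D * a).T * err) @ X / np.sqrt(N)
        # learning rate scaled by 1/alpha^2 so the function-space step
        # size is comparable across alphas
        a -= (eta / alpha**2) * grad_a
        W -= (eta / alpha**2) * grad_W
    K_final = ntk_matrix(W, a, X)
    return np.linalg.norm(K_final - K_init) / np.linalg.norm(K_init)

for alpha in (20.0, 1.0, 0.1):
    print(f"alpha={alpha:5.1f}  relative NTK change = "
          f"{relative_kernel_change(alpha):.4f}")
```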
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.