Feature-Learning Networks Are Consistent Across Widths At Realistic
Scales
- URL: http://arxiv.org/abs/2305.18411v2
- Date: Wed, 6 Dec 2023 01:45:02 GMT
- Title: Feature-Learning Networks Are Consistent Across Widths At Realistic
Scales
- Authors: Nikhil Vyas, Alexander Atanasov, Blake Bordelon, Depen Morwani,
Sabarish Sainathan, Cengiz Pehlevan
- Abstract summary: We study the effect of width on the dynamics of feature-learning neural networks across a variety of architectures and datasets.
Early in training, wide neural networks trained on online data not only have identical loss curves but also agree in their pointwise test predictions throughout training.
We observe, however, that ensembles of narrower networks perform worse than a single wide network.
- Score: 72.27228085606147
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the effect of width on the dynamics of feature-learning neural
networks across a variety of architectures and datasets. Early in training,
wide neural networks trained on online data not only have identical loss
curves but also agree in their pointwise test predictions throughout training.
For simple tasks such as CIFAR-5m, this holds throughout training for networks of
realistic widths. We also show that structural properties of the models,
including internal representations, preactivation distributions, edge of
stability phenomena, and large learning rate effects are consistent across
large widths. This motivates the hypothesis that phenomena seen in realistic
models can be captured by infinite-width, feature-learning limits. For harder
tasks (such as ImageNet and language modeling), and later training times,
finite-width deviations grow systematically. Two distinct effects cause these
deviations across widths. First, the network output has
initialization-dependent variance scaling inversely with width, which can be
removed by ensembling networks. We observe, however, that ensembles of narrower
networks perform worse than a single wide network. We call this the bias of
narrower width. We conclude with a spectral perspective on the origin of this
finite-width bias.
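One hedged way to formalize the two effects described above (a reading of the abstract, not a formula quoted from the paper): for a width-$N$ network $f_N$ and an infinite-width, feature-learning limit $f_\infty$, decomposing the prediction gap over initialization seeds gives
$\mathbb{E}_{\mathrm{seed}}\big[(f_N(x) - f_\infty(x))^2\big] = \mathrm{Var}_{\mathrm{seed}}\big[f_N(x)\big] + \big(\mathbb{E}_{\mathrm{seed}}[f_N(x)] - f_\infty(x)\big)^2 .$
Ensembling over seeds suppresses the first (variance) term, which the abstract says scales inversely with width, while the second term corresponds to what the paper calls the bias of narrower width.
The sketch below illustrates this seed-ensembling comparison; the synthetic teacher task, the MLP architectures, the widths (64 vs. 512), and all hyperparameters are placeholder assumptions for illustration only, not the paper's CIFAR-5m or ImageNet setups.
```python
# Minimal sketch, assuming a synthetic teacher task and tiny MLPs trained with SGD;
# widths, step counts, and learning rate are placeholder choices, not the paper's setup.
import torch
import torch.nn as nn

D_IN = 32
torch.manual_seed(1234)
# Frozen random teacher network that defines the regression target.
teacher = nn.Sequential(nn.Linear(D_IN, 64), nn.Tanh(), nn.Linear(64, 1))
for p in teacher.parameters():
    p.requires_grad_(False)

def make_mlp(width, seed):
    # Different seeds give different initializations at the same width.
    torch.manual_seed(seed)
    return nn.Sequential(nn.Linear(D_IN, width), nn.ReLU(), nn.Linear(width, 1))

def train(model, steps=2000, batch=128, lr=1e-2):
    # "Online data": a fresh batch from the teacher at every step, so no data reuse.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        x = torch.randn(batch, D_IN)
        y = teacher(x)
        loss = ((model(x) - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

x_test = torch.randn(4096, D_IN)
y_test = teacher(x_test)

def test_mse(pred):
    return ((pred - y_test) ** 2).mean().item()

# Ensembling narrow networks over seeds averages away the
# initialization-dependent variance in their test predictions...
narrow_preds = []
for seed in range(8):
    net = train(make_mlp(width=64, seed=seed))
    with torch.no_grad():
        narrow_preds.append(net(x_test))
ensemble_pred = torch.stack(narrow_preds).mean(dim=0)

# ...but not the width-dependent bias: compare against a single wide network.
wide = train(make_mlp(width=512, seed=0))
with torch.no_grad():
    wide_pred = wide(x_test)

print("single narrow net MSE:", test_mse(narrow_preds[0]))
print("narrow ensemble   MSE:", test_mse(ensemble_pred))
print("single wide net   MSE:", test_mse(wide_pred))
```
With settings like these, the narrow ensemble typically improves on a single narrow network (the seed-to-seed variance is averaged away) yet can still trail the single wide network, mirroring the bias of narrower width described in the abstract.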
Related papers
- On the Diminishing Returns of Width for Continual Learning [2.9301925522760524]
We analyze continual learning theory to prove that width is directly related to forgetting in feed-forward networks (FFNs).
Specifically, we demonstrate that increasing network widths to reduce forgetting yields diminishing returns.
arXiv Detail & Related papers (2024-03-11T03:19:45Z) - Why do Learning Rates Transfer? Reconciling Optimization and Scaling
Limits for Deep Learning [77.82908213345864]
We find empirical evidence that learning rate transfer can be attributed to the fact that under $mu$P and its depth extension, the largest eigenvalue of the training loss Hessian is largely independent of the width and depth of the network.
We show that under the neural tangent kernel (NTK) regime, the sharpness exhibits very different dynamics at different scales, thus preventing learning rate transfer.
arXiv Detail & Related papers (2024-02-27T12:28:01Z) - Dynamics of Finite Width Kernel and Prediction Fluctuations in Mean
Field Neural Networks [47.73646927060476]
We analyze the dynamics of finite width effects in wide but finite feature learning neural networks.
Our results are non-perturbative in the strength of feature learning.
arXiv Detail & Related papers (2023-04-06T23:11:49Z) - Bayesian Interpolation with Deep Linear Networks [92.1721532941863]
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory.
We show that linear networks make provably optimal predictions at infinite depth.
We also show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth.
arXiv Detail & Related papers (2022-12-29T20:57:46Z) - Globally Gated Deep Linear Networks [3.04585143845864]
We introduce Globally Gated Deep Linear Networks (GGDLNs) where gating units are shared among all processing units in each layer.
We derive exact equations for the generalization properties in these networks in the finite-width thermodynamic limit.
Our work is the first exact theoretical solution of learning in a family of nonlinear networks with finite width.
arXiv Detail & Related papers (2022-10-31T16:21:56Z) - Redundant representations help generalization in wide neural networks [71.38860635025907]
We study the last hidden layer representations of various state-of-the-art convolutional neural networks.
We find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise.
arXiv Detail & Related papers (2021-06-07T10:18:54Z) - Asymptotics of Wide Convolutional Neural Networks [18.198962344790377]
We study scaling laws for wide CNNs and networks with skip connections.
We find that the difference in performance between finite and infinite width models vanishes at a definite rate with respect to model width.
arXiv Detail & Related papers (2020-08-19T21:22:19Z) - The large learning rate phase of deep learning: the catapult mechanism [50.23041928811575]
We present a class of neural networks with solvable training dynamics.
We find good agreement between our model's predictions and training dynamics in realistic deep learning settings.
We believe our results shed light on characteristics of models trained at different learning rates.
arXiv Detail & Related papers (2020-03-04T17:52:48Z)