Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks
- URL: http://arxiv.org/abs/2310.02244v5
- Date: Thu, 12 Oct 2023 17:50:31 GMT
- Title: Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks
- Authors: Greg Yang, Dingli Yu, Chen Zhu, Soufiane Hayou
- Abstract summary: We investigate the analogous classification for *depthwise parametrizations* of deep residual networks (resnets).
In resnets where each block has only one layer, we identify a unique optimal parametrization, called Depth-$\mu$P.
We find that Depth-$\mu$P can be characterized as maximizing both feature learning and feature diversity.
- Score: 42.14352997147652
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: By classifying infinite-width neural networks and identifying the *optimal*
limit, Tensor Programs IV and V demonstrated a universal way, called $\mu$P,
for *widthwise hyperparameter transfer*, i.e., predicting optimal
hyperparameters of wide neural networks from narrow ones. Here we investigate
the analogous classification for *depthwise parametrizations* of deep residual
networks (resnets). We classify depthwise parametrizations of block multiplier
and learning rate by their infinite-width-then-depth limits. In resnets where
each block has only one layer, we identify a unique optimal parametrization,
called Depth-$\mu$P, that extends $\mu$P, and we show empirically that it admits
depthwise hyperparameter transfer. We identify *feature diversity* as a crucial
factor in deep networks, and Depth-$\mu$P can be characterized as maximizing
both feature learning and feature diversity. Exploiting this, we find that
absolute value, among all homogeneous nonlinearities, maximizes feature
diversity and indeed empirically leads to significantly better performance.
However, if each block is deeper (such as modern transformers), then we find
fundamental limitations in all possible infinite-depth limits of such
parametrizations, which we illustrate both theoretically and empirically on
simple networks as well as a Megatron transformer trained on Common Crawl.
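To make the depthwise parametrization concrete, here is a minimal sketch of a resnet whose blocks each contain a single linear layer and whose residual branch carries a block multiplier of order $1/\sqrt{L}$, with the absolute value as the homogeneous nonlinearity that the abstract reports maximizes feature diversity. This is an illustrative sketch under stated assumptions, not the paper's code: the class name is invented, the placement of the nonlinearity is one of several reasonable choices, and the paper's depthwise learning-rate scaling is not reproduced here.

```python
# Minimal sketch (not the paper's reference code) of a depthwise-parametrized
# resnet in the spirit of Depth-muP: one linear layer per block, residual branch
# scaled by a block multiplier a / sqrt(L), absolute value as the nonlinearity.
import math

import torch
import torch.nn as nn


class OneLayerBlockResNet(nn.Module):
    """Hypothetical illustration: x_{l+1} = x_l + (a / sqrt(L)) * W_l |x_l|."""

    def __init__(self, width: int, depth: int, block_mult: float = 1.0):
        super().__init__()
        # Block multiplier a / sqrt(L): with zero-mean weights, the L block
        # contributions have random signs, so the residual stream stays O(1)
        # in depth at initialization instead of blowing up or vanishing.
        self.scale = block_mult / math.sqrt(depth)
        self.blocks = nn.ModuleList(
            [nn.Linear(width, width, bias=False) for _ in range(depth)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Absolute value as the block nonlinearity, following the abstract's
            # claim that it maximizes feature diversity among homogeneous ones.
            x = x + self.scale * block(torch.abs(x))
        return x


if __name__ == "__main__":
    torch.manual_seed(0)
    model = OneLayerBlockResNet(width=256, depth=64)
    x = torch.randn(8, 256)
    print(model(x).shape)  # torch.Size([8, 256])
```

Under Depth-$\mu$P, `block_mult` and the base learning rate are the quantities one would tune at small depth and reuse at larger depth; the paper's classification specifies how the learning rate itself must scale with $L$ for a given optimizer, which this sketch does not attempt to reproduce.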
Related papers
- Local Loss Optimization in the Infinite Width: Stable Parameterization of Predictive Coding Networks and Target Propagation [8.35644084613785]
We introduce the maximal update parameterization ($\mu$P) in the infinite-width limit for two representative designs of local targets.
By analyzing deep linear networks, we found that PC's gradients interpolate between first-order and Gauss-Newton-like gradients.
We demonstrate that, in specific standard settings, PC in the infinite-width limit behaves more similarly to the first-order gradient.
arXiv Detail & Related papers (2024-11-04T11:38:27Z)
- Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning [77.82908213345864]
We find empirical evidence that learning rate transfer can be attributed to the fact that under $\mu$P and its depth extension, the largest eigenvalue of the training loss Hessian is largely independent of the width and depth of the network.
We show that under the neural tangent kernel (NTK) regime, the sharpness exhibits very different dynamics at different scales, thus preventing learning rate transfer.
arXiv Detail & Related papers (2024-02-27T12:28:01Z)
- Principled Architecture-aware Scaling of Hyperparameters [69.98414153320894]
Training a high-quality deep neural network requires choosing suitable hyperparameters, which is a non-trivial and expensive process.
In this work, we precisely characterize the dependence of initializations and maximal learning rates on the network architecture.
We demonstrate that network rankings in benchmarks can easily change once the networks are trained better using architecture-aware hyperparameters.
arXiv Detail & Related papers (2024-02-27T11:52:49Z)
- Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit [48.291961660957384]
We provide experiments demonstrating that residual architectures, including convolutional ResNets and Vision Transformers, exhibit transfer of optimal hyperparameters across width and depth.
Using recent developments in the dynamical mean field theory (DMFT) description of neural network learning dynamics, we show that this parameterization of ResNets admits a well-defined feature learning joint infinite-width and infinite-depth limit.
arXiv Detail & Related papers (2023-09-28T17:20:50Z)
- Bayesian Interpolation with Deep Linear Networks [92.1721532941863]
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory.
We show that linear networks make provably optimal predictions at infinite depth.
We also show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth.
arXiv Detail & Related papers (2022-12-29T20:57:46Z)
- ReduNet: A White-box Deep Network from the Principle of Maximizing Rate Reduction [32.489371527159236]
This work provides a plausible theoretical framework that aims to interpret modern deep (convolutional) networks from the principles of data compression and discriminative representation.
We show that for high-dimensional multi-class data, the optimal linear discriminative representation maximizes the coding rate difference between the whole dataset and the average of all the subsets.
We show that the basic iterative gradient ascent scheme for optimizing the rate reduction objective naturally leads to a multi-layer deep network, named ReduNet, that shares common characteristics of modern deep networks.
arXiv Detail & Related papers (2021-05-21T16:29:57Z)
- Feature Learning in Infinite-Width Neural Networks [17.309380337367536]
We show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features.
We propose simple modifications to the standard parametrization to allow for feature learning in the limit.
arXiv Detail & Related papers (2020-11-30T03:21:05Z)
- Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
- WaveQ: Gradient-Based Deep Quantization of Neural Networks through Sinusoidal Adaptive Regularization [8.153944203144988]
We propose a novel sinusoidal regularization, called SINAREQ, for deep quantized training.
We show how SINAREQ balances compute efficiency and accuracy, and provides a heterogeneous bitwidth assignment for quantization of a large variety of deep networks.
arXiv Detail & Related papers (2020-02-29T01:19:55Z)
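The sinusoidal-regularization idea in the WaveQ entry above can be illustrated with a minimal sketch: a differentiable $\sin^2$ penalty that vanishes exactly on a uniform quantization grid, so adding it to the task loss pulls weights toward quantized values during ordinary gradient training. This is a hedged sketch of the general idea only, not the WaveQ/SINAREQ reference implementation; the function name, grid spacing, and penalty weight below are hypothetical choices, and the adaptive per-layer period that the paper uses to arrive at heterogeneous bitwidths is not reproduced.

```python
# Minimal sketch of a sinusoidal quantization regularizer (an illustration of
# the general idea, not the WaveQ/SINAREQ reference code): the penalty is zero
# exactly when every weight is an integer multiple of the quantization step.
import math

import torch


def sinusoidal_quant_penalty(weights: torch.Tensor, step: float) -> torch.Tensor:
    # sin(pi * w / step) == 0 iff w is an integer multiple of `step`, so this
    # term is differentiable everywhere and vanishes on the quantization grid.
    return torch.sin(math.pi * weights / step).pow(2).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    w = torch.randn(1000, requires_grad=True)
    lam, step = 0.1, 1.0 / 8  # hypothetical penalty weight and grid spacing
    # Hypothetical usage: add the penalty to an ordinary task loss.
    loss = w.pow(2).mean() + lam * sinusoidal_quant_penalty(w, step)
    loss.backward()
    print(float(loss), w.grad.shape)
```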
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.