Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks
- URL: http://arxiv.org/abs/2310.02244v5
- Date: Thu, 12 Oct 2023 17:50:31 GMT
- Title: Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks
- Authors: Greg Yang, Dingli Yu, Chen Zhu, Soufiane Hayou
- Abstract summary: We investigate the analogous classification for *depthwise parametrizations* of deep residual networks (resnets).
In resnets where each block has only one layer, we identify a unique optimal parametrization, called Depth-$\mu$P.
We find that Depth-$\mu$P can be characterized as maximizing both feature learning and feature diversity.
- Score: 42.14352997147652
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: By classifying infinite-width neural networks and identifying the *optimal*
limit, Tensor Programs IV and V demonstrated a universal way, called $\mu$P,
for *widthwise hyperparameter transfer*, i.e., predicting optimal
hyperparameters of wide neural networks from narrow ones. Here we investigate
the analogous classification for *depthwise parametrizations* of deep residual
networks (resnets). We classify depthwise parametrizations of block multiplier
and learning rate by their infinite-width-then-depth limits. In resnets where
each block has only one layer, we identify a unique optimal parametrization,
called Depth-$\mu$P, that extends $\mu$P, and we show empirically that it admits
depthwise hyperparameter transfer. We identify *feature diversity* as a crucial
factor in deep networks, and Depth-$\mu$P can be characterized as maximizing
both feature learning and feature diversity. Exploiting this, we find that
absolute value, among all homogeneous nonlinearities, maximizes feature
diversity and indeed empirically leads to significantly better performance.
However, if each block is deeper (such as modern transformers), then we find
fundamental limitations in all possible infinite-depth limits of such
parametrizations, which we illustrate both theoretically and empirically on
simple networks as well as a Megatron transformer trained on Common Crawl.
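To make the depthwise parametrization concrete, here is a minimal sketch of a resnet whose blocks each contain a single linear layer and whose residual branch carries a block multiplier of order $1/\sqrt{L}$, with the absolute value as the homogeneous nonlinearity that the abstract reports maximizes feature diversity. This is an illustrative sketch under stated assumptions, not the paper's code: the class name is invented, the placement of the nonlinearity is one of several reasonable choices, and the paper's depthwise learning-rate scaling is not reproduced here.

```python
# Minimal sketch (not the paper's reference code) of a depthwise-parametrized
# resnet in the spirit of Depth-muP: one linear layer per block, residual branch
# scaled by a block multiplier a / sqrt(L), absolute value as the nonlinearity.
import math

import torch
import torch.nn as nn


class OneLayerBlockResNet(nn.Module):
    """Hypothetical illustration: x_{l+1} = x_l + (a / sqrt(L)) * W_l |x_l|."""

    def __init__(self, width: int, depth: int, block_mult: float = 1.0):
        super().__init__()
        # Block multiplier a / sqrt(L): with zero-mean weights, the L block
        # contributions have random signs, so the residual stream stays O(1)
        # in depth at initialization instead of blowing up or vanishing.
        self.scale = block_mult / math.sqrt(depth)
        self.blocks = nn.ModuleList(
            [nn.Linear(width, width, bias=False) for _ in range(depth)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Absolute value as the block nonlinearity, following the abstract's
            # claim that it maximizes feature diversity among homogeneous ones.
            x = x + self.scale * block(torch.abs(x))
        return x


if __name__ == "__main__":
    torch.manual_seed(0)
    model = OneLayerBlockResNet(width=256, depth=64)
    x = torch.randn(8, 256)
    print(model(x).shape)  # torch.Size([8, 256])
```

Under Depth-$\mu$P, `block_mult` and the base learning rate are the quantities one would tune at small depth and reuse at larger depth; the paper's classification specifies how the learning rate itself must scale with $L$ for a given optimizer, which this sketch does not attempt to reproduce.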
Related papers
- Local Loss Optimization in the Infinite Width: Stable Parameterization of Predictive Coding Networks and Target Propagation [8.35644084613785]
We introduce the maximal update parameterization ($\mu$P) in the infinite-width limit for two representative designs of local targets.
By analyzing deep linear networks, we found that PC's gradients interpolate between first-order and Gauss-Newton-like gradients.
We demonstrate that, in specific standard settings, PC in the infinite-width limit behaves more similarly to the first-order gradient.
arXiv Detail & Related papers (2024-11-04T11:38:27Z)
- Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning [77.82908213345864]
We find empirical evidence that learning rate transfer can be attributed to the fact that under $\mu$P and its depth extension, the largest eigenvalue of the training loss Hessian is largely independent of the width and depth of the network.
We show that under the neural tangent kernel (NTK) regime, the sharpness exhibits very different dynamics at different scales, thus preventing learning rate transfer.
arXiv Detail & Related papers (2024-02-27T12:28:01Z)
- Principled Architecture-aware Scaling of Hyperparameters [69.98414153320894]
Training a high-quality deep neural network requires choosing suitable hyperparameters, which is a non-trivial and expensive process.
In this work, we precisely characterize the dependence of initializations and maximal learning rates on the network architecture.
We demonstrate that network rankings in benchmarks can easily change once the networks are trained better using architecture-aware hyperparameters.
arXiv Detail & Related papers (2024-02-27T11:52:49Z)
- Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit [48.291961660957384]
We provide experiments demonstrating that residual architectures, including convolutional ResNets and Vision Transformers, exhibit transfer of optimal hyperparameters across width and depth.
Using recent developments in the dynamical mean field theory (DMFT) description of neural network learning dynamics, we show that this parameterization of ResNets admits a well-defined feature learning joint infinite-width and infinite-depth limit.
arXiv Detail & Related papers (2023-09-28T17:20:50Z)
- Bayesian Interpolation with Deep Linear Networks [92.1721532941863]
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory.
We show that linear networks make provably optimal predictions at infinite depth.
We also show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth.
arXiv Detail & Related papers (2022-12-29T20:57:46Z)
- ReduNet: A White-box Deep Network from the Principle of Maximizing Rate Reduction [32.489371527159236]
This work provides a plausible theoretical framework that aims to interpret modern deep (convolutional) networks from the principles of data compression and discriminative representation.
We show that for high-dimensional multi-class data, the optimal linear discriminative representation maximizes the coding rate difference between the whole dataset and the average of all the subsets.
We show that the basic iterative gradient ascent scheme for optimizing the rate reduction objective naturally leads to a multi-layer deep network, named ReduNet, that shares common characteristics of modern deep networks.
arXiv Detail & Related papers (2021-05-21T16:29:57Z)
- Feature Learning in Infinite-Width Neural Networks [17.309380337367536]
We show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features.
We propose simple modifications to the standard parametrization to allow for feature learning in the limit.
arXiv Detail & Related papers (2020-11-30T03:21:05Z)
- Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
- WaveQ: Gradient-Based Deep Quantization of Neural Networks through Sinusoidal Adaptive Regularization [8.153944203144988]
We propose a novel sinusoidal regularization, called SINAREQ, for deep quantized training.
We show how SINAREQ balances compute efficiency and accuracy, and provides a heterogeneous bitwidth assignment for quantization of a large variety of deep networks.
arXiv Detail & Related papers (2020-02-29T01:19:55Z)
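The sinusoidal-regularization idea in the WaveQ entry above can be illustrated with a minimal sketch: a differentiable $\sin^2$ penalty that vanishes exactly on a uniform quantization grid, so adding it to the task loss pulls weights toward quantized values during ordinary gradient training. This is a hedged sketch of the general idea only, not the WaveQ/SINAREQ reference implementation; the function name, grid spacing, and penalty weight below are hypothetical choices, and the adaptive per-layer period that the paper uses to arrive at heterogeneous bitwidths is not reproduced.

```python
# Minimal sketch of a sinusoidal quantization regularizer (an illustration of
# the general idea, not the WaveQ/SINAREQ reference code): the penalty is zero
# exactly when every weight is an integer multiple of the quantization step.
import math

import torch


def sinusoidal_quant_penalty(weights: torch.Tensor, step: float) -> torch.Tensor:
    # sin(pi * w / step) == 0 iff w is an integer multiple of `step`, so this
    # term is differentiable everywhere and vanishes on the quantization grid.
    return torch.sin(math.pi * weights / step).pow(2).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    w = torch.randn(1000, requires_grad=True)
    lam, step = 0.1, 1.0 / 8  # hypothetical penalty weight and grid spacing
    # Hypothetical usage: add the penalty to an ordinary task loss.
    loss = w.pow(2).mean() + lam * sinusoidal_quant_penalty(w, step)
    loss.backward()
    print(float(loss), w.grad.shape)
```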
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.