Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and
Scaling Limit
- URL: http://arxiv.org/abs/2309.16620v2
- Date: Fri, 8 Dec 2023 18:19:44 GMT
- Title: Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and
Scaling Limit
- Authors: Blake Bordelon, Lorenzo Noci, Mufan Bill Li, Boris Hanin, Cengiz
Pehlevan
- Abstract summary: We provide experiments demonstrating that residual architectures including convolutional ResNets and Vision Transformers exhibit transfer of optimal hyperparameters across width and depth.
Using recent developments in the dynamical mean field theory (DMFT) description of neural network learning dynamics, we show that this parameterization of ResNets admits a well-defined feature learning joint infinite-width and infinite-depth limit.
- Score: 48.291961660957384
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The cost of hyperparameter tuning in deep learning has been rising with model
sizes, prompting practitioners to find new tuning methods using a proxy of
smaller networks. One such proposal uses $\mu$P parameterized networks, where
the optimal hyperparameters for small width networks transfer to networks with
arbitrarily large width. However, in this scheme, hyperparameters do not
transfer across depths. As a remedy, we study residual networks with a residual
branch scale of $1/\sqrt{\text{depth}}$ in combination with the $\mu$P
parameterization. We provide experiments demonstrating that residual
architectures including convolutional ResNets and Vision Transformers trained
with this parameterization exhibit transfer of optimal hyperparameters across
width and depth on CIFAR-10 and ImageNet. Furthermore, our empirical findings
are supported and motivated by theory. Using recent developments in the
dynamical mean field theory (DMFT) description of neural network learning
dynamics, we show that this parameterization of ResNets admits a well-defined
feature learning joint infinite-width and infinite-depth limit and show
convergence of finite-size network dynamics towards this limit.
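To make the parameterization concrete, here is a minimal sketch (in PyTorch, with an MLP-style block standing in for the convolutional and attention blocks used in the paper) of a residual network whose branches are scaled by $1/\sqrt{\text{depth}}$. The full $\mu$P recipe additionally rescales per-layer initializations and learning rates with width, which is omitted here; the readout multiplier below is only an illustrative stand-in.

```python
# Minimal sketch (not the authors' released code): a residual network whose
# branches are scaled by 1/sqrt(depth). Full muP additionally rescales
# per-layer initializations and learning rates with width (omitted here).
import math
import torch
import torch.nn as nn


class ScaledResidualBlock(nn.Module):
    def __init__(self, width: int, depth: int):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Linear(width, width),
            nn.ReLU(),
            nn.Linear(width, width),
        )
        # Residual-branch scale of 1/sqrt(depth) studied in the paper.
        self.branch_scale = 1.0 / math.sqrt(depth)

    def forward(self, x):
        return x + self.branch_scale * self.branch(x)


class ScaledResNet(nn.Module):
    def __init__(self, in_dim: int, width: int, depth: int, out_dim: int):
        super().__init__()
        self.read_in = nn.Linear(in_dim, width)
        self.blocks = nn.Sequential(
            *[ScaledResidualBlock(width, depth) for _ in range(depth)]
        )
        self.read_out = nn.Linear(width, out_dim, bias=False)
        self.width = width

    def forward(self, x):
        h = self.blocks(torch.relu(self.read_in(x)))
        # A muP-style readout is scaled down with width; the 1/width output
        # multiplier here is an illustrative stand-in for that choice.
        return self.read_out(h) / self.width


if __name__ == "__main__":
    model = ScaledResNet(in_dim=32, width=128, depth=16, out_dim=10)
    print(model(torch.randn(8, 32)).shape)  # torch.Size([8, 10])
```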
Related papers
- Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning [77.82908213345864]
We find empirical evidence that learning rate transfer can be attributed to the fact that under $\mu$P and its depth extension, the largest eigenvalue of the training loss Hessian is largely independent of the width and depth of the network.
We show that under the neural tangent kernel (NTK) regime, the sharpness exhibits very different dynamics at different scales, thus preventing learning rate transfer.
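The sharpness referred to here is the largest eigenvalue of the training-loss Hessian; a standard way to estimate it (not code from the cited paper) is power iteration on Hessian-vector products:

```python
# Hedged sketch: estimate the largest Hessian eigenvalue (sharpness) of a
# loss via power iteration on Hessian-vector products. The model and loss
# in the usage example are placeholders, not anything from the cited paper.
import torch


def sharpness(loss, params, iters: int = 20):
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    v = torch.randn_like(flat_grad)
    v = v / v.norm()
    eig = 0.0
    for _ in range(iters):
        # Hessian-vector product: gradient of (grad . v) w.r.t. the params.
        hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
        hv = torch.cat([h.reshape(-1) for h in hv])
        eig = (v @ hv).item()            # Rayleigh quotient estimate
        v = hv / (hv.norm() + 1e-12)
    return eig


if __name__ == "__main__":
    model = torch.nn.Linear(10, 1)
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    print(sharpness(loss, list(model.parameters())))
```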
arXiv Detail & Related papers (2024-02-27T12:28:01Z)
- Do deep neural networks utilize the weight space efficiently? [2.9914612342004503]
Deep learning models like Transformers and Convolutional Neural Networks (CNNs) have revolutionized various domains, but their parameter-intensive nature hampers deployment in resource-constrained settings.
We introduce a novel concept utilizing column space and row space of weight matrices, which allows for a substantial reduction in model parameters without compromising performance.
Our approach applies to both Bottleneck and Attention layers, effectively halving the parameters while incurring only minor performance degradation.
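The summary does not spell out the construction, so the following is only a generic illustration of cutting a layer's parameters by routing it through a lower-dimensional row/column space via a rank-r factorization; it is not claimed to be the cited paper's method:

```python
# Generic illustration only: replace a dense weight matrix W (d_out x d_in)
# with a rank-r factorization U @ V, which cuts parameters when r is small.
# This is a common stand-in for row/column-space ideas, not the cited
# paper's exact construction.
import torch
import torch.nn as nn


class FactorizedLinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.U = nn.Linear(rank, d_out, bias=False)   # column-space factor
        self.V = nn.Linear(d_in, rank, bias=False)    # row-space factor

    def forward(self, x):
        return self.U(self.V(x))


dense = nn.Linear(1024, 1024)
factored = FactorizedLinear(1024, 1024, rank=256)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense), count(factored))  # ~1.05M vs ~0.52M parameters
```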
arXiv Detail & Related papers (2024-01-26T21:51:49Z)
- Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks [42.14352997147652]
We investigate the analogous classification for *depthwise parametrizations* of deep residual networks (resnets).
In resnets where each block has only one layer, we identify a unique optimal parametrization, called Depth-$\mu$P.
We find that Depth-$\mu$P can be characterized as maximizing both feature learning and feature diversity.
arXiv Detail & Related papers (2023-10-03T17:50:40Z)
- Optimization Guarantees of Unfolded ISTA and ADMM Networks With Smooth Soft-Thresholding [57.71603937699949]
We study optimization guarantees, i.e., conditions under which the training loss approaches zero as the number of training epochs increases.
We show that the threshold on the number of training samples increases with the network width.
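For reference, a minimal sketch of an unfolded ISTA network with a smooth, softplus-based surrogate for soft-thresholding; the learnable per-layer step sizes and thresholds are illustrative assumptions rather than the exact architecture analyzed in the paper:

```python
# Hedged sketch: an unfolded (learned) ISTA network for sparse recovery
# y ~ A x, with a smooth softplus-based surrogate for soft-thresholding.
# Learnable per-layer step sizes/thresholds are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def smooth_soft_threshold(x, lam, beta: float = 10.0):
    # Smooth approximation of sign(x) * max(|x| - lam, 0).
    return F.softplus(x - lam, beta=beta) - F.softplus(-x - lam, beta=beta)


class UnfoldedISTA(nn.Module):
    def __init__(self, A: torch.Tensor, num_layers: int = 10):
        super().__init__()
        self.register_buffer("A", A)                       # (m, n) sensing matrix
        self.step = nn.Parameter(torch.full((num_layers,), 0.1))
        self.lam = nn.Parameter(torch.full((num_layers,), 0.05))

    def forward(self, y):
        # y: (batch, m); recover sparse x: (batch, n)
        x = torch.zeros(y.shape[0], self.A.shape[1], device=y.device)
        for k in range(len(self.step)):
            residual = y - x @ self.A.T
            x = smooth_soft_threshold(x + self.step[k] * residual @ self.A,
                                      self.lam[k])
        return x


if __name__ == "__main__":
    A = torch.randn(20, 50) / 20 ** 0.5
    net = UnfoldedISTA(A, num_layers=8)
    print(net(torch.randn(4, 20)).shape)  # torch.Size([4, 50])
```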
arXiv Detail & Related papers (2023-09-12T13:03:47Z)
- Field theory for optimal signal propagation in ResNets [1.053373860696675]
Residual networks have significantly better trainability and performance than feed-forward networks at large depth.
Previous works found that adding a scaling parameter for the residual branch further improves generalization performance.
We derive a systematic finite-size field theory for residual networks to study signal propagation and its dependence on the scaling for the residual branch.
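A toy numerical experiment (not the paper's finite-size field theory) makes the role of the residual-branch scale visible by tracking the activation magnitude of a random residual network across depth for several branch scales $\alpha$:

```python
# Hedged sketch: measure the final activation norm of a random residual
# network for different residual-branch scales alpha. A toy forward pass,
# not the cited paper's field-theory calculation.
import torch

width, depth, batch = 512, 64, 16

for alpha in (1.0, 0.5, depth ** -0.5):
    torch.manual_seed(0)
    h = torch.randn(batch, width)
    for _ in range(depth):
        W = torch.randn(width, width) / width ** 0.5   # variance-1/width init
        h = h + alpha * torch.relu(h @ W)              # scaled residual branch
    rms = h.pow(2).mean().sqrt().item()
    print(f"alpha={alpha:.3f}  final RMS activation={rms:.2f}")
```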
arXiv Detail & Related papers (2023-05-12T18:14:21Z)
- Iterative Soft Shrinkage Learning for Efficient Image Super-Resolution [91.3781512926942]
Image super-resolution (SR) has witnessed extensive neural network designs from CNN to transformer architectures.
This work investigates the potential of network pruning for super-resolution to take advantage of off-the-shelf network designs and reduce the underlying computational overhead.
We propose a novel Iterative Soft Shrinkage-Percentage (ISS-P) method that optimizes the sparse structure of a randomly initialized network at each iteration and tweaks unimportant weights by a small amount proportional to the magnitude scale on-the-fly.
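As a rough illustration of soft shrinkage, i.e., shrinking low-magnitude weights in proportion to their scale instead of hard-masking them (the percentile and shrink factor below are assumptions, not the ISS-P schedule):

```python
# Hedged sketch: one soft-shrinkage pruning step. Weights below a
# percentile threshold are not zeroed outright but shrunk by an amount
# proportional to their magnitude. The shrink factor and percentile are
# illustrative assumptions, not the ISS-P algorithm verbatim.
import torch


def soft_shrink_step(weight: torch.Tensor, percentile: float = 0.3,
                     shrink: float = 0.1) -> torch.Tensor:
    flat = weight.abs().flatten()
    k = max(1, int(percentile * flat.numel()))
    threshold = flat.kthvalue(k).values
    unimportant = weight.abs() <= threshold
    # Shrink small weights toward zero proportionally to their magnitude.
    return torch.where(unimportant, weight * (1.0 - shrink), weight)


w = torch.randn(256, 256)
before = w.abs().mean().item()
for _ in range(20):                       # iterative shrinkage over steps
    w = soft_shrink_step(w)
print(before, w.abs().mean().item())      # mean |w| drops as small weights shrink
```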
arXiv Detail & Related papers (2023-03-16T21:06:13Z)
- The Underlying Correlated Dynamics in Neural Training [6.385006149689549]
Training of neural networks is a computationally intensive task.
We propose a model based on the correlation of the parameters' dynamics, which dramatically reduces the dimensionality.
This representation enhances the understanding of the underlying training dynamics and can pave the way for designing better acceleration techniques.
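A generic way to expose such low-dimensional structure (a stand-in for the paper's correlation-based model) is to run an SVD/PCA on the matrix of flattened parameter snapshots collected during training:

```python
# Hedged sketch: collect parameter snapshots during training and inspect
# the spectrum of their correlations via SVD/PCA. A generic diagnostic,
# not the cited paper's specific model; the data here is synthetic.
import numpy as np

# Suppose `snapshots` has shape (num_steps, num_params): one flattened
# parameter vector per training step (random low-rank data as a placeholder).
rng = np.random.default_rng(0)
num_steps, num_params = 200, 5000
latent = rng.normal(size=(num_steps, 5))             # 5 hidden "modes"
mixing = rng.normal(size=(5, num_params))
snapshots = latent @ mixing + 0.01 * rng.normal(size=(num_steps, num_params))

centered = snapshots - snapshots.mean(axis=0, keepdims=True)
# Singular values of the trajectory matrix reveal its effective dimension.
s = np.linalg.svd(centered, compute_uv=False)
explained = (s ** 2) / (s ** 2).sum()
print("variance captured by top 5 modes:", explained[:5].sum())
```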
arXiv Detail & Related papers (2022-12-18T08:34:11Z)
- Faster Convergence in Deep-Predictive-Coding Networks to Learn Deeper Representations [12.716429755564821]
Deep-predictive-coding networks (DPCNs) are hierarchical, generative models that rely on feed-forward and feed-back connections.
A crucial element of DPCNs is a forward-backward inference procedure to uncover sparse states of a dynamic model.
We propose an optimization strategy, with better empirical and theoretical convergence, based on accelerated proximal gradients.
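For context, accelerated proximal gradient methods of the FISTA type solve sparse inference problems such as the LASSO; the sketch below uses a placeholder dictionary and objective rather than the DPCN inference step itself:

```python
# Hedged sketch: accelerated proximal gradient (FISTA-style) iterations for
# min_x 0.5*||y - A x||^2 + lam*||x||_1, standing in for the kind of
# sparse-state inference used in DPCNs (not the cited paper's procedure).
import numpy as np


def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def fista(A, y, lam=0.1, iters=200):
    step = 1.0 / np.linalg.norm(A, 2) ** 2        # 1 / Lipschitz constant
    x = np.zeros(A.shape[1])
    z, t = x.copy(), 1.0
    for _ in range(iters):
        grad = A.T @ (A @ z - y)
        x_new = soft_threshold(z - step * grad, lam * step)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t ** 2))
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum extrapolation
        x, t = x_new, t_new
    return x


rng = np.random.default_rng(0)
A = rng.normal(size=(40, 100)) / np.sqrt(40)
x_true = np.zeros(100)
x_true[:5] = rng.normal(size=5)
x_hat = fista(A, A @ x_true, lam=0.01)
print(np.linalg.norm(x_hat - x_true))     # small recovery error
```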
arXiv Detail & Related papers (2021-01-18T02:30:13Z)
- Neural Parameter Allocation Search [57.190693718951316]
Training neural networks requires increasing amounts of memory.
Existing methods assume networks have many identical layers and utilize hand-crafted sharing strategies that fail to generalize.
We introduce Neural Parameter Allocation Search (NPAS), a novel task where the goal is to train a neural network given an arbitrary, fixed parameter budget.
NPAS covers both low-budget regimes, which produce compact networks, as well as a novel high-budget regime, where additional capacity can be added to boost performance without increasing inference FLOPs.
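One simple way to picture a fixed parameter budget that is decoupled from depth (an illustration, not the NPAS search procedure) is to let every layer draw its weights from a single shared parameter bank:

```python
# Hedged sketch: layers draw their weights from one fixed-size parameter
# bank, so the trainable budget is independent of network depth. The
# slicing scheme is an illustrative assumption, not the NPAS method.
import torch
import torch.nn as nn


class BankLinear(nn.Module):
    def __init__(self, bank: nn.Parameter, offset: int, d_in: int, d_out: int):
        super().__init__()
        self.bank, self.offset, self.shape = bank, offset, (d_out, d_in)

    def forward(self, x):
        n = self.shape[0] * self.shape[1]
        # Wrap around the bank so any budget can serve any layer size.
        idx = (self.offset + torch.arange(n)) % self.bank.numel()
        W = self.bank[idx].view(self.shape)
        return x @ W.T


budget = 4096                               # total trainable parameters
bank = nn.Parameter(torch.randn(budget) * 0.02)
layers = [BankLinear(bank, 1000 * i, 64, 64) for i in range(8)]
x = torch.randn(2, 64)
for layer in layers:
    x = torch.relu(layer(x))
print(x.shape, bank.numel())                # depth grows, budget stays 4096
```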
arXiv Detail & Related papers (2020-06-18T15:01:00Z)
- Deep Adaptive Inference Networks for Single Image Super-Resolution [72.7304455761067]
Single image super-resolution (SISR) has witnessed tremendous progress in recent years owing to the deployment of deep convolutional neural networks (CNNs).
In this paper, we take a step forward to address this issue by leveraging adaptive inference networks for deep SISR (AdaDSR).
Our AdaDSR involves an SISR model as backbone and a lightweight adapter module which takes image features and resource constraint as input and predicts a map of local network depth.
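A hedged sketch of the general idea of depth-adaptive inference: a placeholder adapter predicts a per-location depth map from the features and a resource scalar, and later blocks are masked out where the predicted depth is exhausted (architecture details are assumptions, not the AdaDSR implementation):

```python
# Hedged sketch of depth-adaptive super-resolution inference: a lightweight
# adapter predicts a per-location depth map, and each residual block only
# updates locations whose predicted depth exceeds its index. Placeholder
# architecture, not the AdaDSR implementation.
import torch
import torch.nn as nn


class AdaptiveDepthSR(nn.Module):
    def __init__(self, channels: int = 32, num_blocks: int = 8):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        self.blocks = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_blocks)
        )
        # Adapter maps features plus a resource scalar to a local depth map.
        self.adapter = nn.Conv2d(channels + 1, 1, 3, padding=1)
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)
        self.num_blocks = num_blocks

    def forward(self, x, resource: float = 1.0):
        h = self.head(x)
        r = torch.full_like(h[:, :1], resource)
        depth_map = torch.sigmoid(self.adapter(torch.cat([h, r], dim=1)))
        depth_map = depth_map * self.num_blocks       # values in [0, num_blocks]
        for i, block in enumerate(self.blocks):
            mask = (depth_map > i).float()            # skip exhausted locations
            h = h + mask * torch.relu(block(h))
        return self.tail(h)


if __name__ == "__main__":
    net = AdaptiveDepthSR()
    out = net(torch.randn(1, 3, 48, 48), resource=0.5)
    print(out.shape)  # torch.Size([1, 3, 48, 48])
```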
arXiv Detail & Related papers (2020-04-08T10:08:20Z)