Width and Depth Limits Commute in Residual Networks
- URL: http://arxiv.org/abs/2302.00453v2
- Date: Thu, 10 Aug 2023 16:09:55 GMT
- Title: Width and Depth Limits Commute in Residual Networks
- Authors: Soufiane Hayou, Greg Yang
- Abstract summary: We show that taking the width and depth to infinity in a deep neural network with skip connections results in the same covariance structure no matter how that limit is taken.
This explains why the standard infinite-width-then-depth approach provides practical insights even for networks with depth of the same order as width.
We conduct extensive simulations that show an excellent match with our theoretical findings.
- Score: 26.97391529844503
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We show that taking the width and depth to infinity in a deep neural network
with skip connections, when branches are scaled by $1/\sqrt{\text{depth}}$ (the only
nontrivial scaling), results in the same covariance structure no matter how that
limit is taken. This explains why the standard infinite-width-then-depth
approach provides practical insights even for networks with depth of the same
order as width. We also demonstrate that in this case the pre-activations are
Gaussian-distributed, a fact with direct applications in Bayesian deep
learning. We conduct extensive simulations that show an excellent match with
our theoretical findings.
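To make the abstract's claim concrete, below is a minimal simulation sketch (not the authors' code): a residual network whose branches are scaled by $1/\sqrt{\text{depth}}$ is run at three very different width/depth ratios, and the empirical covariance of the final pre-activations for two fixed inputs is averaged over random initialisations. The Gaussian $N(0, 1/\text{width})$ weights, ReLU branches, and random input embedding are assumptions made for illustration; per the commutativity result, the three estimates should be broadly similar, up to finite-size noise.

```python
# Minimal sketch (assumptions noted above, not the authors' code): estimate the
# covariance of final pre-activations in a ResNet with 1/sqrt(depth)-scaled
# branches, for several width/depth regimes.
import numpy as np

def resnet_covariance(width, depth, x1, x2, seed=0):
    """Empirical covariance <y1, y2>/width of the final pre-activations for two
    inputs propagated through a width-by-depth residual network."""
    rng = np.random.default_rng(seed)
    # Random embedding of the inputs into dimension `width`.
    W_in = rng.normal(0.0, 1.0 / np.sqrt(x1.size), size=(width, x1.size))
    y1, y2 = W_in @ x1, W_in @ x2
    for _ in range(depth):
        W = rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width))
        # Residual branch scaled by 1/sqrt(depth): y <- y + W relu(y) / sqrt(depth)
        y1 = y1 + W @ np.maximum(y1, 0.0) / np.sqrt(depth)
        y2 = y2 + W @ np.maximum(y2, 0.0) / np.sqrt(depth)
    return float(y1 @ y2 / width)

x1 = np.array([1.0, 0.0, 0.0])
x2 = np.array([0.5, 0.5, 0.0])
# Width >> depth, width ~ depth, and width << depth.
for width, depth in [(2048, 8), (256, 256), (8, 2048)]:
    covs = [resnet_covariance(width, depth, x1, x2, seed=s) for s in range(20)]
    print(f"width={width:5d} depth={depth:5d}  mean covariance = {np.mean(covs):.3f}")
```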
Related papers
- Robust Learning in Bayesian Parallel Branching Graph Neural Networks: The Narrow Width Limit [4.373803477995854]
We investigate the narrow width limit of the Bayesian Parallel Branching Graph Neural Network (BPB-GNN)
We show that when the width of a BPB-GNN is significantly smaller than the number of training examples, each branch exhibits more robust learning.
Our results characterize a newly defined narrow-width regime for parallel branching networks in general.
arXiv Detail & Related papers (2024-07-26T15:14:22Z)
- Neural networks: deep, shallow, or in between? [0.6043356028687779]
We give estimates for the error of approximation of a compact subset from a Banach space by the outputs of feed-forward neural networks with width W, depth l and Lipschitz activation functions.
We show that, modulo logarithmic factors, rates better than entropy numbers' rates are possibly attainable only for neural networks whose depth l goes to infinity.
arXiv Detail & Related papers (2023-10-11T04:50:28Z)
- Commutative Width and Depth Scaling in Deep Neural Networks [6.019182604573028]
This paper is the second in a series about commutativity of infinite width and depth limits in deep neural networks.
We formally introduce and define the commutativity framework, and discuss its implications on neural network design and scaling.
arXiv Detail & Related papers (2023-10-02T22:39:09Z)
- Quantitative CLTs in Deep Neural Networks [12.845031126178593]
We study the distribution of a fully connected neural network with random Gaussian weights and biases.
We obtain quantitative bounds on normal approximations valid at large but finite width $n$ and any fixed network depth (a toy numerical check of this Gaussian behaviour is sketched after the related-papers list below).
Our bounds are strictly stronger in terms of their dependence on network width than any previously available in the literature.
arXiv Detail & Related papers (2023-07-12T11:35:37Z)
- Feature-Learning Networks Are Consistent Across Widths At Realistic Scales [72.27228085606147]
We study the effect of width on the dynamics of feature-learning neural networks across a variety of architectures and datasets.
Early in training, wide neural networks trained on online data not only have identical loss curves but also agree in their pointwise test predictions throughout training.
We observe, however, that ensembles of narrower networks perform worse than a single wide network.
arXiv Detail & Related papers (2023-05-28T17:09:32Z)
- Bayesian Interpolation with Deep Linear Networks [92.1721532941863]
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory.
We show that linear networks make provably optimal predictions at infinite depth.
We also show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth.
arXiv Detail & Related papers (2022-12-29T20:57:46Z)
- Unified Field Theory for Deep and Recurrent Neural Networks [56.735884560668985]
We present a unified and systematic derivation of the mean-field theory for both recurrent and deep networks.
We find that convergence towards the mean-field theory is typically slower for recurrent networks than for deep networks.
Our method exposes that Gaussian processes are but the lowest order of a systematic expansion in $1/n$.
arXiv Detail & Related papers (2021-12-10T15:06:11Z)
- The Limitations of Large Width in Neural Networks: A Deep Gaussian Process Perspective [34.67386186205545]
This paper decouples capacity and width via the generalization of neural networks to Deep Gaussian Processes (Deep GP).
Surprisingly, we prove that even nonparametric Deep GP converges to Gaussian processes, effectively becoming shallower without any increase in representational power.
We find there is a "sweet spot" that maximizes test set performance before the limiting GP behavior prevents adaptability, occurring at width = 1 or width = 2 for nonparametric Deep GP.
arXiv Detail & Related papers (2021-06-11T17:58:58Z)
- A Convergence Theory Towards Practical Over-parameterized Deep Neural Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks whose width is quadratic in the sample size and linear in the depth, with convergence time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z)
- Bayesian Deep Ensembles via the Neural Tangent Kernel [49.569912265882124]
We explore the link between deep ensembles and Gaussian processes (GPs) through the lens of the Neural Tangent Kernel (NTK).
We introduce a simple modification to standard deep-ensemble training: the addition of a computationally tractable, randomised, and untrainable function to each ensemble member (a rough sketch of this idea appears after the related-papers list below).
We prove that our Bayesian deep ensembles make more conservative predictions than standard deep ensembles in the infinite width limit.
arXiv Detail & Related papers (2020-07-11T22:10:52Z)
- On Random Kernels of Residual Architectures [93.94469470368988]
We derive finite width and depth corrections for the Neural Tangent Kernel (NTK) of ResNets and DenseNets.
Our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity.
In DenseNets, however, convergence of the NTK to its limit as the width tends to infinity is guaranteed.
arXiv Detail & Related papers (2020-01-28T16:47:53Z)
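As referenced in the "Quantitative CLTs in Deep Neural Networks" item above, the following is a rough numerical check (not taken from that paper) that the scalar output of a fully connected network with random Gaussian weights looks increasingly Gaussian as the width grows. The depth, the tanh activation, the 1/fan-in weight variance, and the use of skewness and excess kurtosis as normality diagnostics are all assumptions made for illustration.

```python
# Hedged illustration: the output of a random fully connected network at a
# fixed input should look more Gaussian (skewness and excess kurtosis near 0)
# as the width grows.
import numpy as np

def random_net_output(width, depth, x, rng):
    """One forward pass through a fully connected net with iid N(0, 1/fan_in)
    weights, zero biases, and tanh activations; returns a scalar output."""
    h = x
    for _ in range(depth):
        W = rng.normal(0.0, 1.0 / np.sqrt(h.size), size=(width, h.size))
        h = np.tanh(W @ h)
    w_out = rng.normal(0.0, 1.0 / np.sqrt(width), size=width)
    return float(w_out @ h)

rng = np.random.default_rng(0)
x = np.ones(16)                      # fixed input
for width in [4, 32, 256]:
    samples = np.array([random_net_output(width, depth=3, x=x, rng=rng)
                        for _ in range(2000)])
    z = (samples - samples.mean()) / samples.std()
    skew = np.mean(z ** 3)           # ~0 for a Gaussian
    ex_kurt = np.mean(z ** 4) - 3.0  # ~0 for a Gaussian
    print(f"width={width:4d}  skewness={skew:+.3f}  excess kurtosis={ex_kurt:+.3f}")
```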
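For the "Bayesian Deep Ensembles via the Neural Tangent Kernel" item above, here is a rough sketch of the general idea of adding a fixed, randomised, untrainable function to each ensemble member. This is a generic randomised-prior-style illustration rather than the paper's exact NTK-based construction; the random-feature models, ridge regression, and hyperparameters are assumptions.

```python
# Sketch: each ensemble member's prediction is a trainable model plus a fixed,
# randomly initialised, untrainable function. The trainable part is fit to the
# data with the random function held fixed, so members disagree more away from
# the training data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(40, 1))              # toy 1-D regression data
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=40)

def features(x, W, b):
    """Random ReLU feature map phi(x) = relu(x W^T + b)."""
    return np.maximum(x @ W.T + b, 0.0)

def make_member(beta=0.5, m=64, ridge=1e-2):
    # Untrainable random function g(x) = phi_g(x) @ v, added to the prediction.
    Wg, bg, v = rng.normal(size=(m, 1)), rng.normal(size=m), rng.normal(size=m)
    # Trainable part: ridge regression on its own random features, fit to y - beta*g(X).
    Wf, bf = rng.normal(size=(m, 1)), rng.normal(size=m)
    prior_train = beta * features(X, Wg, bg) @ v
    Phi = features(X, Wf, bf)
    theta = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(m),
                            Phi.T @ (y - prior_train))
    return lambda x: features(x, Wf, bf) @ theta + beta * features(x, Wg, bg) @ v

ensemble = [make_member() for _ in range(5)]
x_test = np.linspace(-2.0, 2.0, 5).reshape(-1, 1)
preds = np.stack([member(x_test) for member in ensemble])
print("ensemble mean:", preds.mean(axis=0).round(2))
print("ensemble std :", preds.std(axis=0).round(2))   # typically larger away from the data
```

The fixed random function is what keeps the members diverse away from the training data, which is the mechanism the summary above alludes to.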