Commutative Width and Depth Scaling in Deep Neural Networks
- URL: http://arxiv.org/abs/2310.01683v1
- Date: Mon, 2 Oct 2023 22:39:09 GMT
- Title: Commutative Width and Depth Scaling in Deep Neural Networks
- Authors: Soufiane Hayou
- Abstract summary: This paper is the second in a series about commutativity of infinite width and depth limits in deep neural networks.
We formally introduce and define the commutativity framework, and discuss its implications for neural network design and scaling.
- Score: 6.019182604573028
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper is the second in the series Commutative Scaling of Width and Depth
(WD) about commutativity of infinite width and depth limits in deep neural
networks. Our aim is to understand the behaviour of neural functions (functions
that depend on a neural network model) as width and depth go to infinity (in
some sense), and eventually identify settings under which commutativity holds,
i.e. the neural function tends to the same limit no matter how width and depth
limits are taken. In this paper, we formally introduce and define the
commutativity framework, and discuss its implications for neural network design
and scaling. We study commutativity for the neural covariance kernel which
reflects how network layers separate data. Our findings extend previous results
established in [55] by showing that taking the width and depth to infinity in a
deep neural network with skip connections, when branches are suitably scaled to
avoid exploding behaviour, results in the same covariance structure no matter
how that limit is taken. This has a number of theoretical and practical
implications that we discuss in the paper. The proof techniques in this paper
are novel and rely on tools that are more accessible to readers who are not
familiar with stochastic calculus (used in the proofs of WD(I)).
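As a rough numerical illustration of the commutativity claim, the sketch below (not the paper's code; the architecture, branch scaling, and all parameter choices are assumptions made for illustration) pushes two inputs through a ReLU residual network with branches scaled by 1/sqrt(depth) and compares the empirical covariance kernel along two different routes toward the joint limit:

```python
import numpy as np

def resnet_covariance(x1, x2, width, depth, seed=0):
    """Empirical covariance <h_L(x1), h_L(x2)> / width after a ReLU residual
    network whose branches are scaled by 1/sqrt(depth)."""
    rng = np.random.default_rng(seed)
    # Shared random embedding of both inputs into the hidden dimension.
    W0 = rng.normal(0.0, 1.0 / np.sqrt(x1.size), (width, x1.size))
    h1, h2 = W0 @ x1, W0 @ x2
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(2.0 / width), (width, width))
        # Branch scaling 1/sqrt(depth) avoids exploding behaviour.
        h1 = h1 + np.maximum(W @ h1, 0.0) / np.sqrt(depth)
        h2 = h2 + np.maximum(W @ h2, 0.0) / np.sqrt(depth)
    return float(h1 @ h2) / width

x1 = np.array([1.0, 0.0])
x2 = np.array([0.6, 0.8])
# Two different routes toward the joint limit:
print(resnet_covariance(x1, x2, width=2048, depth=16))   # width-dominant
print(resnet_covariance(x1, x2, width=256, depth=256))   # depth-dominant
```

The two printed values should roughly agree, reflecting the claim that the limiting covariance structure does not depend on how the width and depth limits are taken.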
Related papers
- Super Consistency of Neural Network Landscapes and Learning Rate Transfer [72.54450821671624]
We study the landscape through the lens of the loss Hessian.
We find that certain spectral properties under $\mu$P are largely independent of the size of the network.
We show that in the Neural Tangent Kernel (NTK) and other scaling regimes, the sharpness exhibits very different dynamics at different scales.
arXiv Detail & Related papers (2024-02-27T12:28:01Z)
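The sharpness tracked in the entry above is the largest eigenvalue of the loss Hessian. A minimal, self-contained way to estimate it is power iteration on finite-difference Hessian-vector products; this is generic code on a toy quadratic, not the paper's method or setup:

```python
import numpy as np

def num_grad(f, x, h=1e-5):
    # Central-difference gradient of a scalar function f at x.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def sharpness(f, x, iters=100, h=1e-4):
    # Power iteration on Hessian-vector products Hv ~ (g(x+hv) - g(x-hv)) / 2h.
    rng = np.random.default_rng(0)
    v = rng.normal(size=x.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = (num_grad(f, x + h * v) - num_grad(f, x - h * v)) / (2 * h)
        lam = float(v @ hv)              # Rayleigh quotient estimate
        v = hv / np.linalg.norm(hv)
    return lam

# Toy quadratic with Hessian diag(3, 1): the top eigenvalue is 3.
f = lambda x: 1.5 * x[0] ** 2 + 0.5 * x[1] ** 2
print(sharpness(f, np.array([0.3, -0.2])))  # ~3.0
```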
- Addressing caveats of neural persistence with deep graph persistence [54.424983583720675]
We find that the variance of network weights and the spatial concentration of large weights are the main factors that impact neural persistence.
We propose an extension of the filtration underlying neural persistence to the whole neural network instead of single layers.
This yields our deep graph persistence measure, which implicitly incorporates persistent paths through the network and alleviates variance-related issues.
arXiv Detail & Related papers (2023-07-20T13:34:11Z)
- Network Degeneracy as an Indicator of Training Performance: Comparing Finite and Infinite Width Angle Predictions [3.04585143845864]
We show that as networks get deeper, they are more susceptible to becoming degenerate.
We use a simple algorithm that can accurately predict the level of degeneracy for any given fully connected ReLU network architecture.
arXiv Detail & Related papers (2023-06-02T13:02:52Z)
- Depth Degeneracy in Neural Networks: Vanishing Angles in Fully Connected ReLU Networks on Initialization [5.678271181959529]
We study the evolution of the angle between two inputs to a ReLU neural network as a function of the number of layers.
We validate our theoretical results with Monte Carlo experiments and show that our results accurately approximate finite network behaviour.
We also empirically investigate how the depth degeneracy phenomenon can negatively impact training of real networks.
arXiv Detail & Related papers (2023-02-20T01:30:27Z)
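The vanishing-angle phenomenon in this entry (and the degeneracy in the previous one) can be reproduced with the standard infinite-width angle recursion for ReLU networks at He initialization (the degree-1 arc-cosine kernel map); the Monte Carlo check and all sizes below are illustrative choices, not taken from the paper:

```python
import numpy as np

def angle_map(theta):
    # Infinite-width update of the angle between two inputs after one ReLU
    # layer at He initialization (degree-1 arc-cosine kernel map).
    c = (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi
    return np.arccos(np.clip(c, -1.0, 1.0))

def mc_angle(theta0, depth, width=2000, seed=0):
    # Monte Carlo check: push two unit inputs at angle theta0 through a random
    # fully connected ReLU network and measure the angle at the last layer.
    rng = np.random.default_rng(seed)
    h1 = np.array([1.0, 0.0])
    h2 = np.array([np.cos(theta0), np.sin(theta0)])
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(2.0 / h1.size), (width, h1.size))
        h1, h2 = np.maximum(W @ h1, 0.0), np.maximum(W @ h2, 0.0)
    c = h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2))
    return float(np.arccos(np.clip(c, -1.0, 1.0)))

theta = np.pi / 3
for _ in range(5):
    theta = angle_map(theta)        # deterministic prediction after 5 layers
print(theta, mc_angle(np.pi / 3, depth=5))  # both angles shrink toward 0
```

Iterating the map shows the angle collapsing toward zero with depth, which is the degeneracy both papers study.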
- On the Neural Tangent Kernel Analysis of Randomly Pruned Neural Networks [91.3755431537592]
We study how random pruning of the weights affects a neural network's neural tangent kernel (NTK).
In particular, this work establishes an equivalence of the NTKs between a fully-connected neural network and its randomly pruned version.
arXiv Detail & Related papers (2022-03-27T15:22:19Z)
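A minimal sketch of the claimed equivalence for a one-hidden-layer ReLU network, assuming IID weight pruning with 1/sqrt(keep probability) rescaling (an assumption for this sketch; the paper states the precise conditions and regime):

```python
import numpy as np

def empirical_ntk(x1, x2, W, v, mask=None, scale=1.0):
    # NTK of f(x) = v . relu((mask * W) x * scale) / sqrt(m), with gradients
    # taken with respect to both W and v, for a one-hidden-layer ReLU net.
    m = W.shape[0]
    M = np.ones_like(W) if mask is None else mask
    def grad_features(x):
        pre = (M * W) @ x * scale
        dv = np.maximum(pre, 0.0) / np.sqrt(m)                    # d f / d v
        dW = np.outer(v * (pre > 0), x) * M * scale / np.sqrt(m)  # d f / d W
        return np.concatenate([dv, dW.ravel()])
    return float(grad_features(x1) @ grad_features(x2))

rng = np.random.default_rng(0)
d, m, p = 64, 8192, 0.5          # input dim, width, keep probability
W, v = rng.normal(size=(m, d)), rng.normal(size=m)
mask = (rng.random((m, d)) < p).astype(float)
x1, x2 = rng.normal(size=d), rng.normal(size=d)

full = empirical_ntk(x1, x2, W, v)
# Rescaling pruned weights by 1/sqrt(p) compensates for removed connections.
pruned = empirical_ntk(x1, x2, W, v, mask=mask, scale=1.0 / np.sqrt(p))
print(full, pruned)  # approximately equal at large width
```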
- Stochastic Neural Networks with Infinite Width are Deterministic [7.07065078444922]
We study stochastic neural networks, a main type of neural network in use.
We prove that as the width of an optimized neural network tends to infinity, its predictive variance on the training set decreases to zero.
arXiv Detail & Related papers (2022-01-30T04:52:31Z)
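A toy version of the vanishing-variance claim: a one-hidden-layer dropout network whose readout is optimized in closed form for the dropout-averaged squared loss. The data, scalings, and architecture here are illustrative assumptions, not the paper's setting:

```python
import numpy as np

def optimized_dropout_variance(width, n=5, p=0.5, seed=0):
    # f(x) = v . (h(x) * mask) / p with dropout on the hidden features h; only
    # the readout v is trained, in closed form, on the mask-averaged loss
    # E[(v.(h*m)/p - y)^2] = (v.h - y)^2 + sum_j d_j v_j^2, a ridge problem
    # with the diagonal regularizer d_j = ((1-p)/p) * sum_i h_j(x_i)^2.
    rng = np.random.default_rng(seed)
    X, y = rng.normal(size=(n, 2)), rng.normal(size=n)
    W = rng.normal(size=(width, 2))
    H = np.maximum(X @ W.T, 0.0)               # n x width hidden features
    d = ((1 - p) / p) * (H ** 2).sum(axis=0)
    H, d = H[:, d > 0], d[d > 0]               # drop units dead on all inputs
    # Solve (H^T H + D) v = H^T y via the Woodbury identity (n x n system).
    alpha = np.linalg.solve((H / d) @ H.T + np.eye(n), y)
    v = (H / d).T @ alpha
    # Predictive variance over dropout masks, averaged over training points.
    return float((((1 - p) / p) * v ** 2 * H ** 2).sum(axis=1).mean())

for width in [100, 1000, 10000]:
    print(width, optimized_dropout_variance(width))  # shrinks toward zero
```

The printed variances decay roughly like 1/width, matching the qualitative claim that the optimized network becomes deterministic on the training set as width grows.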
- The Limitations of Large Width in Neural Networks: A Deep Gaussian Process Perspective [34.67386186205545]
This paper decouples capacity and width via the generalization of neural networks to Deep Gaussian Processes (Deep GPs).
Surprisingly, we prove that even nonparametric Deep GPs converge to Gaussian processes, effectively becoming shallower without any increase in representational power.
We find there is a "sweet spot" that maximizes test set performance before the limiting GP behavior prevents adaptability, occurring at width = 1 or width = 2 for nonparametric Deep GPs.
arXiv Detail & Related papers (2021-06-11T17:58:58Z)
- Redundant representations help generalization in wide neural networks [71.38860635025907]
We study the last hidden layer representations of various state-of-the-art convolutional neural networks.
We find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise.
arXiv Detail & Related papers (2021-06-07T10:18:54Z)
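A hypothetical sketch of how such redundant neuron groups could be detected, run on synthetic activations built to have the reported structure (duplicated signal features plus independent noise); nothing here comes from the paper:

```python
import numpy as np

# Synthetic last-layer activations: 3 "signal" features, each copied 5 times
# with independent noise, mimicking the redundant-group structure reported.
rng = np.random.default_rng(0)
n_samples, n_groups, copies = 500, 3, 5
signals = rng.normal(size=(n_samples, n_groups))
acts = (np.repeat(signals, copies, axis=1)
        + 0.3 * rng.normal(size=(n_samples, n_groups * copies)))

corr = np.corrcoef(acts.T)                 # neuron-neuron correlations
# Greedy grouping: a neuron joins the first existing group whose
# representative it correlates with above the threshold, else starts one.
groups, thresh = [], 0.8
for j in range(corr.shape[0]):
    for g in groups:
        if corr[j, g[0]] > thresh:
            g.append(j)
            break
    else:
        groups.append([j])
print(groups)  # recovers the 3 planted groups of 5 redundant neurons each
```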
- Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
- On Random Kernels of Residual Architectures [93.94469470368988]
We derive finite width and depth corrections for the Neural Tangent Kernel (NTK) of ResNets and DenseNets.
Our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity.
In DenseNets, however, convergence of the NTK to its limit as the width tends to infinity is guaranteed.
arXiv Detail & Related papers (2020-01-28T16:47:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.