Depth Degeneracy in Neural Networks: Vanishing Angles in Fully Connected
ReLU Networks on Initialization
- URL: http://arxiv.org/abs/2302.09712v2
- Date: Fri, 26 May 2023 18:07:07 GMT
- Title: Depth Degeneracy in Neural Networks: Vanishing Angles in Fully Connected
ReLU Networks on Initialization
- Authors: Cameron Jakub and Mihai Nica
- Abstract summary: Many properties of deep neural networks are not yet theoretically understood.
In this paper, we examine the evolution of the angle between two inputs to a ReLU neural network as a function of the number of layers.
We validate our theoretical results with Monte Carlo experiments and show that our results accurately approximate finite network behaviour.
- Score: 3.04585143845864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite remarkable performance on a variety of tasks, many properties of deep
neural networks are not yet theoretically understood. One such mystery is the
depth degeneracy phenomenon: the deeper you make your network, the closer your
network is to a constant function on initialization. In this paper, we examine
the evolution of the angle between two inputs to a ReLU neural network as a
function of the number of layers. By using combinatorial expansions, we find
precise formulas for how fast this angle goes to zero as depth increases. These
formulas capture microscopic fluctuations that are not visible in the popular
framework of infinite width limits, and lead to qualitatively different
predictions. We validate our theoretical results with Monte Carlo experiments
and show that our results accurately approximate finite network behaviour. The
formulas are given in terms of the mixed moments of correlated Gaussians passed
through the ReLU function. We also find a surprising combinatorial connection
between these mixed moments and the Bessel numbers that allows us to explicitly
evaluate these moments.
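The depth degeneracy phenomenon described above can be observed directly with a small Monte Carlo simulation: push two inputs at a fixed initial angle through a randomly initialized fully connected ReLU network and track the angle between their hidden representations layer by layer. This is a minimal illustrative sketch, not the paper's exact experimental setup; the width, depth, and He-style initialization are assumptions chosen for the demonstration.

```python
import numpy as np

def angle(u, v):
    """Angle between two vectors, in radians."""
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

def simulate_angles(width=256, depth=30, theta0=np.pi / 4, seed=0):
    """Track the angle between two inputs as they pass through a random
    fully connected ReLU network with He-style init (variance 2/width)."""
    rng = np.random.default_rng(seed)
    # Two unit inputs at angle theta0.
    x = np.zeros(width); x[0] = 1.0
    y = np.zeros(width); y[0] = np.cos(theta0); y[1] = np.sin(theta0)
    angles = [angle(x, y)]
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))
        x = np.maximum(W @ x, 0.0)  # ReLU layer
        y = np.maximum(W @ y, 0.0)
        angles.append(angle(x, y))
    return angles

angles = simulate_angles()
# On average, the angle shrinks toward zero as depth grows.
```

Plotting the returned sequence for several seeds shows both the decay toward zero and the layer-to-layer fluctuations that the paper's finite-width formulas capture but infinite width limits miss.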
Related papers
- Asymptotics of Learning with Deep Structured (Random) Features [9.366617422860543]
For a large class of feature maps we provide a tight characterisation of the test error associated with learning the readout layer.
In some cases our results can capture feature maps learned by deep, finite-width neural networks trained under gradient descent.
arXiv Detail & Related papers (2024-02-21T18:35:27Z)
- Generative Kaleidoscopic Networks [2.321684718906739]
We utilize this property of neural networks to design a dataset kaleidoscope, termed 'Generative Kaleidoscopic Networks'.
We observed this phenomenon to varying degrees in other deep learning architectures such as CNNs, Transformers, and U-Nets.
arXiv Detail & Related papers (2024-02-19T02:48:40Z)
- Commutative Width and Depth Scaling in Deep Neural Networks [6.019182604573028]
This paper is the second in a series about commutativity of infinite width and depth limits in deep neural networks.
We formally introduce and define the commutativity framework, and discuss its implications on neural network design and scaling.
arXiv Detail & Related papers (2023-10-02T22:39:09Z)
- Deep Neural Networks Tend To Extrapolate Predictably [51.303814412294514]
Neural network predictions tend to be unpredictable and overconfident when faced with out-of-distribution (OOD) inputs.
We observe that neural network predictions often tend towards a constant value as input data becomes increasingly OOD.
We show how one can leverage our insights in practice to enable risk-sensitive decision-making in the presence of OOD inputs.
arXiv Detail & Related papers (2023-10-02T03:25:32Z)
- Network Degeneracy as an Indicator of Training Performance: Comparing Finite and Infinite Width Angle Predictions [3.04585143845864]
We show that as networks get deeper and deeper, they are more susceptible to becoming degenerate.
We use a simple algorithm that can accurately predict the level of degeneracy for any given fully connected ReLU network architecture.
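Predicting the level of degeneracy in the infinite-width setting can be sketched by iterating the standard degree-1 arc-cosine kernel correlation map for ReLU layers; this is a well-known closed-form map, not the specific algorithm from the paper above, and is included only as an assumed illustration of how such a depth-wise prediction works.

```python
import numpy as np

def relu_corr_map(rho):
    """Infinite-width correlation map for one ReLU layer (degree-1
    arc-cosine kernel): maps this layer's input correlation to the
    correlation of the post-activation representations."""
    theta = np.arccos(np.clip(rho, -1.0, 1.0))
    return (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi

def predict_angle(theta0, depth):
    """Iterate the infinite-width map to predict the angle between two
    inputs after `depth` ReLU layers."""
    rho = np.cos(theta0)
    for _ in range(depth):
        rho = relu_corr_map(rho)
    return np.arccos(np.clip(rho, -1.0, 1.0))
```

Starting from orthogonal inputs (`theta0 = pi/2`), the predicted angle decreases toward zero with depth; the main paper's contribution is that finite-width networks deviate from this idealized trajectory in a quantifiable way.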
arXiv Detail & Related papers (2023-06-02T13:02:52Z)
- Bayesian Interpolation with Deep Linear Networks [92.1721532941863]
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory.
We show that linear networks make provably optimal predictions at infinite depth.
We also show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth.
arXiv Detail & Related papers (2022-12-29T20:57:46Z)
- Deep Architecture Connectivity Matters for Its Convergence: A Fine-Grained Analysis [94.64007376939735]
We theoretically characterize the impact of connectivity patterns on the convergence of deep neural networks (DNNs) under gradient descent training.
We show that by a simple filtration on "unpromising" connectivity patterns, we can trim down the number of models to evaluate.
arXiv Detail & Related papers (2022-05-11T17:43:54Z)
- On the Neural Tangent Kernel Analysis of Randomly Pruned Neural Networks [91.3755431537592]
We study how random pruning of the weights affects a neural network's neural tangent kernel (NTK).
In particular, this work establishes an equivalence of the NTKs between a fully-connected neural network and its randomly pruned version.
arXiv Detail & Related papers (2022-03-27T15:22:19Z)
- Convergence of Deep Convolutional Neural Networks [2.5991265608180396]
Convergence of deep neural networks as the depth of the networks tends to infinity is fundamental in building the mathematical foundation for deep learning.
We first study convergence of general ReLU networks with increasing widths and then apply the results obtained to deep convolutional neural networks.
arXiv Detail & Related papers (2021-09-28T07:48:17Z)
- Redundant representations help generalization in wide neural networks [71.38860635025907]
We study the last hidden layer representations of various state-of-the-art convolutional neural networks.
We find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise.
arXiv Detail & Related papers (2021-06-07T10:18:54Z)
- A Convergence Theory Towards Practical Over-parameterized Deep Neural Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks with quadratic widths in the sample size and linear in their depth at a time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.