The Limitations of Large Width in Neural Networks: A Deep Gaussian
Process Perspective
- URL: http://arxiv.org/abs/2106.06529v1
- Date: Fri, 11 Jun 2021 17:58:58 GMT
- Title: The Limitations of Large Width in Neural Networks: A Deep Gaussian
Process Perspective
- Authors: Geoff Pleiss, John P. Cunningham
- Abstract summary: This paper decouples capacity and width via the generalization of neural networks to Deep Gaussian Processes (Deep GP).
Surprisingly, we prove that even nonparametric Deep GPs converge to Gaussian processes, effectively becoming shallower without any increase in representational power.
We find there is a "sweet spot" that maximizes test set performance before the limiting GP behavior prevents adaptability, occurring at width = 1 or width = 2 for nonparametric Deep GPs.
- Score: 34.67386186205545
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large width limits have been a recent focus of deep learning research: modulo
computational practicalities, do wider networks outperform narrower ones?
Answering this question has been challenging, as conventional networks gain
representational power with width, potentially masking any negative effects.
Our analysis in this paper decouples capacity and width via the generalization
of neural networks to Deep Gaussian Processes (Deep GP), a class of
hierarchical models that subsume neural nets. In doing so, we aim to understand
how width affects standard neural networks once they have sufficient capacity
for a given modeling task. Our theoretical and empirical results on Deep GP
suggest that large width is generally detrimental to hierarchical models.
Surprisingly, we prove that even nonparametric Deep GPs converge to Gaussian
processes, effectively becoming shallower without any increase in
representational power. The posterior, which corresponds to a mixture of
data-adaptable basis functions, becomes less data-dependent with width. Our
tail analysis demonstrates that width and depth have opposite effects: depth
accentuates a model's non-Gaussianity, while width makes models increasingly
Gaussian. We find there is a "sweet spot" that maximizes test set performance
before the limiting GP behavior prevents adaptability, occurring at width = 1
or width = 2 for nonparametric Deep GPs. These results make strong predictions
about the same phenomenon in conventional neural networks: we show empirically
that many neural network architectures need 10-500 hidden units for sufficient
capacity, depending on the dataset, but further width degrades
test performance.
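As a rough sketch of the Deep GP construction the abstract refers to (the notation below is illustrative and not taken from the paper), a depth-$L$, width-$H$ Deep GP can be written as a composition of layers whose coordinates are independent GP draws:
$$f(x) = f^{(L)} \circ \cdots \circ f^{(1)}(x), \qquad f^{(\ell)} = \big(f^{(\ell)}_1, \ldots, f^{(\ell)}_H\big), \qquad f^{(\ell)}_h \sim \mathcal{GP}\big(0, k^{(\ell)}\big) \text{ i.i.d.}$$
In this notation, the limiting result says that as $H \to \infty$ with $L$ fixed, the composition converges to a single Gaussian process, so the intermediate layers no longer contribute data-adaptable, non-Gaussian structure; the reported "sweet spot" at $H = 1$ or $H = 2$ is the regime in which the hierarchy remains genuinely non-Gaussian.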
Related papers
- Wide Neural Networks as Gaussian Processes: Lessons from Deep
Equilibrium Models [16.07760622196666]
We study the deep equilibrium model (DEQ), an infinite-depth neural network with shared weight matrices across layers.
Our analysis reveals that as the width of the DEQ layers approaches infinity, the DEQ converges to a Gaussian process.
Remarkably, this convergence holds even when the limits of depth and width are interchanged.
arXiv Detail & Related papers (2023-10-16T19:00:43Z)
- Feature-Learning Networks Are Consistent Across Widths At Realistic Scales [72.27228085606147]
We study the effect of width on the dynamics of feature-learning neural networks across a variety of architectures and datasets.
Early in training, wide neural networks trained on online data not only have identical loss curves but also agree in their point-wise test predictions throughout training.
We observe, however, that ensembles of narrower networks perform worse than a single wide network.
arXiv Detail & Related papers (2023-05-28T17:09:32Z)
- Width and Depth Limits Commute in Residual Networks [26.97391529844503]
We show that taking the width and depth to infinity in a deep neural network with skip connections results in the same covariance structure no matter how that limit is taken.
This explains why the standard infinite-width-then-depth approach provides practical insights even for networks with depth of the same order as width.
We conduct extensive simulations that show an excellent match with our theoretical findings.
arXiv Detail & Related papers (2023-02-01T13:57:32Z)
- Bayesian Interpolation with Deep Linear Networks [92.1721532941863]
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory.
We show that linear networks make provably optimal predictions at infinite depth.
We also show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth.
arXiv Detail & Related papers (2022-12-29T20:57:46Z)
- The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich Regimes [75.59720049837459]
We study the transition from infinite-width behavior to the variance-limited regime as a function of sample size $P$ and network width $N$.
We find that finite-size effects can become relevant for very small datasets, on the order of $P^* \sim \sqrt{N}$, for regression with ReLU networks.
arXiv Detail & Related papers (2022-12-23T04:48:04Z)
- Contrasting random and learned features in deep Bayesian linear regression [12.234742322758418]
We study how the ability to learn affects the generalization performance of a simple class of models.
By comparing deep random feature models to deep networks in which all layers are trained, we provide a detailed characterization of the interplay between width, depth, data density, and prior mismatch.
arXiv Detail & Related papers (2022-03-01T15:51:29Z)
- Redundant representations help generalization in wide neural networks [71.38860635025907]
We study the last hidden layer representations of various state-of-the-art convolutional neural networks.
We find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise.
arXiv Detail & Related papers (2021-06-07T10:18:54Z)
- Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
- On Random Kernels of Residual Architectures [93.94469470368988]
We derive finite width and depth corrections for the Neural Tangent Kernel (NTK) of ResNets and DenseNets.
Our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity.
In DenseNets, however, convergence of the NTK to its limit as the width tends to infinity is guaranteed.
arXiv Detail & Related papers (2020-01-28T16:47:53Z)