Asymptotics of Wide Convolutional Neural Networks
- URL: http://arxiv.org/abs/2008.08675v1
- Date: Wed, 19 Aug 2020 21:22:19 GMT
- Title: Asymptotics of Wide Convolutional Neural Networks
- Authors: Anders Andreassen, Ethan Dyer
- Abstract summary: We study scaling laws for wide CNNs and networks with skip connections.
We find that the difference in performance between finite and infinite width models vanishes at a definite rate with respect to model width.
- Score: 18.198962344790377
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Wide neural networks have proven to be a rich class of architectures for both
theory and practice. Motivated by the observation that finite width
convolutional networks appear to outperform infinite width networks, we study
scaling laws for wide CNNs and networks with skip connections. Following the
approach of (Dyer & Gur-Ari, 2019), we present a simple diagrammatic recipe to
derive the asymptotic width dependence for many quantities of interest. These
scaling relationships provide a solvable description for the training dynamics
of wide convolutional networks. We test these relations across a broad range of
architectures. In particular, we find that the difference in performance
between finite and infinite width models vanishes at a definite rate with
respect to model width. Nonetheless, this relation is consistent with finite
width models generalizing either better or worse than their infinite width
counterparts, and we provide examples where the relative performance depends on
the optimization details.
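As a quick illustration of the kind of width scaling described in the abstract, below is a hedged sketch (not taken from the paper) that fits the gap between finite-width and infinite-width test loss to a power law c * n^(-alpha) in the width n. The widths, loss values, and the fitted exponent are all synthetic placeholders chosen for illustration.

```python
# Hypothetical sketch: fit the finite- vs infinite-width performance gap
# to a power law in width, as suggested by the scaling relation in the abstract.
# All numbers below are synthetic placeholders, not results from the paper.
import numpy as np
from scipy.optimize import curve_fit

widths = np.array([64, 128, 256, 512, 1024, 2048], dtype=float)              # hypothetical model widths n
finite_loss = np.array([0.310, 0.295, 0.288, 0.284, 0.282, 0.281])           # synthetic finite-width test losses
infinite_loss = 0.280                                                        # synthetic infinite-width (kernel) loss

gap = np.abs(finite_loss - infinite_loss)                                    # |L(n) - L(infinity)|

def power_law(n, c, alpha):
    # The gap is assumed to vanish as c * n**(-alpha); the exponent is left
    # as a free fit parameter rather than fixed in advance.
    return c * n ** (-alpha)

(c_hat, alpha_hat), _ = curve_fit(power_law, widths, gap, p0=[1.0, 1.0])
print(f"fitted gap ~ {c_hat:.3g} * n^(-{alpha_hat:.2f})")
```

With the placeholder numbers above the fitted exponent comes out close to 1; with real measurements the exponent is whatever the data dictate.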
Related papers
- Robust Learning in Bayesian Parallel Branching Graph Neural Networks: The Narrow Width Limit [4.373803477995854]
We investigate the narrow width limit of the Bayesian Parallel Branching Graph Neural Network (BPB-GNN)
We show that when the width of a BPB-GNN is significantly smaller than the number of training examples, each branch exhibits more robust learning.
Our results characterize a newly defined narrow-width regime for parallel branching networks in general.
arXiv Detail & Related papers (2024-07-26T15:14:22Z) - Feature-Learning Networks Are Consistent Across Widths At Realistic Scales [72.27228085606147]
We study the effect of width on the dynamics of feature-learning neural networks across a variety of architectures and datasets.
Early in training, wide neural networks trained on online data not only have identical loss curves but also agree in their point-wise test predictions throughout training.
We observe, however, that ensembles of narrower networks perform worse than a single wide network.
arXiv Detail & Related papers (2023-05-28T17:09:32Z) - Bayesian Interpolation with Deep Linear Networks [92.1721532941863]
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory.
We show that linear networks make provably optimal predictions at infinite depth.
We also show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth.
arXiv Detail & Related papers (2022-12-29T20:57:46Z) - An Empirical Analysis of the Advantages of Finite- v.s. Infinite-Width Bayesian Neural Networks [25.135652514472238]
We empirically compare finite- and infinite-width BNNs, and provide quantitative and qualitative explanations for their performance difference.
We find that when the model is mis-specified, increasing width can hurt BNN performance.
In these cases, we provide evidence that finite-width BNNs generalize better, partly because properties of their frequency spectrum allow them to adapt under model mismatch.
arXiv Detail & Related papers (2022-11-16T20:07:55Z) - Deep Architecture Connectivity Matters for Its Convergence: A Fine-Grained Analysis [94.64007376939735]
We theoretically characterize the impact of connectivity patterns on the convergence of deep neural networks (DNNs) under gradient descent training.
We show that by a simple filtration on "unpromising" connectivity patterns, we can trim down the number of models to evaluate.
arXiv Detail & Related papers (2022-05-11T17:43:54Z) - The Limitations of Large Width in Neural Networks: A Deep Gaussian Process Perspective [34.67386186205545]
This paper decouples capacity and width via the generalization of neural networks to Deep Gaussian Processes (Deep GP).
Surprisingly, we prove that even nonparametric Deep GPs converge to Gaussian processes, effectively becoming shallower without any increase in representational power.
We find there is a "sweet spot" that maximizes test set performance before the limiting GP behavior prevents adaptability, occurring at width = 1 or width = 2 for nonparametric Deep GPs.
arXiv Detail & Related papers (2021-06-11T17:58:58Z) - Explaining Neural Scaling Laws [17.115592382420626]
Population loss of trained deep neural networks often follows precise power-law scaling relations.
We propose a theory that explains the origins of and connects these scaling laws.
We identify variance-limited and resolution-limited scaling behavior for both dataset and model size (schematic forms for these regimes are sketched after this list).
arXiv Detail & Related papers (2021-02-12T18:57:46Z) - A Convergence Theory Towards Practical Over-parameterized Deep Neural Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks whose width is quadratic in the sample size and linear in the depth, in time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z) - Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z) - On Infinite-Width Hypernetworks [101.03630454105621]
We show that hypernetworks do not guarantee convergence to a global minimum under gradient descent.
We identify the functional priors of these architectures by deriving their corresponding GP and NTK kernels.
As part of this study, we make a mathematical contribution by deriving tight bounds on high order Taylor terms of standard fully connected ReLU networks.
arXiv Detail & Related papers (2020-03-27T00:50:29Z)
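Several entries above reference power-law scaling relations; as a quick reference for the "Explaining Neural Scaling Laws" entry, here is a schematic LaTeX summary of what is typically meant by variance-limited and resolution-limited scaling. The notation (x, L_infinity, alpha_x) is illustrative and not quoted from the listed papers.

```latex
% Schematic scaling forms (illustrative notation, not quoted from the listed papers).
\documentclass{article}
\usepackage{amsmath}
\begin{document}
Let $x$ denote either the dataset size $D$ or the model size (e.g.\ width) $N$,
and let $L_\infty$ be the loss in the corresponding infinite limit. Then, schematically,
\begin{align}
  L(x) - L_\infty &\propto x^{-1}         && \text{(variance-limited regime)} \\
  L(x) - L_\infty &\propto x^{-\alpha_x}  && \text{(resolution-limited regime)}
\end{align}
where $\alpha_x$ is a data- and architecture-dependent exponent.
\end{document}
```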
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.