Robust Learning in Bayesian Parallel Branching Graph Neural Networks: The Narrow Width Limit
- URL: http://arxiv.org/abs/2407.18807v2
- Date: Fri, 4 Oct 2024 02:40:00 GMT
- Title: Robust Learning in Bayesian Parallel Branching Graph Neural Networks: The Narrow Width Limit
- Authors: Zechen Zhang, Haim Sompolinsky
- Abstract summary: We investigate the narrow width limit of the Bayesian Parallel Branching Graph Neural Network (BPB-GNN).
We show that when the width of a BPB-GNN is significantly smaller than the number of training examples, each branch exhibits more robust learning.
Our results characterize a newly defined narrow-width regime for parallel branching networks in general.
- Score: 4.373803477995854
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The infinite width limit of random neural networks is known to result in Neural Networks as Gaussian Process (NNGP) (Lee et al. [2018]), characterized by task-independent kernels. It is widely accepted that larger network widths contribute to improved generalization (Park et al. [2019]). However, this work challenges this notion by investigating the narrow width limit of the Bayesian Parallel Branching Graph Neural Network (BPB-GNN), an architecture that resembles residual networks. We demonstrate that when the width of a BPB-GNN is significantly smaller than the number of training examples, each branch exhibits more robust learning due to a symmetry breaking of branches in kernel renormalization. Surprisingly, the performance of a BPB-GNN in the narrow width limit is generally superior or comparable to that achieved in the wide width limit in bias-limited scenarios. Furthermore, the readout norms of each branch in the narrow width limit are mostly independent of the architectural hyperparameters but generally reflective of the nature of the data. Our results characterize a newly defined narrow-width regime for parallel branching networks in general.
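The abstract describes the architecture only at a high level, so the following is a minimal sketch of what a parallel-branching GNN with per-branch readouts might look like: each branch propagates the node features with a different power of a normalized adjacency matrix and contributes its own linear readout, and the per-branch readout norms are the quantities the abstract refers to. The propagation rule, the linear branches, and all function names are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch, not the authors' code: a parallel-branching GNN whose
# branches see increasing powers of the (assumed precomputed) normalized
# adjacency matrix and whose branch readouts are summed.
import numpy as np

def parallel_branching_gnn(A_norm, X, hidden_weights, readout_weights):
    """A_norm: (n, n) normalized adjacency; X: (n, d) node features;
    hidden_weights: list of (d, width) matrices, one per branch;
    readout_weights: list of (width,) vectors, one per branch."""
    n, d = X.shape
    outputs = np.zeros(n)
    H = X
    for W, a in zip(hidden_weights, readout_weights):
        width = W.shape[1]
        branch_features = (H @ W) / np.sqrt(d)   # linear branch; a nonlinearity could go here
        outputs += branch_features @ a / np.sqrt(width)
        H = A_norm @ H                           # the next branch sees one more propagation step
    return outputs

def branch_readout_norms(readout_weights):
    # Per-branch readout norms |a_l|^2 / width: the abstract's claim is that at
    # narrow width these reflect the data rather than architectural hyperparameters.
    return [float(a @ a) / a.shape[0] for a in readout_weights]
```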
Related papers
- Feature-Learning Networks Are Consistent Across Widths At Realistic Scales [72.27228085606147]
We study the effect of width on the dynamics of feature-learning neural networks across a variety of architectures and datasets.
Early in training, wide neural networks trained on online data not only have identical loss curves but also agree in their point-wise test predictions throughout training.
We observe, however, that ensembles of narrower networks perform worse than a single wide network.
arXiv Detail & Related papers (2023-05-28T17:09:32Z)
- Globally Gated Deep Linear Networks [3.04585143845864]
We introduce Globally Gated Deep Linear Networks (GGDLNs), in which gating units are shared among all processing units in each layer; a minimal sketch of such a layer follows this entry.
We derive exact equations for the generalization properties in these networks in the finite-width thermodynamic limit.
Our work is the first exact theoretical solution of learning in a family of nonlinear networks with finite width.
arXiv Detail & Related papers (2022-10-31T16:21:56Z)
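As a concrete reading of the gating idea above, here is a toy globally gated deep linear layer: the trainable path is purely linear, and all units in a layer are multiplied by one shared gate. Computing the gate as a sigmoid of a fixed random projection of the input is an assumption for illustration; the paper's exact gating scheme may differ.

```python
# Toy sketch (an assumption, not the paper's construction): deep linear layers whose
# units are all multiplied by one shared, untrainable gate per layer.
import numpy as np

rng = np.random.default_rng(0)

def ggdln_forward(x, weights, gate_vectors):
    """x: (d,) input; weights: list of trainable (n_out, n_in) matrices;
    gate_vectors: list of fixed (d,) vectors defining each layer's shared gate."""
    h = x
    for W, u in zip(weights, gate_vectors):
        gate = 1.0 / (1.0 + np.exp(-u @ x))  # one scalar gate, shared by all units in the layer
        h = gate * (W @ h)                   # the trainable path remains linear in the weights
    return h

# Example: a 3-layer network on 5-dimensional inputs with widths 5 -> 8 -> 8 -> 1.
widths = [5, 8, 8, 1]
weights = [rng.normal(size=(n_out, n_in)) / np.sqrt(n_in)
           for n_in, n_out in zip(widths[:-1], widths[1:])]
gate_vectors = [rng.normal(size=5) for _ in range(3)]
y = ggdln_forward(rng.normal(size=5), weights, gate_vectors)
```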
- Wide Bayesian neural networks have a simple weight posterior: theory and accelerated sampling [48.94555574632823]
Repriorisation transforms a Bayesian neural network (BNN) posterior to a distribution whose KL divergence to the BNN prior vanishes as layer widths grow.
We develop a Markov chain Monte Carlo (MCMC) posterior sampling algorithm that mixes faster as the BNN becomes wider.
We observe up to 50x higher effective sample size relative to no reparametrisation for both fully-connected and residual networks.
arXiv Detail & Related papers (2022-06-15T17:11:08Z)
- A Convergence Theory Towards Practical Over-parameterized Deep Neural Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks whose width is quadratic in the sample size and linear in their depth, in time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z)
- Tight Bounds on the Smallest Eigenvalue of the Neural Tangent Kernel for Deep ReLU Networks [21.13299067136635]
We provide tight bounds on the smallest eigenvalue of NTK matrices for deep ReLU networks.
In the finite-width setting, the network architectures we consider are quite general.
arXiv Detail & Related papers (2020-12-21T19:32:17Z)
- Asymptotics of Wide Convolutional Neural Networks [18.198962344790377]
We study scaling laws for wide CNNs and networks with skip connections.
We find that the difference in performance between finite and infinite width models vanishes at a definite rate with respect to model width.
arXiv Detail & Related papers (2020-08-19T21:22:19Z)
- Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
- Bayesian Deep Ensembles via the Neural Tangent Kernel [49.569912265882124]
We explore the link between deep ensembles and Gaussian processes (GPs) through the lens of the Neural Tangent Kernel (NTK).
We introduce a simple modification to standard deep ensembles training, through the addition of a computationally tractable, randomised and untrainable function to each ensemble member; a toy sketch of this idea appears below.
We prove that our Bayesian deep ensembles make more conservative predictions than standard deep ensembles in the infinite width limit.
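As mentioned above, here is a toy sketch of the "randomised, untrainable function added to each ensemble member" idea. The two-layer ReLU form of the frozen function, the scalings, and the class names are assumptions for illustration rather than the paper's exact construction.

```python
# Toy sketch (assumptions throughout, not the paper's exact construction): each
# ensemble member predicts with its trainable network plus a frozen random function.
import numpy as np

rng = np.random.default_rng(0)

def make_frozen_random_function(d_in, width=256):
    """Return a fixed, randomly initialized two-layer ReLU function; it is never trained."""
    W = rng.normal(size=(width, d_in)) / np.sqrt(d_in)
    a = rng.normal(size=width) / np.sqrt(width)
    return lambda x: float(a @ np.maximum(W @ x, 0.0))

class EnsembleMember:
    def __init__(self, trainable_net, d_in):
        self.net = trainable_net                        # trained as usual, e.g. by gradient descent
        self.delta = make_frozen_random_function(d_in)  # randomised, untrainable addition

    def predict(self, x):
        # The member's output is the trained network plus its fixed random function.
        return self.net(x) + self.delta(x)
```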
arXiv Detail & Related papers (2020-07-11T22:10:52Z)
- On Random Kernels of Residual Architectures [93.94469470368988]
We derive finite width and depth corrections for the Neural Tangent Kernel (NTK) of ResNets and DenseNets.
Our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity.
In DenseNets, however, convergence of the NTK to its limit as the width tends to infinity is guaranteed.
arXiv Detail & Related papers (2020-01-28T16:47:53Z)
- Wide Neural Networks with Bottlenecks are Deep Gaussian Processes [2.6641834518599308]
We consider the wide limit of BNNs where some hidden layers, called "bottlenecks", are held at finite width.
Although the result is intuitive, the subtlety of the proof lies in showing that the wide limit of a composition of networks is in fact the composition of the limiting GPs.
We also analyze theoretically a single-bottleneck NNGP, finding that the bottleneck induces dependence between the outputs of a multi-output network that persists through extreme post-bottleneck depths; a toy sketch of the bottleneck construction follows this entry.
arXiv Detail & Related papers (2020-01-03T18:13:45Z)
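To make the composition statement above concrete, here is a toy single-bottleneck construction: sample a finite number of bottleneck features from the pre-bottleneck network's limiting GP, then evaluate the post-bottleneck NNGP kernel on those sampled features. The ReLU (arc-cosine) kernel and the specific widths are assumptions standing in for whichever NNGP kernels the paper analyzes.

```python
# Toy sketch under stated assumptions: a single-bottleneck "deep GP" kernel built by
# sampling finitely many bottleneck features from the first NNGP and feeding them to
# a second NNGP kernel.
import numpy as np

rng = np.random.default_rng(0)

def relu_nngp_kernel(X):
    """One-hidden-layer ReLU NNGP (arc-cosine) kernel on the rows of X."""
    G = X @ X.T / X.shape[1]
    norms = np.sqrt(np.clip(np.diag(G), 1e-12, None))
    cos = np.clip(G / np.outer(norms, norms), -1.0, 1.0)
    theta = np.arccos(cos)
    return np.outer(norms, norms) * (np.sin(theta) + (np.pi - theta) * cos) / (2 * np.pi)

def single_bottleneck_kernel(X, bottleneck_width=8):
    """Finite bottleneck_width keeps the resulting kernel random, which is what makes
    the construction a deep GP rather than a single deterministic NNGP kernel."""
    K1 = relu_nngp_kernel(X)  # pre-bottleneck NNGP kernel
    Z = rng.multivariate_normal(np.zeros(len(X)), K1, size=bottleneck_width).T  # (n, m) samples
    return relu_nngp_kernel(Z)  # post-bottleneck NNGP kernel evaluated on the sampled features
```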