Bayesian Deep Ensembles via the Neural Tangent Kernel
- URL: http://arxiv.org/abs/2007.05864v2
- Date: Sat, 24 Oct 2020 16:51:14 GMT
- Title: Bayesian Deep Ensembles via the Neural Tangent Kernel
- Authors: Bobby He, Balaji Lakshminarayanan and Yee Whye Teh
- Abstract summary: We explore the link between deep ensembles and Gaussian processes (GPs) through the lens of the Neural Tangent Kernel (NTK)
We introduce a simple modification to standard deep ensembles training, through addition of a computationally-tractable, randomised and untrainable function to each ensemble member.
We prove that our Bayesian deep ensembles make more conservative predictions than standard deep ensembles in the infinite width limit.
- Score: 49.569912265882124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore the link between deep ensembles and Gaussian processes (GPs)
through the lens of the Neural Tangent Kernel (NTK): a recent development in
understanding the training dynamics of wide neural networks (NNs). Previous
work has shown that even in the infinite width limit, when NNs become GPs,
there is no GP posterior interpretation to a deep ensemble trained with squared
error loss. We introduce a simple modification to standard deep ensembles
training, through addition of a computationally-tractable, randomised and
untrainable function to each ensemble member, that enables a posterior
interpretation in the infinite width limit. When ensembled together, our
trained NNs give an approximation to a posterior predictive distribution, and
we prove that our Bayesian deep ensembles make more conservative predictions
than standard deep ensembles in the infinite width limit. Finally, using finite
width NNs we demonstrate that our Bayesian deep ensembles faithfully emulate
the analytic posterior predictive when available, and can outperform standard
deep ensembles in various out-of-distribution settings, for both regression and
classification tasks.
Related papers
- Generalization of Scaled Deep ResNets in the Mean-Field Regime [55.77054255101667]
We investigate emphscaled ResNet in the limit of infinitely deep and wide neural networks.
Our results offer new insights into the generalization ability of deep ResNet beyond the lazy training regime.
arXiv Detail & Related papers (2024-03-14T21:48:00Z) - An Infinite-Width Analysis on the Jacobian-Regularised Training of a Neural Network [10.384951432591492]
Recent theoretical analysis of deep neural networks in their infinite-width limits has deepened our understanding of initialisation, feature learning, and training of those networks.
We show that this infinite-width analysis can be extended to the Jacobian of a deep neural network.
We experimentally show the relevance of our theoretical claims to wide finite networks, and empirically analyse the properties of kernel regression solution to obtain an insight into Jacobian regularisation.
arXiv Detail & Related papers (2023-12-06T09:52:18Z) - Wide Neural Networks as Gaussian Processes: Lessons from Deep
Equilibrium Models [16.07760622196666]
We study the deep equilibrium model (DEQ), an infinite-depth neural network with shared weight matrices across layers.
Our analysis reveals that as the width of DEQ layers approaches infinity, it converges to a Gaussian process.
Remarkably, this convergence holds even when the limits of depth and width are interchanged.
arXiv Detail & Related papers (2023-10-16T19:00:43Z) - Feature Learning and Generalization in Deep Networks with Orthogonal Weights [1.7956122940209063]
Deep neural networks with numerically weights from independent Gaussian distributions can be tuned to criticality.
These networks still exhibit fluctuations that grow linearly with the depth of the network.
We show analytically that rectangular networks with tanh activations and weights from the ensemble of matrices have corresponding preactivation fluctuations.
arXiv Detail & Related papers (2023-10-11T18:00:02Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - Deep Stable neural networks: large-width asymptotics and convergence
rates [3.0108936184913295]
We show that as the width goes to infinity jointly over the NN's layers, a suitable rescaled deep Stable NN converges weakly to a Stable SP.
Because of the non-triangular NN's structure, this is a non-standard problem, to which we propose a novel and self-contained inductive approach.
arXiv Detail & Related papers (2021-08-02T12:18:00Z) - Random Neural Networks in the Infinite Width Limit as Gaussian Processes [16.75218291152252]
This article gives a new proof that fully connected neural networks with random weights and biases converge to Gaussian processes in the regime where the input dimension, output dimension, and depth are kept fixed.
Unlike prior work, convergence is shown assuming only moment conditions for the distribution of weights and for quite general non-linearities.
arXiv Detail & Related papers (2021-07-04T07:00:20Z) - A Convergence Theory Towards Practical Over-parameterized Deep Neural
Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks with quadratic widths in the sample size and linear in their depth at a time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z) - Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z) - On Random Kernels of Residual Architectures [93.94469470368988]
We derive finite width and depth corrections for the Neural Tangent Kernel (NTK) of ResNets and DenseNets.
Our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity.
In DenseNets, however, convergence of the NTK to its limit as the width tends to infinity is guaranteed.
arXiv Detail & Related papers (2020-01-28T16:47:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.