Characteristics of Monte Carlo Dropout in Wide Neural Networks
- URL: http://arxiv.org/abs/2007.05434v1
- Date: Fri, 10 Jul 2020 15:14:43 GMT
- Title: Characteristics of Monte Carlo Dropout in Wide Neural Networks
- Authors: Joachim Sicking, Maram Akila, Tim Wirtz, Sebastian Houben, Asja
Fischer
- Abstract summary: Monte Carlo (MC) dropout is one of the state-of-the-art approaches for uncertainty estimation in neural networks (NNs)
We study the limiting distribution of wide untrained NNs under dropout more rigorously and prove that they as well converge to Gaussian processes for fixed sets of weights and biases.
We investigate how (strongly) correlated pre-activations can induce non-Gaussian behavior in NNs with strongly correlated weights.
- Score: 16.639005039546745
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Monte Carlo (MC) dropout is one of the state-of-the-art approaches for
uncertainty estimation in neural networks (NNs). It has been interpreted as
approximately performing Bayesian inference. Based on previous work on the
approximation of Gaussian processes by wide and deep neural networks with
random weights, we study the limiting distribution of wide untrained NNs under
dropout more rigorously and prove that they as well converge to Gaussian
processes for fixed sets of weights and biases. We sketch an argument that this
property might also hold for infinitely wide feed-forward networks that are
trained with (full-batch) gradient descent. The theory is contrasted by an
empirical analysis in which we find correlations and non-Gaussian behaviour for
the pre-activations of finite width NNs. We therefore investigate how
(strongly) correlated pre-activations can induce non-Gaussian behavior in NNs
with strongly correlated weights.
Related papers
- Wide Deep Neural Networks with Gaussian Weights are Very Close to
Gaussian Processes [1.0878040851638]
We show that the distance between the network output and the corresponding Gaussian approximation scales inversely with the width of the network, exhibiting faster convergence than the naive suggested by the central limit theorem.
We also apply our bounds to obtain theoretical approximations for the exact posterior distribution of the network, when the likelihood is a bounded Lipschitz function of the network output evaluated on a (finite) training set.
arXiv Detail & Related papers (2023-12-18T22:29:40Z) - Wide Neural Networks as Gaussian Processes: Lessons from Deep
Equilibrium Models [16.07760622196666]
We study the deep equilibrium model (DEQ), an infinite-depth neural network with shared weight matrices across layers.
Our analysis reveals that as the width of DEQ layers approaches infinity, it converges to a Gaussian process.
Remarkably, this convergence holds even when the limits of depth and width are interchanged.
arXiv Detail & Related papers (2023-10-16T19:00:43Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - Posterior Inference on Shallow Infinitely Wide Bayesian Neural Networks under Weights with Unbounded Variance [1.5960546024967326]
It is known that the infinite width scaling limit of a Bayesian neural network with one hidden layer is a Gaussian process, when the network weights have bounded prior variance.
Neal's result has been extended to networks with multiple hidden layers and to convolutional neural networks, also with Gaussian process scaling limits.
Our contribution is an interpretable and computationally efficient procedure for posterior inference, using a conditionally Gaussian representation, that then allows full use of the Gaussian process machinery for tractable posterior inference and uncertainty quantification in the non-Gaussian regime.
arXiv Detail & Related papers (2023-05-18T02:55:00Z) - Bayesian inference with finitely wide neural networks [0.4568777157687961]
We propose a non-Gaussian distribution in differential form to model a finite set of outputs from a random neural network.
We are able to derive the non-Gaussian posterior distribution in Bayesian regression task.
arXiv Detail & Related papers (2023-03-06T03:25:30Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - On the Neural Tangent Kernel Analysis of Randomly Pruned Neural Networks [91.3755431537592]
We study how random pruning of the weights affects a neural network's neural kernel (NTK)
In particular, this work establishes an equivalence of the NTKs between a fully-connected neural network and its randomly pruned version.
arXiv Detail & Related papers (2022-03-27T15:22:19Z) - Bayesian Deep Ensembles via the Neural Tangent Kernel [49.569912265882124]
We explore the link between deep ensembles and Gaussian processes (GPs) through the lens of the Neural Tangent Kernel (NTK)
We introduce a simple modification to standard deep ensembles training, through addition of a computationally-tractable, randomised and untrainable function to each ensemble member.
We prove that our Bayesian deep ensembles make more conservative predictions than standard deep ensembles in the infinite width limit.
arXiv Detail & Related papers (2020-07-11T22:10:52Z) - Predicting the outputs of finite deep neural networks trained with noisy
gradients [1.1470070927586014]
A recent line of works studied wide deep neural networks (DNNs) by approximating them as Gaussian Processes (GPs)
Here we consider a DNN training protocol involving noise, weight decay and finite width, whose outcome corresponds to a certain non-Gaussian process.
An analytical framework is then introduced to analyze this non-Gaussian process, whose deviation from a GP is controlled by the finite width.
arXiv Detail & Related papers (2020-04-02T18:00:01Z) - Bayesian Deep Learning and a Probabilistic Perspective of Generalization [56.69671152009899]
We show that deep ensembles provide an effective mechanism for approximate Bayesian marginalization.
We also propose a related approach that further improves the predictive distribution by marginalizing within basins of attraction.
arXiv Detail & Related papers (2020-02-20T15:13:27Z) - A Generalized Neural Tangent Kernel Analysis for Two-layer Neural
Networks [87.23360438947114]
We show that noisy gradient descent with weight decay can still exhibit a " Kernel-like" behavior.
This implies that the training loss converges linearly up to a certain accuracy.
We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay.
arXiv Detail & Related papers (2020-02-10T18:56:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.