Predicting the outputs of finite deep neural networks trained with noisy
gradients
- URL: http://arxiv.org/abs/2004.01190v3
- Date: Thu, 30 Sep 2021 07:19:27 GMT
- Title: Predicting the outputs of finite deep neural networks trained with noisy
gradients
- Authors: Gadi Naveh, Oded Ben-David, Haim Sompolinsky and Zohar Ringel
- Abstract summary: A recent line of works studied wide deep neural networks (DNNs) by approximating them as Gaussian Processes (GPs).
Here we consider a DNN training protocol involving noise, weight decay and finite width, whose outcome corresponds to a certain non-Gaussian process.
An analytical framework is then introduced to analyze this non-Gaussian process, whose deviation from a GP is controlled by the finite width.
- Score: 1.1470070927586014
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A recent line of works studied wide deep neural networks (DNNs) by
approximating them as Gaussian Processes (GPs). A DNN trained with gradient
flow was shown to map to a GP governed by the Neural Tangent Kernel (NTK),
whereas earlier works showed that a DNN with an i.i.d. prior over its weights
maps to the so-called Neural Network Gaussian Process (NNGP). Here we consider
a DNN training protocol, involving noise, weight decay and finite width, whose
outcome corresponds to a certain non-Gaussian stochastic process. An analytical
framework is then introduced to analyze this non-Gaussian process, whose
deviation from a GP is controlled by the finite width. Our contribution is
three-fold: (i) In the infinite width limit, we establish a correspondence
between DNNs trained with noisy gradients and the NNGP, not the NTK. (ii) We
provide a general analytical form for the finite width correction (FWC) for
DNNs with arbitrary activation functions and depth and use it to predict the
outputs of empirical finite networks with high accuracy. Analyzing the FWC
behavior as a function of $n$, the training set size, we find that it is
negligible for both the very small $n$ regime, and, surprisingly, for the large
$n$ regime (where the GP error scales as $O(1/n)$). (iii) We flesh out
algebraically how these FWCs can improve the performance of finite
convolutional neural networks (CNNs) relative to their GP counterparts on image
classification tasks.
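The infinite-width correspondence described above can be made concrete: a fully-connected ReLU network with i.i.d. Gaussian priors over its weights induces the NNGP kernel via a layer-wise recursion (the arc-cosine kernel of Cho & Saul), and predictions follow from standard GP regression. The sketch below is illustrative only and is not the paper's code; the depth, weight/bias prior variances `sw2`/`sb2`, and noise level are arbitrary choices.

```python
import numpy as np

def nngp_kernel(X1, X2, depth=3, sw2=2.0, sb2=0.0):
    """NNGP kernel of a fully-connected ReLU network via the arc-cosine
    recursion. sw2/sb2 are the weight/bias prior variances (illustrative
    values, not taken from the paper)."""
    d = X1.shape[1]
    # Input-layer kernel and its diagonals for X1 and X2.
    K12 = sb2 + sw2 * X1 @ X2.T / d
    K11 = sb2 + sw2 * np.sum(X1**2, axis=1) / d
    K22 = sb2 + sw2 * np.sum(X2**2, axis=1) / d
    for _ in range(depth):
        norm = np.sqrt(np.outer(K11, K22))
        cos_t = np.clip(K12 / norm, -1.0, 1.0)
        theta = np.arccos(cos_t)
        # Arc-cosine kernel of degree 1: E[relu(u) relu(v)] for Gaussian (u, v).
        K12 = sb2 + sw2 / (2 * np.pi) * norm * (np.sin(theta) + (np.pi - theta) * cos_t)
        # Diagonal update: E[relu(z)^2] = K/2 for centered Gaussian z.
        K11 = sb2 + sw2 * K11 / 2
        K22 = sb2 + sw2 * K22 / 2
    return K12, K11, K22

def gp_predict(X_train, y_train, X_test, noise=1e-3, **kw):
    """GP posterior mean under the NNGP kernel: K_*n (K_nn + sigma^2 I)^{-1} y."""
    Knn, _, _ = nngp_kernel(X_train, X_train, **kw)
    Ksn, _, _ = nngp_kernel(X_test, X_train, **kw)
    alpha = np.linalg.solve(Knn + noise * np.eye(len(X_train)), y_train)
    return Ksn @ alpha
```

With a small noise term the GP posterior mean nearly interpolates the training targets; the paper's finite-width corrections would then perturb these GP predictions at order 1/width.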
Related papers
- Graph Neural Networks Do Not Always Oversmooth [46.57665708260211]
We study oversmoothing in graph convolutional networks (GCNs) by using their Gaussian process (GP) equivalence in the limit of infinitely many hidden features.
We identify a new, non-oversmoothing phase: if the initial weights of the network have sufficiently large variance, GCNs do not oversmooth, and node features remain informative even at large depth.
arXiv Detail & Related papers (2024-06-04T12:47:13Z)
- Speed Limits for Deep Learning [67.69149326107103]
Recent advancement in thermodynamics allows bounding the speed at which one can go from the initial weight distribution to the final distribution of the fully trained network.
We provide analytical expressions for these speed limits for linear and linearizable neural networks.
Remarkably, given some plausible scaling assumptions on the NTK spectrum and the spectral decomposition of the labels, learning is optimal in a scaling sense.
arXiv Detail & Related papers (2023-07-27T06:59:46Z)
- Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
- Superiority of GNN over NN in generalizing bandlimited functions [6.3151583550712065]
Graph Neural Networks (GNNs) have emerged as formidable resources for processing graph-based information across diverse applications.
In this study, we investigate the proficiency of GNNs for such classifications, which can also be cast as a function problem.
Our findings highlight the pronounced efficiency of GNNs in generalizing a bandlimited function within an $\varepsilon$-error margin.
arXiv Detail & Related papers (2022-06-13T05:15:12Z)
- A self consistent theory of Gaussian Processes captures feature learning effects in finite CNNs [2.28438857884398]
Deep neural networks (DNNs) in the infinite width/channel limit have received much attention recently.
Despite its theoretical appeal, this viewpoint lacks a crucial ingredient of deep learning in finite DNNs, lying at the heart of their success: feature learning.
Here we consider DNNs trained with noisy gradient descent on a large training set and derive a self consistent Gaussian Process theory accounting for strong finite-DNN and feature learning effects.
arXiv Detail & Related papers (2021-06-08T05:20:00Z)
- Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
- When and why PINNs fail to train: A neural tangent kernel perspective [2.1485350418225244]
We derive the Neural Tangent Kernel (NTK) of PINNs and prove that, under appropriate conditions, it converges to a deterministic kernel that stays constant during training in the infinite-width limit.
We find a remarkable discrepancy in the convergence rate of the different loss components contributing to the total training error.
We propose a novel gradient descent algorithm that utilizes the eigenvalues of the NTK to adaptively calibrate the convergence rate of the total training error.
arXiv Detail & Related papers (2020-07-28T23:44:56Z)
- Characteristics of Monte Carlo Dropout in Wide Neural Networks [16.639005039546745]
Monte Carlo (MC) dropout is one of the state-of-the-art approaches for uncertainty estimation in neural networks (NNs).
We study the limiting distribution of wide untrained NNs under dropout more rigorously and prove that they, too, converge to Gaussian processes for fixed sets of weights and biases.
We investigate how (strongly) correlated pre-activations can induce non-Gaussian behavior in NNs with strongly correlated weights.
arXiv Detail & Related papers (2020-07-10T15:14:43Z)
- Exact posterior distributions of wide Bayesian neural networks [51.20413322972014]
We show that the exact BNN posterior converges (weakly) to the one induced by the GP limit of the prior.
For empirical validation, we show how to generate exact samples from a finite BNN on a small dataset via rejection sampling.
arXiv Detail & Related papers (2020-06-18T13:57:04Z)
- Infinitely Wide Graph Convolutional Networks: Semi-supervised Learning via Gaussian Processes [144.6048446370369]
Graph convolutional neural networks (GCNs) have recently demonstrated promising results on graph-based semi-supervised classification.
We propose a GP regression model via GCNs (GPGC) for graph-based semi-supervised learning.
We conduct extensive experiments to evaluate GPGC and demonstrate that it outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2020-02-26T10:02:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.