Wide neural networks: From non-Gaussian random fields at initialization
to the NTK geometry of training
- URL: http://arxiv.org/abs/2304.03385v1
- Date: Thu, 6 Apr 2023 21:34:13 GMT
- Title: Wide neural networks: From non-Gaussian random fields at initialization
to the NTK geometry of training
- Authors: Luís Carvalho, João Lopes Costa, José Mourão, Gonçalo Oliveira
- Abstract summary: Recent developments in applications of artificial neural networks with over $n=10^{14}$ parameters make it extremely important to study the large $n$ behaviour of such networks.
Most works studying wide neural networks have focused on the infinite width $n \to +\infty$ limit of such networks.
In this work we will study their behavior for large, but finite $n$.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent developments in applications of artificial neural networks with over
$n=10^{14}$ parameters make it extremely important to study the large $n$
behaviour of such networks. Most works studying wide neural networks have
focused on the infinite width $n \to +\infty$ limit of such networks and have
shown that, at initialization, they correspond to Gaussian processes. In this
work we will study their behavior for large, but finite $n$. Our main
contributions are the following:
(1) The computation of the corrections to Gaussianity in terms of an
asymptotic series in $n^{-\frac{1}{2}}$. The coefficients in this expansion are
determined by the statistics of parameter initialization and by the activation
function.
(2) Controlling the evolution of the outputs of finite width $n$ networks,
during training, by computing deviations from the limiting infinite width case
(in which the network evolves through a linear flow). This improves previous
estimates and yields sharper decay rates for the (finite width) NTK in terms of
$n$, valid during the entire training procedure. As a corollary, we also prove
that, with arbitrarily high probability, the training of sufficiently wide
neural networks converges to a global minimum of the corresponding quadratic
loss function.
(3) Estimating how the deviations from Gaussianity evolve with training in
terms of $n$. In particular, using a certain metric in the space of measures we
find that, along training, the resulting measure is within
$n^{-\frac{1}{2}}(\log n)^{1+}$ of the time dependent Gaussian process
corresponding to the infinite width network (which is explicitly given by
precomposing the initial Gaussian process with the linear flow corresponding to
training in the infinite width limit).
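As an illustration of contribution (1), the following minimal sketch (not taken from the paper; the one-hidden-layer tanh architecture, the $1/\sqrt{n}$ output scaling, and the sample sizes are illustrative assumptions) estimates a simple deviation from Gaussianity, the excess kurtosis of the output of a randomly initialized network, and shows it shrinking as the width $n$ grows:

```python
# Minimal sketch (not from the paper): measure how far the output of a random
# one-hidden-layer tanh network is from Gaussian at initialization, as a
# function of the width n. With a unit-norm probe input and standard Gaussian
# weights, the hidden preactivations w_i . x are standard Gaussians, so we
# sample them directly instead of drawing weight matrices.
import numpy as np

rng = np.random.default_rng(0)

def sample_outputs(n, n_samples=50_000):
    """Outputs f(x) = n^{-1/2} * sum_i a_i * tanh(w_i . x) over random inits."""
    z = rng.normal(size=(n_samples, n))      # preactivations w_i . x ~ N(0, 1)
    a = rng.normal(size=(n_samples, n))      # output weights ~ N(0, 1)
    return (a * np.tanh(z)).sum(axis=1) / np.sqrt(n)

def excess_kurtosis(f):
    """Fourth standardized cumulant; identically 0 for an exact Gaussian."""
    f = (f - f.mean()) / f.std()
    return float(np.mean(f ** 4) - 3.0)

for n in [4, 16, 64, 256]:
    print(f"n = {n:4d}   excess kurtosis ~ {excess_kurtosis(sample_outputs(n)):+.4f}")
# The estimates shrink toward 0 as n grows (the largest widths are already
# within sampling noise of a Gaussian), consistent with finite-width
# corrections to Gaussianity that vanish in the infinite-width limit.
```

The paper goes beyond such empirical checks by computing these corrections exactly, as an asymptotic series in $n^{-\frac{1}{2}}$ whose coefficients depend on the statistics of parameter initialization and on the activation function.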
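Contribution (2) can be illustrated in the same spirit by tracking how much the empirical NTK of a finite-width network moves during training. The sketch below (again an illustrative toy: the dataset, learning rate, and step count are assumptions, not the paper's construction) trains a one-hidden-layer tanh network by full-batch gradient descent on a tiny regression problem and reports the relative Frobenius-norm change of the empirical NTK between initialization and the end of training; the drift shrinks as the width grows, whereas in the infinite-width limit the kernel is frozen and the outputs evolve through a linear flow:

```python
# Minimal sketch (not from the paper): track how much the empirical NTK of a
# one-hidden-layer tanh network changes during gradient-descent training, for
# several widths n. Only the qualitative trend (smaller NTK drift at larger
# width) reflects the finite-width control described in contribution (2).
import numpy as np

rng = np.random.default_rng(1)
d, m = 3, 8                                     # input dimension, number of training points
X = rng.normal(size=(m, d)) / np.sqrt(d)        # tiny synthetic inputs
y = np.sin(X.sum(axis=1))                       # arbitrary smooth targets

def forward(W, a, X):
    """f(x) = n^{-1/2} a . tanh(W x), evaluated on every row x of X."""
    return np.tanh(X @ W.T) @ a / np.sqrt(len(a))

def empirical_ntk(W, a, X):
    """K[k, l] = grad_theta f(x_k) . grad_theta f(x_l), in closed form."""
    n = len(a)
    H = np.tanh(X @ W.T)                        # tanh(W x_k), shape (m, n)
    D = 1.0 - H ** 2                            # tanh'(W x_k), shape (m, n)
    K_a = H @ H.T / n                           # part from the output weights a
    K_W = (X @ X.T) * ((D * a ** 2) @ D.T) / n  # part from the hidden weights W
    return K_a + K_W

def train(n, lr=0.1, steps=1000):
    W = rng.normal(size=(n, d))                 # standard Gaussian initialization
    a = rng.normal(size=n)
    K0 = empirical_ntk(W, a, X)
    for _ in range(steps):                      # full-batch gradient descent on the quadratic loss
        H = np.tanh(X @ W.T)
        r = H @ a / np.sqrt(n) - y              # residuals f(x_k) - y_k
        grad_a = H.T @ r / np.sqrt(n)
        grad_W = a[:, None] * (((1.0 - H ** 2) * r[:, None]).T @ X) / np.sqrt(n)
        a -= lr * grad_a
        W -= lr * grad_W
    K1 = empirical_ntk(W, a, X)
    drift = np.linalg.norm(K1 - K0) / np.linalg.norm(K0)
    loss = 0.5 * np.sum((forward(W, a, X) - y) ** 2)
    return drift, loss

for n in [16, 64, 256, 1024]:
    drift, loss = train(n)
    print(f"n = {n:5d}   relative NTK drift ~ {drift:.4f}   final loss ~ {loss:.5f}")
```

The paper's estimates make this quantitative: they give sharper decay rates in $n$ for the finite-width NTK that hold during the entire training procedure, and imply convergence of sufficiently wide networks to a global minimum of the quadratic loss with arbitrarily high probability.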
Related papers
- Bayesian Inference with Deep Weakly Nonlinear Networks [57.95116787699412]
We show at a physics level of rigor that Bayesian inference with a fully connected neural network is solvable.
We provide techniques to compute the model evidence and posterior to arbitrary order in $1/N$ and at arbitrary temperature.
arXiv Detail & Related papers (2024-05-26T17:08:04Z) - Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model the functions learned by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - Deep Neural Network Initialization with Sparsity Inducing Activations [5.437298646956505]
We use the large width Gaussian process limit to analyze the behaviour of nonlinear activations that induce sparsity in the hidden outputs.
A previously unreported form of training instability is proven for arguably two of the most natural candidates for hidden layer sparsification.
We show that this instability is overcome by clipping the nonlinear activation magnitude, at a level prescribed by the shape of the associated Gaussian process variance map.
arXiv Detail & Related papers (2024-02-25T20:11:40Z) - The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich
Regimes [75.59720049837459]
We study the transition from infinite-width behavior to this variance-limited regime as a function of sample size $P$ and network width $N$.
We find that finite-size effects can become relevant for very small datasets, on the order of $P^* \sim \sqrt{N}$, for regression with ReLU networks.
arXiv Detail & Related papers (2022-12-23T04:48:04Z) - Bounding the Width of Neural Networks via Coupled Initialization -- A
Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z) - On the Neural Tangent Kernel Analysis of Randomly Pruned Neural Networks [91.3755431537592]
We study how random pruning of the weights affects a neural network's neural tangent kernel (NTK).
In particular, this work establishes an equivalence of the NTKs between a fully-connected neural network and its randomly pruned version.
arXiv Detail & Related papers (2022-03-27T15:22:19Z) - Does Preprocessing Help Training Over-parameterized Neural Networks? [19.64638346701198]
We propose two novel preprocessing ideas to bypass the $\Omega(mnd)$ barrier.
Our results provide theoretical insights for a large number of previously established fast training methods.
arXiv Detail & Related papers (2021-10-09T18:16:23Z) - The Rate of Convergence of Variation-Constrained Deep Neural Networks [35.393855471751756]
We show that a class of variation-constrained neural networks can achieve near-parametric rate $n^{-1/2+\delta}$ for an arbitrarily small constant $\delta$.
The result indicates that the neural function space needed for approximating smooth functions may not be as large as what is often perceived.
arXiv Detail & Related papers (2021-06-22T21:28:00Z) - Deep neural network approximation of analytic functions [91.3755431537592]
We establish an entropy bound for the spaces of neural networks with piecewise linear activation functions.
We derive an oracle inequality for the expected error of the considered penalized deep neural network estimators.
arXiv Detail & Related papers (2021-04-05T18:02:04Z) - Large-width functional asymptotics for deep Gaussian neural networks [2.7561479348365734]
We consider fully connected feed-forward deep neural networks where weights and biases are independent and identically distributed according to Gaussian distributions.
Our results contribute to recent theoretical studies on the interplay between infinitely wide deep neural networks and Gaussian processes.
arXiv Detail & Related papers (2021-02-20T10:14:37Z) - Infinitely Wide Tensor Networks as Gaussian Process [1.7894377200944511]
In this paper, we show the equivalence between infinitely wide tensor networks and Gaussian processes.
We implement the Gaussian process corresponding to the infinite-width limit of tensor networks and plot the sample paths of these models.
arXiv Detail & Related papers (2021-01-07T02:29:15Z)