Normalization effects on shallow neural networks and related asymptotic
expansions
- URL: http://arxiv.org/abs/2011.10487v3
- Date: Wed, 1 Jun 2022 16:08:37 GMT
- Title: Normalization effects on shallow neural networks and related asymptotic
expansions
- Authors: Jiahui Yu and Konstantinos Spiliopoulos
- Abstract summary: In particular, we investigate the effect of different scaling schemes, which lead to different normalizations of the neural network, on the network's statistical output.
We develop an expansion for the neural network's statistical output pointwise with respect to the scaling parameter as the number of hidden units grows to infinity.
We show that to leading order in $N$, the variance of the neural network's statistical output decays as the normalization implied by the scaling parameter approaches the mean-field normalization.
- Score: 20.48472873675696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider shallow (single hidden layer) neural networks and characterize
their performance when trained with stochastic gradient descent as the number
of hidden units $N$ and gradient descent steps grow to infinity. In particular,
we investigate the effect of different scaling schemes, which lead to different
normalizations of the neural network, on the network's statistical output,
closing the gap between the $1/\sqrt{N}$ and the mean-field $1/N$
normalization. We develop an asymptotic expansion for the neural network's
statistical output pointwise with respect to the scaling parameter as the
number of hidden units grows to infinity. Based on this expansion, we
demonstrate mathematically that to leading order in $N$, there is no
bias-variance trade-off, in that both bias and variance (both explicitly
characterized) decrease as the number of hidden units increases and time grows.
In addition, we show that to leading order in $N$, the variance of the neural
network's statistical output decays as the normalization implied by the scaling
parameter approaches the mean-field normalization. Numerical studies on the
MNIST and CIFAR10 datasets show that test and train accuracy monotonically
improve as the neural network's normalization gets closer to the mean field
normalization.
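To make the scaling concrete, the following is a minimal sketch (Python/NumPy, not the authors' code) of a shallow network whose output is normalized by $N^{-\gamma}$, with $\gamma \in [1/2, 1]$ interpolating between the $1/\sqrt{N}$ and mean-field $1/N$ normalizations discussed above. The tanh activation, standard Gaussian initialization, and the variance estimate at initialization are illustrative assumptions, not the paper's exact setup.

import numpy as np

def shallow_net(x, C, W, gamma):
    # g_N(x) = N^{-gamma} * sum_i C_i * tanh(W_i . x)
    N = C.shape[0]
    return (N ** -gamma) * np.sum(C * np.tanh(W @ x))

def output_variance(gamma, N, x, n_draws=2000, seed=0):
    # Empirical variance of the (untrained) network output over random initializations.
    rng = np.random.default_rng(seed)
    outs = []
    for _ in range(n_draws):
        C = rng.standard_normal(N)            # outer weights
        W = rng.standard_normal((N, x.size))  # inner weights
        outs.append(shallow_net(x, C, W, gamma))
    return np.var(outs)

x = np.array([0.3, -0.7])
for gamma in (0.5, 0.75, 1.0):
    print(f"gamma={gamma:.2f}  var(g_N(x)) ~ {output_variance(gamma, N=1000, x=x):.2e}")

In this toy setting the output variance at initialization stays $O(1)$ in $N$ for $\gamma = 1/2$ and shrinks like $1/N$ for $\gamma = 1$, which is the qualitative behavior the abstract attributes to normalizations approaching the mean-field limit.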
Related papers
- The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich
Regimes [75.59720049837459]
We study the transition from infinite-width behavior to this variance limited regime as a function of sample size $P$ and network width $N$.
We find that finite-size effects can become relevant for very small datasets on the order of $P^* \sim \sqrt{N}$ for regression with ReLU networks.
arXiv Detail & Related papers (2022-12-23T04:48:04Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Normalization effects on deep neural networks [20.48472873675696]
We study the effect of the choice of the $\gamma_i$ on the statistical behavior of the neural network's output.
We find that, in terms of the variance of the neural network's output and test accuracy, the best choice is to set the $\gamma_i$'s equal to one.
arXiv Detail & Related papers (2022-09-02T17:05:55Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - On the Neural Tangent Kernel Analysis of Randomly Pruned Neural Networks [91.3755431537592]
We study how random pruning of the weights affects a neural network's neural tangent kernel (NTK).
In particular, this work establishes an equivalence of the NTKs between a fully-connected neural network and its randomly pruned version.
arXiv Detail & Related papers (2022-03-27T15:22:19Z) - The Rate of Convergence of Variation-Constrained Deep Neural Networks [35.393855471751756]
We show that a class of variation-constrained neural networks can achieve the near-parametric rate $n^{-1/2+\delta}$ for an arbitrarily small constant $\delta$.
The result indicates that the neural function space needed for approximating smooth functions may not be as large as what is often perceived.
arXiv Detail & Related papers (2021-06-22T21:28:00Z) - Redundant representations help generalization in wide neural networks [71.38860635025907]
We study the last hidden layer representations of various state-of-the-art convolutional neural networks.
We find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise.
arXiv Detail & Related papers (2021-06-07T10:18:54Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Fundamental tradeoffs between memorization and robustness in random
features and neural tangent regimes [15.76663241036412]
We prove for a large class of activation functions that, if the model memorizes even a fraction of the training data, then its Sobolev seminorm is lower-bounded.
Experiments reveal, for the first time, a multiple-descent phenomenon in the robustness of the min-norm interpolator.
arXiv Detail & Related papers (2021-06-04T17:52:50Z) - The Efficacy of $L_1$ Regularization in Two-Layer Neural Networks [36.753907384994704]
A crucial problem in neural networks is to select the most appropriate number of hidden neurons and obtain tight statistical risk bounds.
We show that $L_1$ regularization can control the generalization error and sparsify the input dimension.
An excessively large number of neurons does not necessarily inflate the generalization error under suitable regularization.
arXiv Detail & Related papers (2020-10-02T15:23:22Z) - The Interpolation Phase Transition in Neural Networks: Memorization and
Generalization under Lazy Training [10.72393527290646]
We study these phenomena in the context of two-layer neural networks in the neural tangent (NT) regime.
We prove that as soon as $Nd \gg n$, the test error is well approximated by that of kernel ridge regression with respect to the infinite-width kernel.
The latter is in turn well approximated by the error of ridge regression, with the regularization parameter increased by a 'self-induced' term related to the high-degree components of the activation function.
arXiv Detail & Related papers (2020-07-25T01:51:13Z)