Fractional moment-preserving initialization schemes for training deep
neural networks
- URL: http://arxiv.org/abs/2005.11878v5
- Date: Sat, 13 Feb 2021 15:23:47 GMT
- Title: Fractional moment-preserving initialization schemes for training deep
neural networks
- Authors: Mert Gurbuzbalaban, Yuanhan Hu
- Abstract summary: A traditional approach to initialization in deep neural networks (DNNs) is to sample the network weights randomly so as to preserve the variance of pre-activations.
In this paper, we show that weights and therefore pre-activations can be modeled with a heavy-tailed distribution.
We show through numerical experiments that our schemes can improve the training and test performance.
- Score: 1.14219428942199
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A traditional approach to initialization in deep neural networks (DNNs) is to
sample the network weights randomly for preserving the variance of
pre-activations. On the other hand, several studies show that during the
training process, the distribution of stochastic gradients can be heavy-tailed
especially for small batch sizes. In this case, weights and therefore
pre-activations can be modeled with a heavy-tailed distribution that has an
infinite variance but has a finite (non-integer) fractional moment of order $s$
with $s<2$. Motivated by this fact, we develop initialization schemes for fully
connected feed-forward networks that can provably preserve any given moment of
order $s \in (0, 2]$ over the layers for a class of activations including ReLU,
Leaky ReLU, Randomized Leaky ReLU, and linear activations. These generalized
schemes recover traditional initialization schemes in the limit $s \to 2$ and
serve as part of a principled theory for initialization. For all these schemes,
we show that the network output admits a finite almost sure limit as the number
of layers grows, and the limit is heavy-tailed in some settings. This sheds
further light on the origins of heavy tails during signal propagation in DNNs.
We prove that the logarithm of the norm of the network outputs, if properly
scaled, converges to a Gaussian distribution with an explicit mean and variance
that we can compute, depending on the activation used, the value of $s$ chosen,
and the network width. We also prove that our initialization scheme avoids
small network output values more often than traditional approaches do.
Furthermore, the proposed initialization strategy incurs no extra cost during
the training procedure. We show through numerical experiments that our
initialization can improve the training and test performance.
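To make the moment-preservation property concrete, the following is a minimal NumPy sketch under assumptions not taken from the paper: instead of the authors' closed-form scheme, it calibrates a single Gaussian weight scale sigma by bisection so that the empirical fractional moment E|z|^s of the pre-activations is approximately unchanged from one fully connected ReLU layer to the next, and then tracks that moment as depth grows. The width, depth, choice of s, and Monte Carlo settings are illustrative.

```python
# A minimal numerical sketch of the moment-preservation idea for a fully
# connected ReLU network. This is NOT the paper's closed-form scheme: it
# calibrates a single Gaussian weight scale `sigma` by bisection so that the
# empirical fractional moment E|z|^s of the pre-activations is roughly
# unchanged from one layer to the next, then checks that moment as depth grows.
import numpy as np

rng = np.random.default_rng(0)


def sth_moment(z, s):
    """Empirical fractional moment E|z|^s over the entries of z."""
    return np.mean(np.abs(z) ** s)


def next_preact(z, sigma, rng):
    """Pre-activations of the next layer: z' = W relu(z), W_ij ~ N(0, sigma^2)."""
    n = z.shape[0]
    W = rng.normal(0.0, sigma, size=(n, n))
    return W @ np.maximum(z, 0.0)


def calibrate_sigma(s, n, n_trials=500, n_iters=30):
    """Bisect on sigma so that E|z'|^s / E|z|^s is about 1 for standard normal
    inputs; the ratio grows like sigma**s, so it is monotone in sigma."""
    lo, hi = 1e-3, 10.0
    for _ in range(n_iters):
        sigma = 0.5 * (lo + hi)
        ratios = []
        for _ in range(n_trials):
            z = rng.normal(size=n)
            ratios.append(sth_moment(next_preact(z, sigma, rng), s) / sth_moment(z, s))
        if np.mean(ratios) > 1.0:
            hi = sigma          # moment grows across the layer: scale too large
        else:
            lo = sigma          # moment shrinks across the layer: scale too small
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    s, n, depth = 1.5, 64, 30   # fractional order s in (0, 2], width, depth (assumed)
    sigma = calibrate_sigma(s, n)
    print(f"calibrated sigma = {sigma:.4f} (classical He scale sqrt(2/n) = {np.sqrt(2.0 / n):.4f})")
    z = rng.normal(size=n)
    for layer in range(1, depth + 1):
        z = next_preact(z, sigma, rng)
        if layer % 5 == 0:
            print(f"layer {layer:2d}: empirical E|z|^s = {sth_moment(z, s):.3f}")
```

The He scale sqrt(2/n) is printed only as a familiar reference point; the paper itself derives explicit scales that provably preserve a given moment of order $s \in (0, 2]$ for ReLU-type activations.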
Related papers
- Deep activity propagation via weight initialization in spiking neural networks [10.69085409825724]
Spiking Neural Networks (SNNs) offer bio-inspired advantages such as sparsity and ultra-low power consumption.
Deep SNNs process and transmit information by quantizing the real-valued membrane potentials into binary spikes.
We show theoretically that, unlike standard approaches, this method enables the propagation of activity in deep SNNs without loss of spikes.
arXiv Detail & Related papers (2024-10-01T11:02:34Z) - Optimization Guarantees of Unfolded ISTA and ADMM Networks With Smooth
Soft-Thresholding [57.71603937699949]
We study optimization guarantees, i.e., achieving near-zero training loss as the number of learning epochs increases.
We show that the threshold on the number of training samples increases with the network width.
arXiv Detail & Related papers (2023-09-12T13:03:47Z) - Principles for Initialization and Architecture Selection in Graph Neural
Networks with ReLU Activations [17.51364577113718]
We show three principles for architecture selection in finite width graph neural networks (GNNs) with ReLU activations.
First, we theoretically derive what is essentially the unique generalization to ReLU GNNs of the well-known He-initialization.
Second, we prove in finite width vanilla ReLU GNNs that oversmoothing is unavoidable at large depth when using a fixed aggregation operator.
arXiv Detail & Related papers (2023-06-20T16:40:41Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - Dynamical Isometry for Residual Networks [8.21292084298669]
We show that RISOTTO achieves perfect dynamical isometry for residual networks with ReLU activation functions even for finite depth and width.
In experiments, we demonstrate that our approach outperforms schemes proposed to make Batch Normalization obsolete, including Fixup and SkipInit.
arXiv Detail & Related papers (2022-10-05T17:33:23Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Subquadratic Overparameterization for Shallow Neural Networks [60.721751363271146]
We provide an analytical framework that allows us to adopt standard neural training strategies.
We achieve the desiderata via Polyak-Lojasiewicz, smoothness, and standard assumptions.
arXiv Detail & Related papers (2021-11-02T20:24:01Z) - ZerO Initialization: Initializing Residual Networks with only Zeros and
Ones [44.66636787050788]
Deep neural networks are usually initialized with random weights, with an adequately selected initial variance to ensure stable signal propagation during training.
There is no consensus on how to select the variance, and this becomes challenging as the number of layers grows.
In this work, we replace the widely used random weight initialization with a fully deterministic initialization scheme ZerO, which initializes residual networks with only zeros and ones.
Surprisingly, we find that ZerO achieves state-of-the-art performance on various image classification datasets, including ImageNet (a toy sketch of the zeros-and-ones idea appears after this list).
arXiv Detail & Related papers (2021-10-25T06:17:33Z) - On the Explicit Role of Initialization on the Convergence and Implicit
Bias of Overparametrized Linear Networks [1.0323063834827415]
We present a novel analysis of single-hidden-layer linear networks trained under gradient flow.
We show that the squared loss converges exponentially to its optimum.
We derive a novel non-asymptotic upper-bound on the distance between the trained network and the min-norm solution.
arXiv Detail & Related papers (2021-05-13T15:13:51Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
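As a side illustration of the zeros-and-ones idea in the ZerO entry above, the toy NumPy sketch below initializes a plain residual MLP deterministically: the first layer of each residual branch is a 0/1 partial identity and the second is all zeros, so every block is the identity map at initialization and the signal propagates unchanged at any depth. This is an assumed simplification, not the exact construction from that paper.

```python
# Toy zeros-and-ones initialization for a residual MLP (not the exact ZerO scheme).
import numpy as np


def partial_identity(rows, cols):
    """Rectangular 0/1 matrix with ones on the main diagonal, zeros elsewhere."""
    eye = np.zeros((rows, cols))
    np.fill_diagonal(eye, 1.0)
    return eye


def init_residual_block(d_in, d_hidden):
    """Deterministic init of one block x -> x + W2 relu(W1 x): W1 is a partial
    identity and W2 is all zeros, so the block is the identity map at init."""
    W1 = partial_identity(d_hidden, d_in)
    W2 = np.zeros((d_in, d_hidden))
    return W1, W2


def forward(x, blocks):
    for W1, W2 in blocks:
        x = x + W2 @ np.maximum(W1 @ x, 0.0)
    return x


if __name__ == "__main__":
    d, depth = 8, 50
    blocks = [init_residual_block(d, 2 * d) for _ in range(depth)]
    x = np.arange(1.0, d + 1.0)
    # Every weight entry is 0 or 1, yet the 50-block network propagates the
    # signal unchanged (it is exactly the identity map at initialization).
    print(np.allclose(forward(x, blocks), x))   # True
```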
This list is automatically generated from the titles and abstracts of the papers on this site.