ZerO Initialization: Initializing Residual Networks with only Zeros and
Ones
- URL: http://arxiv.org/abs/2110.12661v1
- Date: Mon, 25 Oct 2021 06:17:33 GMT
- Title: ZerO Initialization: Initializing Residual Networks with only Zeros and
Ones
- Authors: Jiawei Zhao, Florian Schäfer, Anima Anandkumar
- Abstract summary: Deep neural networks are usually initialized with random weights, with an adequately selected initial variance to ensure stable signal propagation during training.
There is no consensus on how to select the variance, and this becomes challenging as the number of layers grows.
In this work, we replace the widely used random weight initialization with a fully deterministic initialization scheme ZerO, which initializes residual networks with only zeros and ones.
Surprisingly, we find that ZerO achieves state-of-the-art performance over various image classification datasets, including ImageNet.
- Score: 44.66636787050788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks are usually initialized with random weights, with
adequately selected initial variance to ensure stable signal propagation during
training. However, there is no consensus on how to select the variance, and
this becomes challenging especially as the number of layers grows. In this
work, we replace the widely used random weight initialization with a fully
deterministic initialization scheme ZerO, which initializes residual networks
with only zeros and ones. By augmenting the standard ResNet architectures with
a few extra skip connections and Hadamard transforms, ZerO allows us to start
the training from zeros and ones entirely. This has many benefits such as
improving reproducibility (by reducing the variance over different experimental
runs) and allowing network training without batch normalization. Surprisingly,
we find that ZerO achieves state-of-the-art performance over various image
classification datasets, including ImageNet, which suggests random weights may
be unnecessary for modern network initialization.
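For intuition, the following is a minimal sketch of what a deterministic, ZerO-style initialization could look like for fully connected layers: square weight matrices start as the identity, and dimension-changing layers are routed through a Hadamard transform so that distinct inputs are not collapsed onto identical features. The helper names (hadamard, zero_style_init) and the exact scaling are illustrative assumptions, not the paper's released code; the paper itself works out the construction for full ResNet architectures.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    assert n > 0 and (n & (n - 1)) == 0, "n must be a power of two"
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def zero_style_init(fan_out: int, fan_in: int) -> np.ndarray:
    """Deterministic weight matrix in the spirit of ZerO (illustrative, not the paper's exact recipe).

    A square layer starts as the identity; a rectangular layer embeds a partial identity
    through a normalized Hadamard transform, so every entry is derived from zeros and ones
    and no randomness is involved.
    """
    if fan_out == fan_in:
        return np.eye(fan_out)
    m = 1 << int(np.ceil(np.log2(max(fan_out, fan_in))))  # next power of two
    H = hadamard(m) / np.sqrt(m)                          # orthonormal Hadamard transform
    P_out = np.eye(fan_out, m)                            # fan_out x m partial identity
    P_in = np.eye(m, fan_in)                              # m x fan_in partial identity
    return P_out @ H @ P_in

# Example: a 256 -> 512 expanding layer and a 512 -> 512 layer, both fully deterministic.
W_expand = zero_style_init(512, 256)
W_square = zero_style_init(512, 512)   # exact identity
print(W_expand.shape, W_square.shape)  # (512, 256) (512, 512)
```

Loosely, these deterministic matrices play the role that carefully scaled random matrices play in He- or Xavier-style schemes: they keep signal norms stable across layers, but without introducing any run-to-run variance, which is what the abstract credits for the improved reproducibility.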
Related papers
- Random Weights Networks Work as Loss Prior Constraint for Image Restoration [50.80507007507757]
We present our belief that "Random Weights Networks can act as a Loss Prior Constraint for Image Restoration".
This prior can be plugged directly into existing networks without any additional training or testing computational cost.
To emphasize: our main focus is to spark renewed interest in loss function design and rescue it from its currently neglected status.
arXiv Detail & Related papers (2023-03-29T03:43:51Z)
- Dynamical Isometry for Residual Networks [8.21292084298669]
We show that RISOTTO achieves perfect dynamical isometry for residual networks with ReLU activation functions even for finite depth and width.
In experiments, we demonstrate that our approach outperforms schemes proposed to make Batch Normalization obsolete, including Fixup and SkipInit.
arXiv Detail & Related papers (2022-10-05T17:33:23Z)
- Training Your Sparse Neural Network Better with Any Mask [106.134361318518]
Pruning large neural networks to create high-quality, independently trainable sparse masks is desirable.
In this paper we demonstrate an alternative opportunity: one can customize the sparse training techniques to deviate from the default dense network training protocols.
Our new sparse training recipe is generally applicable to improving training from scratch with various sparse masks.
arXiv Detail & Related papers (2022-06-26T00:37:33Z)
- Variance-Aware Weight Initialization for Point Convolutional Neural Networks [23.46612653627991]
We propose a framework to unify the multitude of continuous convolutions.
We show that this framework can avoid batch normalization while achieving similar and, in some cases, better performance.
arXiv Detail & Related papers (2021-12-07T15:47:14Z)
- Subquadratic Overparameterization for Shallow Neural Networks [60.721751363271146]
We provide an analytical framework that allows us to adopt standard neural training strategies.
We achieve the desiderata via Polyak-Łojasiewicz, smoothness, and standard assumptions.
arXiv Detail & Related papers (2021-11-02T20:24:01Z)
- Beyond Signal Propagation: Is Feature Diversity Necessary in Deep Neural Network Initialization? [31.122757815108884]
We construct a deep convolutional network with identical features by initializing almost all the weights to $0$.
The architecture also enables perfect signal propagation and stable gradients, and achieves high accuracy on standard benchmarks.
arXiv Detail & Related papers (2020-07-02T11:49:17Z)
- Fractional moment-preserving initialization schemes for training deep neural networks [1.14219428942199]
A traditional approach to initializing deep neural networks (DNNs) is to sample the network weights randomly so as to preserve the variance of pre-activations.
In this paper, we show that weights and therefore pre-activations can be modeled with a heavy-tailed distribution.
We show through numerical experiments that our schemes can improve the training and test performance.
arXiv Detail & Related papers (2020-05-25T01:10:01Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
- MSE-Optimal Neural Network Initialization via Layer Fusion [68.72356718879428]
Deep neural networks achieve state-of-the-art performance for a range of classification and inference tasks.
However, the use of gradient-based optimization combined with nonconvexity renders learning susceptible to the choice of initialization.
We propose fusing neighboring layers of deeper networks that are trained with random variables.
arXiv Detail & Related papers (2020-01-28T18:25:15Z)