Dynamical Isometry for Residual Networks
- URL: http://arxiv.org/abs/2210.02411v1
- Date: Wed, 5 Oct 2022 17:33:23 GMT
- Title: Dynamical Isometry for Residual Networks
- Authors: Advait Gadhikar and Rebekka Burkholz
- Abstract summary: We show that RISOTTO achieves perfect dynamical isometry for residual networks with ReLU activation functions even for finite depth and width.
In experiments, we demonstrate that our approach outperforms schemes proposed to make Batch Normalization obsolete, including Fixup and SkipInit.
- Score: 8.21292084298669
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The training success, training speed and generalization ability of neural
networks rely crucially on the choice of random parameter initialization. It
has been shown for multiple architectures that initial dynamical isometry is
particularly advantageous. Known initialization schemes for residual blocks,
however, miss this property and either suffer from degrading separability of
different inputs with increasing depth, become unstable without Batch
Normalization, or lack feature diversity. We propose a random initialization
scheme, RISOTTO, that
achieves perfect dynamical isometry for residual networks with ReLU activation
functions even for finite depth and width. It balances the contributions of the
residual and skip branches unlike other schemes, which initially bias towards
the skip connections. In experiments, we demonstrate that in most cases our
approach outperforms initialization schemes proposed to make Batch
Normalization obsolete, including Fixup and SkipInit, and facilitates stable
training. Also in combination with Batch Normalization, we find that RISOTTO
often achieves the overall best result.
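To make the branch-balancing idea concrete, below is a minimal PyTorch sketch of a ReLU residual block whose residual branch receives a non-vanishing, variance-preserving (He) initialization, in contrast to Fixup- or SkipInit-style schemes that start the residual branch at zero. This is not the actual RISOTTO construction (see the paper for the exact scheme); the class name `BalancedResBlock` and the 1/sqrt(2) branch scaling are illustrative assumptions.
```python
import torch
import torch.nn as nn


class BalancedResBlock(nn.Module):
    """Toy ReLU residual block: y = scale * (x + residual(x)).

    Illustrative sketch only, not the RISOTTO formula: the residual
    branch gets a non-zero He initialization so that both branches
    contribute at initialization, unlike Fixup/SkipInit, which start
    the residual branch at (or near) zero.
    """

    def __init__(self, width: int):
        super().__init__()
        self.fc1 = nn.Linear(width, width)
        self.fc2 = nn.Linear(width, width)
        self.relu = nn.ReLU()
        # Assumed 1/sqrt(2) convention so that summing two balanced branches
        # roughly preserves the input scale; not taken from the paper.
        self.branch_scale = 1.0 / (2.0 ** 0.5)
        with torch.no_grad():
            nn.init.kaiming_normal_(self.fc1.weight, nonlinearity="relu")
            nn.init.kaiming_normal_(self.fc2.weight, nonlinearity="relu")
            nn.init.zeros_(self.fc1.bias)
            nn.init.zeros_(self.fc2.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.fc2(self.relu(self.fc1(x)))
        return self.branch_scale * (x + residual)


block = BalancedResBlock(width=64)
print(block(torch.randn(8, 64)).std())  # stays on the order of 1 at init
```
At initialization, both the skip path and the residual path carry signal, which is the balancing property the abstract contrasts with schemes that bias towards the skip connection.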
Related papers
- Principles for Initialization and Architecture Selection in Graph Neural
Networks with ReLU Activations [17.51364577113718]
We show three principles for initialization and architecture selection in finite-width graph neural networks (GNNs) with ReLU activations.
First, we theoretically derive what is essentially the unique generalization of the well-known He-initialization to ReLU GNNs (a generic sketch of the fan-in rule appears after this list).
Second, we prove that in finite-width vanilla ReLU GNNs, oversmoothing is unavoidable at large depth when a fixed aggregation operator is used.
arXiv Detail & Related papers (2023-06-20T16:40:41Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Subquadratic Overparameterization for Shallow Neural Networks [60.721751363271146]
We provide an analytical framework that allows us to adopt standard neural training strategies.
We achieve the desiderata via the Polyak-Lojasiewicz condition, smoothness, and standard assumptions.
arXiv Detail & Related papers (2021-11-02T20:24:01Z) - ZerO Initialization: Initializing Residual Networks with only Zeros and
Ones [44.66636787050788]
Deep neural networks are usually initialized with random weights, with an adequately selected initial variance to ensure stable signal propagation during training.
There is no consensus on how to select the variance, and this becomes challenging as the number of layers grows.
In this work, we replace the widely used random weight initialization with a fully deterministic initialization scheme ZerO, which initializes residual networks with only zeros and ones.
Surprisingly, we find that ZerO achieves state-of-the-art performance over various image classification datasets, including ImageNet.
arXiv Detail & Related papers (2021-10-25T06:17:33Z) - Path Regularization: A Convexity and Sparsity Inducing Regularization
for Parallel ReLU Networks [75.33431791218302]
We study the training problem of deep neural networks and introduce an analytic approach to unveil hidden convexity in the optimization landscape.
We consider a deep parallel ReLU network architecture, which also includes standard deep networks and ResNets as its special cases.
arXiv Detail & Related papers (2021-10-18T18:00:36Z) - On the Explicit Role of Initialization on the Convergence and Implicit
Bias of Overparametrized Linear Networks [1.0323063834827415]
We present a novel analysis of single-hidden-layer linear networks trained under gradient flow.
We show that the squared loss converges exponentially to its optimum.
We derive a novel non-asymptotic upper-bound on the distance between the trained network and the min-norm solution.
arXiv Detail & Related papers (2021-05-13T15:13:51Z) - GradInit: Learning to Initialize Neural Networks for Stable and
Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture-agnostic method for initializing neural networks.
It is based on a simple heuristic: the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
arXiv Detail & Related papers (2021-02-16T11:45:35Z) - Activation Relaxation: A Local Dynamical Approximation to
Backpropagation in the Brain [62.997667081978825]
Activation Relaxation (AR) is motivated by constructing the backpropagation gradient as the equilibrium point of a dynamical system.
Our algorithm converges rapidly and robustly to the correct backpropagation gradients, requires only a single type of computational unit, and can operate on arbitrary computation graphs.
arXiv Detail & Related papers (2020-09-11T11:56:34Z) - Beyond Signal Propagation: Is Feature Diversity Necessary in Deep Neural
Network Initialization? [31.122757815108884]
We construct a deep convolutional network with identical features by initializing almost all the weights to $0$.
The architecture also enables perfect signal propagation and stable gradients, and achieves high accuracy on standard benchmarks.
arXiv Detail & Related papers (2020-07-02T11:49:17Z) - Fractional moment-preserving initialization schemes for training deep
neural networks [1.14219428942199]
A traditional approach to initializing deep neural networks (DNNs) is to sample the network weights randomly so as to preserve the variance of pre-activations.
In this paper, we show that weights and therefore pre-activations can be modeled with a heavy-tailed distribution.
We show through numerical experiments that our schemes can improve the training and test performance.
arXiv Detail & Related papers (2020-05-25T01:10:01Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
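Several of the related papers above (the ReLU-GNN principles, ZerO, and the fractional moment-preserving schemes) revolve around choosing the initial weight variance so that the forward signal neither explodes nor vanishes with depth. The snippet below is a generic sketch of the standard He fan-in rule for dense ReLU layers under the usual zero-mean Gaussian assumptions; it is not the GNN-specific generalization derived in the first related paper, and the width and depth values are arbitrary.
```python
import numpy as np

rng = np.random.default_rng(0)


def he_init(fan_in: int, fan_out: int) -> np.ndarray:
    # Fan-in He rule for ReLU layers: Var[W_ij] = 2 / fan_in.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))


width, depth = 256, 30
h = rng.normal(size=(1024, width))  # a batch of random inputs
for layer in range(1, depth + 1):
    h = np.maximum(h @ he_init(width, width), 0.0)  # dense layer + ReLU
    if layer % 10 == 0:
        # The root-mean-square activation stays roughly constant with depth,
        # i.e. the forward signal neither explodes nor vanishes.
        print(f"layer {layer:2d}: RMS activation = {np.sqrt((h ** 2).mean()):.3f}")
```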
This list is automatically generated from the titles and abstracts of the papers in this site.