Dynamical Isometry for Residual Networks
- URL: http://arxiv.org/abs/2210.02411v1
- Date: Wed, 5 Oct 2022 17:33:23 GMT
- Title: Dynamical Isometry for Residual Networks
- Authors: Advait Gadhikar and Rebekka Burkholz
- Abstract summary: We show that RISOTTO achieves perfect dynamical isometry for residual networks with ReLU activation functions even for finite depth and width.
In experiments, we demonstrate that our approach outperforms schemes proposed to make Batch Normalization obsolete, including Fixup and SkipInit.
- Score: 8.21292084298669
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The training success, training speed and generalization ability of neural
networks rely crucially on the choice of random parameter initialization. It
has been shown for multiple architectures that initial dynamical isometry is
particularly advantageous. Known initialization schemes for residual blocks,
however, miss this property and either suffer from degrading separability of
different inputs with increasing depth, become unstable without Batch
Normalization, or lack feature diversity. We propose a random initialization
scheme, RISOTTO, that
achieves perfect dynamical isometry for residual networks with ReLU activation
functions even for finite depth and width. It balances the contributions of the
residual and skip branches unlike other schemes, which initially bias towards
the skip connections. In experiments, we demonstrate that in most cases our
approach outperforms initialization schemes proposed to make Batch
Normalization obsolete, including Fixup and SkipInit, and facilitates stable
training. Also in combination with Batch Normalization, we find that RISOTTO
often achieves the overall best result.
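To make the branch-balancing idea concrete, below is a minimal PyTorch sketch of a ReLU residual block whose residual branch receives a non-vanishing, variance-preserving (He) initialization, in contrast to Fixup- or SkipInit-style schemes that start the residual branch at zero. This is not the actual RISOTTO construction (see the paper for the exact scheme); the class name `BalancedResBlock` and the 1/sqrt(2) branch scaling are illustrative assumptions.
```python
import torch
import torch.nn as nn


class BalancedResBlock(nn.Module):
    """Toy ReLU residual block: y = scale * (x + residual(x)).

    Illustrative sketch only, not the RISOTTO formula: the residual
    branch gets a non-zero He initialization so that both branches
    contribute at initialization, unlike Fixup/SkipInit, which start
    the residual branch at (or near) zero.
    """

    def __init__(self, width: int):
        super().__init__()
        self.fc1 = nn.Linear(width, width)
        self.fc2 = nn.Linear(width, width)
        self.relu = nn.ReLU()
        # Assumed 1/sqrt(2) convention so that summing two balanced branches
        # roughly preserves the input scale; not taken from the paper.
        self.branch_scale = 1.0 / (2.0 ** 0.5)
        with torch.no_grad():
            nn.init.kaiming_normal_(self.fc1.weight, nonlinearity="relu")
            nn.init.kaiming_normal_(self.fc2.weight, nonlinearity="relu")
            nn.init.zeros_(self.fc1.bias)
            nn.init.zeros_(self.fc2.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.fc2(self.relu(self.fc1(x)))
        return self.branch_scale * (x + residual)


block = BalancedResBlock(width=64)
print(block(torch.randn(8, 64)).std())  # stays on the order of 1 at init
```
At initialization, both the skip path and the residual path carry signal, which is the balancing property the abstract contrasts with schemes that bias towards the skip connection.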
Related papers
- Principles for Initialization and Architecture Selection in Graph Neural
Networks with ReLU Activations [17.51364577113718]
We show three principles for initialization and architecture selection in finite-width graph neural networks (GNNs) with ReLU activations.
First, we theoretically derive what is essentially the unique generalization of the well-known He-initialization to ReLU GNNs (a generic sketch of the fan-in rule appears after this list).
Second, we prove that in finite-width vanilla ReLU GNNs, oversmoothing is unavoidable at large depth when a fixed aggregation operator is used.
arXiv Detail & Related papers (2023-06-20T16:40:41Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Subquadratic Overparameterization for Shallow Neural Networks [60.721751363271146]
We provide an analytical framework that allows us to adopt standard neural training strategies.
We achieve the desiderata via the Polyak-Lojasiewicz condition, smoothness, and standard assumptions.
arXiv Detail & Related papers (2021-11-02T20:24:01Z) - ZerO Initialization: Initializing Residual Networks with only Zeros and
Ones [44.66636787050788]
Deep neural networks are usually initialized with random weights, with an adequately selected initial variance to ensure stable signal propagation during training.
There is no consensus on how to select the variance, and this becomes challenging as the number of layers grows.
In this work, we replace the widely used random weight initialization with a fully deterministic initialization scheme ZerO, which initializes residual networks with only zeros and ones.
Surprisingly, we find that ZerO achieves state-of-the-art performance over various image classification datasets, including ImageNet.
arXiv Detail & Related papers (2021-10-25T06:17:33Z) - Path Regularization: A Convexity and Sparsity Inducing Regularization
for Parallel ReLU Networks [75.33431791218302]
We study the training problem of deep neural networks and introduce an analytic approach to unveil hidden convexity in the optimization landscape.
We consider a deep parallel ReLU network architecture, which also includes standard deep networks and ResNets as its special cases.
arXiv Detail & Related papers (2021-10-18T18:00:36Z) - On the Explicit Role of Initialization on the Convergence and Implicit
Bias of Overparametrized Linear Networks [1.0323063834827415]
We present a novel analysis of single-hidden-layer linear networks trained under gradient flow.
We show that the squared loss converges exponentially to its optimum.
We derive a novel non-asymptotic upper-bound on the distance between the trained network and the min-norm solution.
arXiv Detail & Related papers (2021-05-13T15:13:51Z) - GradInit: Learning to Initialize Neural Networks for Stable and
Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture-agnostic method for initializing neural networks.
It is based on a simple heuristic: the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
arXiv Detail & Related papers (2021-02-16T11:45:35Z) - Activation Relaxation: A Local Dynamical Approximation to
Backpropagation in the Brain [62.997667081978825]
Activation Relaxation (AR) is motivated by constructing the backpropagation gradient as the equilibrium point of a dynamical system.
Our algorithm converges rapidly and robustly to the correct backpropagation gradients, requires only a single type of computational unit, and can operate on arbitrary computation graphs.
arXiv Detail & Related papers (2020-09-11T11:56:34Z) - Beyond Signal Propagation: Is Feature Diversity Necessary in Deep Neural
Network Initialization? [31.122757815108884]
We construct a deep convolutional network with identical features by initializing almost all the weights to $0$.
The architecture also enables perfect signal propagation and stable gradients, and achieves high accuracy on standard benchmarks.
arXiv Detail & Related papers (2020-07-02T11:49:17Z) - Fractional moment-preserving initialization schemes for training deep
neural networks [1.14219428942199]
A traditional approach to initializing deep neural networks (DNNs) is to sample the network weights randomly so as to preserve the variance of pre-activations.
In this paper, we show that weights and therefore pre-activations can be modeled with a heavy-tailed distribution.
We show through numerical experiments that our schemes can improve the training and test performance.
arXiv Detail & Related papers (2020-05-25T01:10:01Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
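Several of the related papers above (the ReLU-GNN principles, ZerO, and the fractional moment-preserving schemes) revolve around choosing the initial weight variance so that the forward signal neither explodes nor vanishes with depth. The snippet below is a generic sketch of the standard He fan-in rule for dense ReLU layers under the usual zero-mean Gaussian assumptions; it is not the GNN-specific generalization derived in the first related paper, and the width and depth values are arbitrary.
```python
import numpy as np

rng = np.random.default_rng(0)


def he_init(fan_in: int, fan_out: int) -> np.ndarray:
    # Fan-in He rule for ReLU layers: Var[W_ij] = 2 / fan_in.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))


width, depth = 256, 30
h = rng.normal(size=(1024, width))  # a batch of random inputs
for layer in range(1, depth + 1):
    h = np.maximum(h @ he_init(width, width), 0.0)  # dense layer + ReLU
    if layer % 10 == 0:
        # The root-mean-square activation stays roughly constant with depth,
        # i.e. the forward signal neither explodes nor vanishes.
        print(f"layer {layer:2d}: RMS activation = {np.sqrt((h ** 2).mean()):.3f}")
```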
This list is automatically generated from the titles and abstracts of the papers in this site.