Rethinking Skip Connection with Layer Normalization in Transformers and ResNets
- URL: http://arxiv.org/abs/2105.07205v1
- Date: Sat, 15 May 2021 11:44:49 GMT
- Title: Rethinking Skip Connection with Layer Normalization in Transformers and ResNets
- Authors: Fenglin Liu, Xuancheng Ren, Zhiyuan Zhang, Xu Sun, Yuexian Zou
- Abstract summary: Skip connection is a widely-used technique to improve the performance of deep neural networks.
In this work, we investigate how the scale factor affects the effectiveness of the skip connection.
- Score: 49.87919454950763
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Skip connection is a widely-used technique to improve the performance
and the convergence of deep neural networks. It is believed to relieve the
difficulty of optimization caused by non-linearity by propagating a linear
component through the neural network layers. However, from another point of
view, it can also be seen as a modulating mechanism between the input and the
output, with the input scaled by a pre-defined value of one. In this work, we
investigate how the scale factor affects the effectiveness of the skip
connection and reveal that even a trivial adjustment of the scale leads to
spurious gradient explosion or vanishing that grows with the depth of the
model. This can be addressed by normalization, in particular layer
normalization, which yields consistent improvements over the plain skip
connection. Inspired by these findings, we further propose to adaptively
adjust the scale of the input by recursively applying the skip connection
with layer normalization, which improves performance substantially and
generalizes well across diverse tasks including both machine translation and
image classification datasets.
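The abstract describes the mechanism only at a high level, so the following is a minimal PyTorch sketch of one straightforward reading of "recursively applying skip connection with layer normalization"; the class name RecursiveSkipLN, the recursion depth, and the exact placement of the LayerNorm are illustrative assumptions, not the authors' released implementation. For intuition on the scale sensitivity: with a residual block y = lambda * x + F(x), a constant lambda != 1 multiplies the identity-path gradient by roughly lambda^L over L stacked blocks, which is the depth-dependent explosion or vanishing the abstract refers to; normalizing after each addition removes that fixed scale.

```python
import torch
import torch.nn as nn


class RecursiveSkipLN(nn.Module):
    """Sketch of a residual block whose skip connection is re-applied
    recursively, with layer normalization after each addition, so the scale
    of the identity path is adjusted adaptively instead of being fixed to a
    pre-defined constant of one (illustrative, not the paper's code)."""

    def __init__(self, d_model: int, sublayer: nn.Module, num_recursions: int = 2):
        super().__init__()
        self.sublayer = sublayer  # e.g. a self-attention or feed-forward sublayer
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(num_recursions))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Plain residual connection: y = x + F(x). Scaling x by a constant
        # lambda != 1 would multiply the identity-path gradient by roughly
        # lambda**L across L stacked blocks, hence depth-dependent gradient
        # explosion or vanishing.
        y = x + self.sublayer(x)
        # Recursively re-apply the skip connection, normalizing after each
        # addition so the effective scale of the input stays stable.
        for norm in self.norms:
            y = norm(x + y)
        return y


# Usage sketch: wrap a Transformer-style feed-forward sublayer.
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
block = RecursiveSkipLN(d_model=512, sublayer=ffn, num_recursions=2)
out = block(torch.randn(8, 16, 512))  # (batch, sequence length, model dim)
print(out.shape)  # torch.Size([8, 16, 512])
```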
Related papers
- Concurrent Training and Layer Pruning of Deep Neural Networks [0.0]
We propose an algorithm capable of identifying and eliminating irrelevant layers of a neural network during the early stages of training.
We employ a structure using residual connections around nonlinear network sections that allow the flow of information through the network once a nonlinear section is pruned.
arXiv Detail & Related papers (2024-06-06T23:19:57Z)
- Normalization-Equivariant Neural Networks with Application to Image Denoising [3.591122855617648]
We propose a methodology for adapting existing neural networks so that normalization-equivariance holds by design.
Our main claim is that not only ordinary convolutional layers, but also all activation functions, should be completely removed from neural networks and replaced by better-conditioned, normalization-equivariant alternatives.
Experimental results in image denoising show that normalization-equivariant neural networks, in addition to their better conditioning, also provide much better generalization across noise levels.
arXiv Detail & Related papers (2023-06-08T08:42:08Z)
- Combining Explicit and Implicit Regularization for Efficient Learning in Deep Networks [3.04585143845864]
In deep linear networks, gradient descent implicitly regularizes toward low-rank solutions on matrix completion/factorization tasks.
We propose an explicit penalty to mirror this implicit bias which only takes effect with certain adaptive gradient generalizations.
This combination can enable a single-layer network to achieve low-rank approximations with generalization error comparable to deep linear networks.
arXiv Detail & Related papers (2023-06-01T04:47:17Z)
- Predictive coding, precision and natural gradients [2.1601966913620325]
We show that hierarchical predictive coding networks with learnable precision are able to solve various supervised and unsupervised learning tasks.
When applied to unsupervised auto-encoding of image inputs, the deterministic network produces hierarchically organized and disentangled embeddings.
arXiv Detail & Related papers (2021-11-12T21:05:03Z)
- Non-Gradient Manifold Neural Network [79.44066256794187]
A deep neural network (DNN) generally takes thousands of iterations to optimize via gradient descent.
We propose a novel manifold neural network based on non-gradient optimization.
arXiv Detail & Related papers (2021-06-15T06:39:13Z)
- Optimizing Mode Connectivity via Neuron Alignment [84.26606622400423]
Empirically, the local minima of loss functions can be connected by a learned curve in model space along which the loss remains nearly constant.
We propose a more general framework to investigate the effect of symmetry on landscape connectivity by accounting for the weight permutations of the networks being connected.
arXiv Detail & Related papers (2020-09-05T02:25:23Z)
- Improve Generalization and Robustness of Neural Networks via Weight Scale Shifting Invariant Regularizations [52.493315075385325]
We show that a family of regularizers, including weight decay, is ineffective at penalizing the intrinsic norms of weights for networks with homogeneous activation functions.
We propose an improved regularizer that is invariant to weight scale shifting and thus effectively constrains the intrinsic norm of a neural network.
arXiv Detail & Related papers (2020-08-07T02:55:28Z)
- Optimization Theory for ReLU Neural Networks Trained with Normalization Layers [82.61117235807606]
The success of deep neural networks is in part due to the use of normalization layers.
Our analysis shows how the introduction of normalization changes the optimization landscape and can enable faster convergence.
arXiv Detail & Related papers (2020-06-11T23:55:54Z)
- Beyond Dropout: Feature Map Distortion to Regularize Deep Neural Networks [107.77595511218429]
In this paper, we investigate the empirical Rademacher complexity related to intermediate layers of deep neural networks.
We propose a feature distortion method (Disout) for addressing the aforementioned problem.
The superiority of the proposed feature map distortion for producing deep neural networks with higher testing performance is analyzed and demonstrated.
arXiv Detail & Related papers (2020-02-23T13:59:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.