Rethinking Skip Connection with Layer Normalization in Transformers and ResNets
- URL: http://arxiv.org/abs/2105.07205v1
- Date: Sat, 15 May 2021 11:44:49 GMT
- Title: Rethinking Skip Connection with Layer Normalization in Transformers and ResNets
- Authors: Fenglin Liu, Xuancheng Ren, Zhiyuan Zhang, Xu Sun, Yuexian Zou
- Abstract summary: Skip connection is a widely-used technique to improve the performance of deep neural networks.
In this work, we investigate how the scale factor affects the effectiveness of the skip connection.
- Score: 49.87919454950763
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Skip connection is a widely-used technique to improve the performance
and the convergence of deep neural networks. It is believed to relieve the
difficulty of optimization caused by non-linearity by propagating a linear
component through the neural network layers. However, from another point of
view, it can also be seen as a modulating mechanism between the input and the
output, with the input scaled by a pre-defined value of one. In this work, we
investigate how the scale factor affects the effectiveness of the skip
connection and reveal that even a trivial adjustment of the scale leads to
spurious gradient explosion or vanishing that grows with the depth of the
model. This can be addressed by normalization, in particular layer
normalization, which yields consistent improvements over the plain skip
connection. Inspired by these findings, we further propose to adaptively
adjust the scale of the input by recursively applying the skip connection
with layer normalization, which improves performance substantially and
generalizes well across diverse tasks including both machine translation and
image classification datasets.
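The abstract describes the mechanism only at a high level, so the following is a minimal PyTorch sketch of one straightforward reading of "recursively applying skip connection with layer normalization"; the class name RecursiveSkipLN, the recursion depth, and the exact placement of the LayerNorm are illustrative assumptions, not the authors' released implementation. For intuition on the scale sensitivity: with a residual block y = lambda * x + F(x), a constant lambda != 1 multiplies the identity-path gradient by roughly lambda^L over L stacked blocks, which is the depth-dependent explosion or vanishing the abstract refers to; normalizing after each addition removes that fixed scale.

```python
import torch
import torch.nn as nn


class RecursiveSkipLN(nn.Module):
    """Sketch of a residual block whose skip connection is re-applied
    recursively, with layer normalization after each addition, so the scale
    of the identity path is adjusted adaptively instead of being fixed to a
    pre-defined constant of one (illustrative, not the paper's code)."""

    def __init__(self, d_model: int, sublayer: nn.Module, num_recursions: int = 2):
        super().__init__()
        self.sublayer = sublayer  # e.g. a self-attention or feed-forward sublayer
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(num_recursions))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Plain residual connection: y = x + F(x). Scaling x by a constant
        # lambda != 1 would multiply the identity-path gradient by roughly
        # lambda**L across L stacked blocks, hence depth-dependent gradient
        # explosion or vanishing.
        y = x + self.sublayer(x)
        # Recursively re-apply the skip connection, normalizing after each
        # addition so the effective scale of the input stays stable.
        for norm in self.norms:
            y = norm(x + y)
        return y


# Usage sketch: wrap a Transformer-style feed-forward sublayer.
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
block = RecursiveSkipLN(d_model=512, sublayer=ffn, num_recursions=2)
out = block(torch.randn(8, 16, 512))  # (batch, sequence length, model dim)
print(out.shape)  # torch.Size([8, 16, 512])
```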
Related papers
- Concurrent Training and Layer Pruning of Deep Neural Networks [0.0]
We propose an algorithm capable of identifying and eliminating irrelevant layers of a neural network during the early stages of training.
We employ a structure using residual connections around nonlinear network sections that allow the flow of information through the network once a nonlinear section is pruned.
arXiv Detail & Related papers (2024-06-06T23:19:57Z)
- Normalization-Equivariant Neural Networks with Application to Image Denoising [3.591122855617648]
We propose a methodology for adapting existing neural networks so that normalization-equivariance holds by design.
Our main claim is that not only ordinary convolutional layers, but also all activation functions, should be completely removed from neural networks and replaced by better-conditioned, normalization-equivariant alternatives.
Experimental results in image denoising show that normalization-equivariant neural networks, in addition to their better conditioning, also provide much better generalization across noise levels.
arXiv Detail & Related papers (2023-06-08T08:42:08Z)
- Combining Explicit and Implicit Regularization for Efficient Learning in Deep Networks [3.04585143845864]
In deep linear networks, gradient descent implicitly regularizes toward low-rank solutions on matrix completion/factorization tasks.
We propose an explicit penalty to mirror this implicit bias which only takes effect with certain adaptive gradient generalizations.
This combination can enable a single-layer network to achieve low-rank approximations with generalization error comparable to deep linear networks.
arXiv Detail & Related papers (2023-06-01T04:47:17Z)
- Predictive coding, precision and natural gradients [2.1601966913620325]
We show that hierarchical predictive coding networks with learnable precision are able to solve various supervised and unsupervised learning tasks.
When applied to unsupervised auto-encoding of image inputs, the deterministic network produces hierarchically organized and disentangled embeddings.
arXiv Detail & Related papers (2021-11-12T21:05:03Z)
- Non-Gradient Manifold Neural Network [79.44066256794187]
A deep neural network (DNN) generally takes thousands of iterations to optimize via gradient descent.
We propose a novel manifold neural network based on non-gradient optimization.
arXiv Detail & Related papers (2021-06-15T06:39:13Z)
- Optimizing Mode Connectivity via Neuron Alignment [84.26606622400423]
Empirically, the local minima of loss functions can be connected by a learned curve in model space along which the loss remains nearly constant.
We propose a more general framework to investigate the effect of symmetry on landscape connectivity by accounting for the weight permutations of the networks being connected.
arXiv Detail & Related papers (2020-09-05T02:25:23Z)
- Improve Generalization and Robustness of Neural Networks via Weight Scale Shifting Invariant Regularizations [52.493315075385325]
We show that a family of regularizers, including weight decay, is ineffective at penalizing the intrinsic norms of weights for networks with homogeneous activation functions.
We propose an improved regularizer that is invariant to weight scale shifting and thus effectively constrains the intrinsic norm of a neural network.
arXiv Detail & Related papers (2020-08-07T02:55:28Z)
- Optimization Theory for ReLU Neural Networks Trained with Normalization Layers [82.61117235807606]
The success of deep neural networks is in part due to the use of normalization layers.
Our analysis shows how the introduction of normalization changes the optimization landscape and can enable faster convergence.
arXiv Detail & Related papers (2020-06-11T23:55:54Z)
- Beyond Dropout: Feature Map Distortion to Regularize Deep Neural Networks [107.77595511218429]
In this paper, we investigate the empirical Rademacher complexity related to intermediate layers of deep neural networks.
We propose a feature distortion method (Disout) for addressing the aforementioned problem.
The superiority of the proposed feature map distortion for producing deep neural networks with higher testing performance is analyzed and demonstrated.
arXiv Detail & Related papers (2020-02-23T13:59:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.