Deep Transformers without Shortcuts: Modifying Self-attention for
Faithful Signal Propagation
- URL: http://arxiv.org/abs/2302.10322v1
- Date: Mon, 20 Feb 2023 21:26:25 GMT
- Title: Deep Transformers without Shortcuts: Modifying Self-attention for
Faithful Signal Propagation
- Authors: Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andrew
Brock, Samuel L Smith, Yee Whye Teh
- Abstract summary: Skip connections and normalisation layers are ubiquitous for the training of Deep Neural Networks (DNNs).
Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them.
However, these approaches are incompatible with the self-attention layers present in transformers.
- Score: 105.22961467028234
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Skip connections and normalisation layers form two standard architectural
components that are ubiquitous for the training of Deep Neural Networks (DNNs),
but whose precise roles are poorly understood. Recent approaches such as Deep
Kernel Shaping have made progress towards reducing our reliance on them, using
insights from wide NN kernel theory to improve signal propagation in vanilla
DNNs (which we define as networks without skips or normalisation). However,
these approaches are incompatible with the self-attention layers present in
transformers, whose kernels are intrinsically more complicated to analyse and
control. And so the question remains: is it possible to train deep vanilla
transformers? We answer this question in the affirmative by designing several
approaches that use combinations of parameter initialisations, bias matrices
and location-dependent rescaling to achieve faithful signal propagation in
vanilla transformers. Our methods address various intricacies specific to
signal propagation in transformers, including the interaction with positional
encoding and causal masking. In experiments on WikiText-103 and C4, our
approaches enable deep transformers without normalisation to train at speeds
matching their standard counterparts, and deep vanilla transformers to reach
the same performance as standard ones after about 5 times more iterations.
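The abstract does not spell out the constructions themselves, but the core idea can be illustrated with a minimal PyTorch sketch (an assumption, not the authors' exact bias-matrix or rescaling construction): a causally masked self-attention layer whose attention matrix is mixed with the identity via trainable per-head scalars, initialised so that no cross-token mixing happens at initialisation and signal propagation stays easy to control.

```python
# Minimal sketch, not the paper's exact method: self-attention whose output is
# (alpha * I + beta * A(X)) @ V with alpha=1, beta=0 at initialisation, so the
# layer performs no cross-token mixing at init. The paper additionally uses
# bias matrices and location-dependent rescaling, not reproduced here.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class IdentityAtInitSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        # Trainable per-head mixing scalars (illustrative assumption).
        self.alpha = nn.Parameter(torch.ones(n_heads))
        self.beta = nn.Parameter(torch.zeros(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        attn = F.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)
        eye = torch.eye(T, device=x.device)
        mixed = (self.alpha.view(1, -1, 1, 1) * eye
                 + self.beta.view(1, -1, 1, 1) * attn) @ v   # (B, heads, T, d_head)
        return self.out(mixed.transpose(1, 2).reshape(B, T, D))
```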
Related papers
- Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot [50.16171384920963]
The transformer architecture has prevailed in various deep learning settings.
A one-layer transformer trained with gradient descent provably learns the sparse token selection task.
arXiv Detail & Related papers (2024-06-11T02:15:53Z) - Simplifying Transformer Blocks [30.451976405521112]
In this work, we ask: to what extent can the standard transformer block be simplified?
We motivate modifications that allow many block components to be removed with no loss of training speed.
In experiments on both autoregressive decoder-only and BERT encoder-only models, our simplified transformers emulate the per-update training speed and performance of standard transformers.
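As a purely illustrative reading of "removing block components", the sketch below is a pre-norm-style transformer block whose skip connections and normalisation layers can be switched off; which components the paper actually removes is not stated in this summary, so the flags and sizes here are assumptions.

```python
# Illustrative sketch only: a transformer block where skip connections and
# normalisation are optional, to make "removing block components" concrete.
import torch
import torch.nn as nn

class SimplifiableBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int,
                 use_skip: bool = False, use_norm: bool = False):
        super().__init__()
        self.use_skip = use_skip
        self.norm1 = nn.LayerNorm(d_model) if use_norm else nn.Identity()
        self.norm2 = nn.LayerNorm(d_model) if use_norm else nn.Identity()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, need_weights=False)   # self-attention sub-block
        x = x + a if self.use_skip else a
        m = self.mlp(self.norm2(x))                     # MLP sub-block
        return x + m if self.use_skip else m
```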
arXiv Detail & Related papers (2023-11-03T13:30:52Z) - Characterization of anomalous diffusion through convolutional
transformers [0.8984888893275713]
We propose a new transformer-based neural network architecture for the characterization of anomalous diffusion.
Our new architecture, the Convolutional Transformer (ConvTransformer), uses a bi-layered convolutional neural network to extract features from our diffusive trajectories.
We show that the ConvTransformer is able to outperform the previous state of the art at determining the underlying diffusive regime in short trajectories.
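A minimal sketch of the pipeline just described, assuming 2-D trajectories, a two-layer ("bi-layered") 1-D convolutional feature extractor, a small transformer encoder, and a classification head over a handful of diffusive regimes; all names and sizes are illustrative, not taken from the paper.

```python
# Hedged sketch: convolutional feature extraction over trajectory coordinates,
# then a transformer encoder, then classification of the diffusive regime.
import torch
import torch.nn as nn

class ConvTransformerSketch(nn.Module):
    def __init__(self, coord_dim: int = 2, d_model: int = 64,
                 n_heads: int = 4, n_layers: int = 2, n_regimes: int = 5):
        super().__init__()
        self.conv = nn.Sequential(                      # "bi-layered" CNN (assumed)
            nn.Conv1d(coord_dim, d_model, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_regimes)

    def forward(self, traj: torch.Tensor) -> torch.Tensor:
        # traj: (batch, time, coord_dim) diffusive trajectory
        h = self.conv(traj.transpose(1, 2)).transpose(1, 2)  # (batch, time, d_model)
        h = self.encoder(h)
        return self.head(h.mean(dim=1))                 # logits over diffusive regimes
```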
arXiv Detail & Related papers (2022-10-10T18:53:13Z) - Error Correction Code Transformer [92.10654749898927]
We propose, for the first time, to extend the Transformer architecture to the soft decoding of linear codes at arbitrary block lengths.
We encode each channel output dimension into a high-dimensional embedding so that the information of each bit can be better represented and processed separately.
The proposed approach demonstrates the extreme power and flexibility of Transformers and outperforms existing state-of-the-art neural decoders by large margins at a fraction of their time complexity.
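A hedged sketch of that encoding idea: each received channel value is lifted into a high-dimensional embedding, combined with a bit-position embedding, passed through a transformer encoder, and mapped to a per-bit soft output. Dimensions, the positional scheme, and the decoding head are assumptions, not the paper's construction.

```python
# Hedged sketch: per-bit high-dimensional embeddings of noisy channel outputs,
# a transformer encoder, and per-bit soft decisions.
import torch
import torch.nn as nn

class SoftDecoderSketch(nn.Module):
    def __init__(self, block_length: int, d_model: int = 128,
                 n_heads: int = 8, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Linear(1, d_model)               # lift each channel value
        self.pos = nn.Embedding(block_length, d_model)   # bit-position embedding (assumed)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)                # soft output per bit

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, block_length) noisy channel outputs
        pos = torch.arange(y.size(1), device=y.device)
        h = self.embed(y.unsqueeze(-1)) + self.pos(pos)
        return self.head(self.encoder(h)).squeeze(-1)    # per-bit logits
```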
arXiv Detail & Related papers (2022-03-27T15:25:58Z) - SepTr: Separable Transformer for Audio Spectrogram Processing [74.41172054754928]
We propose a new vision transformer architecture called Separable Transformer (SepTr).
SepTr employs two transformer blocks in a sequential manner, the first attending to tokens within the same frequency bin, and the second attending to tokens within the same time interval.
We conduct experiments on three benchmark data sets, showing that our architecture outperforms conventional vision transformers and other state-of-the-art methods.
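The separable attention pattern lends itself to a short sketch: here each time-frequency bin is a token (a simplifying assumption; the paper's tokenisation may differ), the first block attends within each frequency bin (over time) and the second within each time interval (over frequency).

```python
# Hedged sketch of separable attention over a spectrogram: attend along time
# within each frequency bin, then along frequency within each time interval.
import torch
import torch.nn as nn

class SepTrSketch(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4, n_classes: int = 10):
        super().__init__()
        self.proj = nn.Linear(1, d_model)   # one token per time-frequency bin (assumed)
        self.time_block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.freq_block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, time, freq) spectrogram
        B, T, F = spec.shape
        x = self.proj(spec.unsqueeze(-1))                              # (B, T, F, d)
        x = self.time_block(x.permute(0, 2, 1, 3).reshape(B * F, T, -1))   # within each freq bin
        x = x.reshape(B, F, T, -1).permute(0, 2, 1, 3).reshape(B * T, F, -1)
        x = self.freq_block(x)                                          # within each time interval
        x = x.reshape(B, T, F, -1)
        return self.head(x.mean(dim=(1, 2)))                            # class logits
```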
arXiv Detail & Related papers (2022-03-17T19:48:43Z) - nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on an empirical combination of self-attention and convolution.
nnFormer achieves substantial improvements over previous transformer-based methods on two commonly used datasets, Synapse and ACDC.
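A rough sketch of what one interleaved stage could look like, assuming a 3-D convolutional step followed by a self-attention step over flattened voxel tokens; nnFormer's actual stage layout and hyperparameters are not described in this summary.

```python
# Hedged sketch: a local convolutional step interleaved with a global
# self-attention step over voxel tokens of a 3-D volume.
import torch
import torch.nn as nn

class InterleavedBlock(nn.Module):
    def __init__(self, channels: int = 64, n_heads: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm3d(channels), nn.GELU(),
        )
        self.attn = nn.TransformerEncoderLayer(channels, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, D, H, W) volume
        x = self.conv(x)                                 # local, convolutional step
        B, C, D, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, D*H*W, C) voxel tokens
        tokens = self.attn(tokens)                       # global, self-attention step
        return tokens.transpose(1, 2).reshape(B, C, D, H, W)
```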
arXiv Detail & Related papers (2021-09-07T17:08:24Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - Effects of Parameter Norm Growth During Transformer Training: Inductive
Bias from Gradient Descent [44.44543743806831]
We study the tendency for transformer parameters to grow in magnitude (parameter norm) during training.
As the parameters grow in magnitude, we prove that the network approximates a discretized network with saturated activation functions.
Our results suggest that saturation is a new characterization of an inductive bias implicit in gradient descent, of particular interest for NLP.
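A tiny numeric illustration of the saturation claim: scaling attention logits by a growing parameter norm drives the softmax towards a one-hot, step-like selection.

```python
# Small illustration of saturation: as the scale on the logits grows, the
# softmax distribution concentrates on the argmax entry, i.e. it behaves like
# a hard, saturated step function. The scale values are arbitrary.
import torch

logits = torch.tensor([2.0, 1.0, 0.5])
for scale in (1.0, 4.0, 16.0, 64.0):
    probs = torch.softmax(scale * logits, dim=0)
    print(f"norm scale {scale:5.1f} -> {probs.tolist()}")
```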
arXiv Detail & Related papers (2020-10-19T17:40:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.