Deep Transformers without Shortcuts: Modifying Self-attention for
Faithful Signal Propagation
- URL: http://arxiv.org/abs/2302.10322v1
- Date: Mon, 20 Feb 2023 21:26:25 GMT
- Title: Deep Transformers without Shortcuts: Modifying Self-attention for
Faithful Signal Propagation
- Authors: Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andrew
Brock, Samuel L Smith, Yee Whye Teh
- Abstract summary: Skip connections and normalisation layers are ubiquitous for the training of Deep Neural Networks (DNNs).
Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them.
However, these approaches are incompatible with the self-attention layers present in transformers.
- Score: 105.22961467028234
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Skip connections and normalisation layers form two standard architectural
components that are ubiquitous for the training of Deep Neural Networks (DNNs),
but whose precise roles are poorly understood. Recent approaches such as Deep
Kernel Shaping have made progress towards reducing our reliance on them, using
insights from wide NN kernel theory to improve signal propagation in vanilla
DNNs (which we define as networks without skips or normalisation). However,
these approaches are incompatible with the self-attention layers present in
transformers, whose kernels are intrinsically more complicated to analyse and
control. And so the question remains: is it possible to train deep vanilla
transformers? We answer this question in the affirmative by designing several
approaches that use combinations of parameter initialisations, bias matrices
and location-dependent rescaling to achieve faithful signal propagation in
vanilla transformers. Our methods address various intricacies specific to
signal propagation in transformers, including the interaction with positional
encoding and causal masking. In experiments on WikiText-103 and C4, our
approaches enable deep transformers without normalisation to train at speeds
matching their standard counterparts, and deep vanilla transformers to reach
the same performance as standard ones after about 5 times more iterations.
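The abstract does not spell out the constructions themselves, but the core idea can be illustrated with a minimal PyTorch sketch (an assumption, not the authors' exact bias-matrix or rescaling construction): a causally masked self-attention layer whose attention matrix is mixed with the identity via trainable per-head scalars, initialised so that no cross-token mixing happens at initialisation and signal propagation stays easy to control.

```python
# Minimal sketch, not the paper's exact method: self-attention whose output is
# (alpha * I + beta * A(X)) @ V with alpha=1, beta=0 at initialisation, so the
# layer performs no cross-token mixing at init. The paper additionally uses
# bias matrices and location-dependent rescaling, not reproduced here.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class IdentityAtInitSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        # Trainable per-head mixing scalars (illustrative assumption).
        self.alpha = nn.Parameter(torch.ones(n_heads))
        self.beta = nn.Parameter(torch.zeros(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        attn = F.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)
        eye = torch.eye(T, device=x.device)
        mixed = (self.alpha.view(1, -1, 1, 1) * eye
                 + self.beta.view(1, -1, 1, 1) * attn) @ v   # (B, heads, T, d_head)
        return self.out(mixed.transpose(1, 2).reshape(B, T, D))
```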
Related papers
- Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot [50.16171384920963]
The transformer architecture has prevailed in various deep learning settings.
A one-layer transformer trained with gradient descent provably learns the sparse token selection task.
arXiv Detail & Related papers (2024-06-11T02:15:53Z) - Simplifying Transformer Blocks [30.451976405521112]
In this work, we ask: to what extent can the standard transformer block be simplified?
We motivate modifications that allow many block components to be removed with no loss of training speed.
In experiments on both autoregressive decoder-only and BERT encoder-only models, our simplified transformers emulate the per-update training speed and performance of standard transformers.
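As a purely illustrative reading of "removing block components", the sketch below is a pre-norm-style transformer block whose skip connections and normalisation layers can be switched off; which components the paper actually removes is not stated in this summary, so the flags and sizes here are assumptions.

```python
# Illustrative sketch only: a transformer block where skip connections and
# normalisation are optional, to make "removing block components" concrete.
import torch
import torch.nn as nn

class SimplifiableBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int,
                 use_skip: bool = False, use_norm: bool = False):
        super().__init__()
        self.use_skip = use_skip
        self.norm1 = nn.LayerNorm(d_model) if use_norm else nn.Identity()
        self.norm2 = nn.LayerNorm(d_model) if use_norm else nn.Identity()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, need_weights=False)   # self-attention sub-block
        x = x + a if self.use_skip else a
        m = self.mlp(self.norm2(x))                     # MLP sub-block
        return x + m if self.use_skip else m
```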
arXiv Detail & Related papers (2023-11-03T13:30:52Z) - Characterization of anomalous diffusion through convolutional
transformers [0.8984888893275713]
We propose a new transformer-based neural network architecture for the characterization of anomalous diffusion.
Our new architecture, the Convolutional Transformer (ConvTransformer), uses a bi-layered convolutional neural network to extract features from our diffusive trajectories.
We show that the ConvTransformer is able to outperform the previous state of the art at determining the underlying diffusive regime in short trajectories.
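A minimal sketch of the pipeline just described, assuming 2-D trajectories, a two-layer ("bi-layered") 1-D convolutional feature extractor, a small transformer encoder, and a classification head over a handful of diffusive regimes; all names and sizes are illustrative, not taken from the paper.

```python
# Hedged sketch: convolutional feature extraction over trajectory coordinates,
# then a transformer encoder, then classification of the diffusive regime.
import torch
import torch.nn as nn

class ConvTransformerSketch(nn.Module):
    def __init__(self, coord_dim: int = 2, d_model: int = 64,
                 n_heads: int = 4, n_layers: int = 2, n_regimes: int = 5):
        super().__init__()
        self.conv = nn.Sequential(                      # "bi-layered" CNN (assumed)
            nn.Conv1d(coord_dim, d_model, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_regimes)

    def forward(self, traj: torch.Tensor) -> torch.Tensor:
        # traj: (batch, time, coord_dim) diffusive trajectory
        h = self.conv(traj.transpose(1, 2)).transpose(1, 2)  # (batch, time, d_model)
        h = self.encoder(h)
        return self.head(h.mean(dim=1))                 # logits over diffusive regimes
```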
arXiv Detail & Related papers (2022-10-10T18:53:13Z) - Error Correction Code Transformer [92.10654749898927]
We propose, for the first time, to extend the Transformer architecture to the soft decoding of linear codes at arbitrary block lengths.
We encode each channel output dimension into a high-dimensional embedding so that the information of each bit can be better represented and processed separately.
The proposed approach demonstrates the extreme power and flexibility of Transformers and outperforms existing state-of-the-art neural decoders by large margins at a fraction of their time complexity.
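A hedged sketch of that encoding idea: each received channel value is lifted into a high-dimensional embedding, combined with a bit-position embedding, passed through a transformer encoder, and mapped to a per-bit soft output. Dimensions, the positional scheme, and the decoding head are assumptions, not the paper's construction.

```python
# Hedged sketch: per-bit high-dimensional embeddings of noisy channel outputs,
# a transformer encoder, and per-bit soft decisions.
import torch
import torch.nn as nn

class SoftDecoderSketch(nn.Module):
    def __init__(self, block_length: int, d_model: int = 128,
                 n_heads: int = 8, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Linear(1, d_model)               # lift each channel value
        self.pos = nn.Embedding(block_length, d_model)   # bit-position embedding (assumed)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)                # soft output per bit

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, block_length) noisy channel outputs
        pos = torch.arange(y.size(1), device=y.device)
        h = self.embed(y.unsqueeze(-1)) + self.pos(pos)
        return self.head(self.encoder(h)).squeeze(-1)    # per-bit logits
```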
arXiv Detail & Related papers (2022-03-27T15:25:58Z) - SepTr: Separable Transformer for Audio Spectrogram Processing [74.41172054754928]
We propose a new vision transformer architecture called Separable Transformer (SepTr).
SepTr employs two transformer blocks in a sequential manner, the first attending to tokens within the same frequency bin, and the second attending to tokens within the same time interval.
We conduct experiments on three benchmark data sets, showing that our architecture outperforms conventional vision transformers and other state-of-the-art methods.
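The separable attention pattern lends itself to a short sketch: here each time-frequency bin is a token (a simplifying assumption; the paper's tokenisation may differ), the first block attends within each frequency bin (over time) and the second within each time interval (over frequency).

```python
# Hedged sketch of separable attention over a spectrogram: attend along time
# within each frequency bin, then along frequency within each time interval.
import torch
import torch.nn as nn

class SepTrSketch(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4, n_classes: int = 10):
        super().__init__()
        self.proj = nn.Linear(1, d_model)   # one token per time-frequency bin (assumed)
        self.time_block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.freq_block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, time, freq) spectrogram
        B, T, F = spec.shape
        x = self.proj(spec.unsqueeze(-1))                              # (B, T, F, d)
        x = self.time_block(x.permute(0, 2, 1, 3).reshape(B * F, T, -1))   # within each freq bin
        x = x.reshape(B, F, T, -1).permute(0, 2, 1, 3).reshape(B * T, F, -1)
        x = self.freq_block(x)                                          # within each time interval
        x = x.reshape(B, T, F, -1)
        return self.head(x.mean(dim=(1, 2)))                            # class logits
```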
arXiv Detail & Related papers (2022-03-17T19:48:43Z) - nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on an empirical combination of self-attention and convolution.
nnFormer achieves substantial improvements over previous transformer-based methods on two commonly used datasets, Synapse and ACDC.
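A rough sketch of what one interleaved stage could look like, assuming a 3-D convolutional step followed by a self-attention step over flattened voxel tokens; nnFormer's actual stage layout and hyperparameters are not described in this summary.

```python
# Hedged sketch: a local convolutional step interleaved with a global
# self-attention step over voxel tokens of a 3-D volume.
import torch
import torch.nn as nn

class InterleavedBlock(nn.Module):
    def __init__(self, channels: int = 64, n_heads: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm3d(channels), nn.GELU(),
        )
        self.attn = nn.TransformerEncoderLayer(channels, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, D, H, W) volume
        x = self.conv(x)                                 # local, convolutional step
        B, C, D, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, D*H*W, C) voxel tokens
        tokens = self.attn(tokens)                       # global, self-attention step
        return tokens.transpose(1, 2).reshape(B, C, D, H, W)
```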
arXiv Detail & Related papers (2021-09-07T17:08:24Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - Effects of Parameter Norm Growth During Transformer Training: Inductive
Bias from Gradient Descent [44.44543743806831]
We study the tendency for transformer parameters to grow in magnitude (parameter norm) during training.
As the parameters grow in magnitude, we prove that the network approximates a discretized network with saturated activation functions.
Our results suggest that saturation is a new characterization of an inductive bias implicit in gradient descent, of particular interest for NLP.
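A tiny numeric illustration of the saturation claim: scaling attention logits by a growing parameter norm drives the softmax towards a one-hot, step-like selection.

```python
# Small illustration of saturation: as the scale on the logits grows, the
# softmax distribution concentrates on the argmax entry, i.e. it behaves like
# a hard, saturated step function. The scale values are arbitrary.
import torch

logits = torch.tensor([2.0, 1.0, 0.5])
for scale in (1.0, 4.0, 16.0, 64.0):
    probs = torch.softmax(scale * logits, dim=0)
    print(f"norm scale {scale:5.1f} -> {probs.tolist()}")
```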
arXiv Detail & Related papers (2020-10-19T17:40:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.