Addressing Some Limitations of Transformers with Feedback Memory
- URL: http://arxiv.org/abs/2002.09402v3
- Date: Mon, 25 Jan 2021 13:12:00 GMT
- Title: Addressing Some Limitations of Transformers with Feedback Memory
- Authors: Angela Fan, Thibaut Lavril, Edouard Grave, Armand Joulin, Sainbayar
Sukhbaatar
- Abstract summary: Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks.
We propose the Feedback Transformer architecture that exposes all previous representations to all future representations.
We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
- Score: 51.94640029417114
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have been successfully applied to sequential, auto-regressive
tasks despite being feedforward networks. Unlike recurrent neural networks,
Transformers use attention to capture temporal relations while processing input
tokens in parallel. While this parallelization makes them computationally
efficient, it restricts the model from fully exploiting the sequential nature
of the input. The representation at a given layer can only access
representations from lower layers, rather than the higher level representations
already available. In this work, we propose the Feedback Transformer
architecture that exposes all previous representations to all future
representations, meaning the lowest representation of the current timestep is
formed from the highest-level abstract representation of the past. We
demonstrate on a variety of benchmarks in language modeling, machine
translation, and reinforcement learning that the increased representation
capacity can create small, shallow models with much stronger performance than
comparable Transformers.
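The feedback mechanism described in the abstract can be summarized in a short sketch. The following PyTorch-style code is only a minimal illustration under assumed module choices (the layer type, dimensions, and the `layer_mix` weighting are placeholders, not the authors' implementation): each timestep's representations from every layer are mixed into a single memory vector, and every layer at later timesteps attends to those memory vectors, so even the lowest layer can read the highest-level abstractions of the past.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedbackTransformerSketch(nn.Module):
    """Illustrative sketch: every layer at step t attends to a shared
    per-step memory built from ALL layers of previous steps."""

    def __init__(self, d_model=64, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
             for _ in range(n_layers)]
        )
        # learnable weights that mix all layer outputs into one memory vector
        self.layer_mix = nn.Parameter(torch.zeros(n_layers + 1))

    def forward(self, x):
        # x: (batch, seq, d_model); tokens are processed one step at a time,
        # because the memory for step t already contains top-layer information.
        batch, seq, d = x.shape
        memory = []                       # list of (batch, 1, d) memory vectors
        outputs = []
        for t in range(seq):
            h = x[:, t:t+1, :]            # current token, lowest representation
            states = [h]
            mem = torch.cat(memory, dim=1) if memory else h
            for layer in self.layers:
                # each layer attends to the feedback memory of past steps
                h = layer(h, mem)
                states.append(h)
            outputs.append(h)
            # memory vector = softmax-weighted sum of all layer states at step t
            w = F.softmax(self.layer_mix, dim=0)
            m = sum(w_i * s for w_i, s in zip(w, states))
            memory.append(m)
        return torch.cat(outputs, dim=1)
```

A consequence visible in the sketch is that tokens must be processed sequentially rather than in parallel, since each step's memory already contains top-layer information; the abstract's claim is that the added representational capacity lets small, shallow models outperform comparable Transformers despite this.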
Related papers
- Transformers need glasses! Information over-squashing in language tasks [18.81066657470662]
We study how information propagates in decoder-only Transformers.
We show that certain sequences of inputs to the Transformer can yield arbitrarily close representations in the final token.
We also show that decoder-only Transformer language models can lose sensitivity to specific tokens in the input.
arXiv Detail & Related papers (2024-06-06T17:14:44Z)
- Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals [31.328766460487355]
We show that self-attention layers in transformers minimize a functional which promotes smoothness, thereby causing token uniformity.
We propose a novel regularizer that penalizes the norm of the difference between the smooth output tokens of self-attention and the input tokens, preserving the fidelity of the tokens (see the sketch after this entry).
We empirically demonstrate the advantages of NeuTRENO over the baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations.
arXiv Detail & Related papers (2023-12-01T17:52:47Z)
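The regularizer described in the NeuTRENO entry above can be sketched in a few lines. This is only an illustration of the stated idea, with an assumed function name and coefficient rather than the paper's notation: penalize how far the self-attention outputs drift from their inputs.

```python
import torch

def fidelity_regularizer(attn_output, attn_input, coeff=0.1):
    """Penalize the distance between self-attention outputs and their inputs,
    discouraging token representations from collapsing toward one smooth value.
    `coeff` is an illustrative hyperparameter, not a value from the paper."""
    return coeff * (attn_output - attn_input).pow(2).sum(dim=-1).mean()

# hypothetical usage inside a training step:
#   loss = task_loss + fidelity_regularizer(attn_out, attn_in)
```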
- iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions (see the sketch after this entry).
The iTransformer model achieves state-of-the-art results on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z)
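The "inverted dimensions" idea in the iTransformer entry above amounts to swapping which axis is tokenized. Below is a minimal PyTorch-style sketch under assumed shapes (the sequence length, model width, and module names are illustrative): each variate's full series is embedded as one token, so attention mixes variates while the feed-forward network operates on each variate's series representation.

```python
import torch
import torch.nn as nn

class InvertedEmbeddingSketch(nn.Module):
    """Sketch of the 'inverted' tokenization: each variate's whole series
    becomes one token, so attention mixes variates rather than time steps."""

    def __init__(self, seq_len=96, d_model=64):
        super().__init__()
        self.embed = nn.Linear(seq_len, d_model)   # embed a full series per variate

    def forward(self, x):
        # x: (batch, time, variates) -> (batch, variates, time)
        x = x.transpose(1, 2)
        tokens = self.embed(x)                     # (batch, variates, d_model)
        return tokens                              # fed to a standard encoder

# hypothetical usage:
#   encoder = nn.TransformerEncoder(
#       nn.TransformerEncoderLayer(64, nhead=4, batch_first=True), num_layers=2)
#   variate_tokens = InvertedEmbeddingSketch()(torch.randn(8, 96, 7))
#   mixed = encoder(variate_tokens)                # attention across variates
```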
- Functional Interpolation for Relative Positions Improves Long Context Transformers [86.12843093589]
We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts (see the sketch after this entry).
We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, Alibi, and Kerple.
We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
arXiv Detail & Related papers (2023-10-06T17:59:11Z)
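The FIRE entry above describes a learned functional relative position encoding. The sketch below shows only the general shape of such an approach (the normalization, log transform, and MLP here are illustrative assumptions, not FIRE's exact formulation): a small learned function maps a normalized relative distance to an additive attention bias, so it can be evaluated at distances longer than those seen during training.

```python
import torch
import torch.nn as nn

class FunctionalRelativeBiasSketch(nn.Module):
    """Rough sketch of a functional relative position bias: an MLP maps a
    normalized relative distance to an additive attention bias."""

    def __init__(self, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, seq_len):
        i = torch.arange(seq_len).unsqueeze(1)       # query positions
        j = torch.arange(seq_len).unsqueeze(0)       # key positions
        rel = (i - j).clamp(min=0).float()           # causal relative distances
        # progressive normalization: distances are scaled by the query position,
        # so inputs stay in a bounded range as contexts grow (illustrative choice)
        norm = torch.log1p(rel) / torch.log1p(i.float().clamp(min=1))
        bias = self.mlp(norm.unsqueeze(-1)).squeeze(-1)   # (seq_len, seq_len)
        return bias                                  # added to attention logits
```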
- Temporal Latent Bottleneck: Synthesis of Fast and Slow Processing Mechanisms in Sequence Learning [85.95599675484341]
Recurrent neural networks have a strong inductive bias towards learning temporally compressed representations.
Transformers have little inductive bias towards learning temporally compressed representations.
arXiv Detail & Related papers (2022-05-30T00:12:33Z)
- Incorporating Convolution Designs into Visual Transformers [24.562955955312187]
We propose a new Convolution-enhanced image Transformer (CeiT), which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data and extra CNN teachers.
arXiv Detail & Related papers (2021-03-22T13:16:12Z)
- Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer, which gradually compresses the sequence of hidden states to a shorter one (see the sketch after this entry).
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
arXiv Detail & Related papers (2020-06-05T05:16:23Z)
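The Funnel-Transformer entry above describes gradually compressing the hidden-state sequence. The sketch below illustrates that idea with an assumed pooling choice (strided mean pooling that halves the length between blocks); the actual model also re-expands the compressed sequence for token-level prediction, which is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FunnelBlockSketch(nn.Module):
    """Sketch of funnel-style compression: pool the hidden states to half the
    length after an encoder block so later blocks run on a shorter sequence."""

    def __init__(self, d_model=64):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)

    def forward(self, h):
        h = self.block(h)                              # (batch, seq, d)
        # strided mean pooling halves the sequence length (illustrative choice)
        h = F.avg_pool1d(h.transpose(1, 2), kernel_size=2, stride=2).transpose(1, 2)
        return h                                       # (batch, seq // 2, d)
```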
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.