Addressing Some Limitations of Transformers with Feedback Memory
- URL: http://arxiv.org/abs/2002.09402v3
- Date: Mon, 25 Jan 2021 13:12:00 GMT
- Title: Addressing Some Limitations of Transformers with Feedback Memory
- Authors: Angela Fan, Thibaut Lavril, Edouard Grave, Armand Joulin, Sainbayar
Sukhbaatar
- Abstract summary: Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks.
We propose the Feedback Transformer architecture that exposes all previous representations to all future representations.
We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
- Score: 51.94640029417114
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have been successfully applied to sequential, auto-regressive
tasks despite being feedforward networks. Unlike recurrent neural networks,
Transformers use attention to capture temporal relations while processing input
tokens in parallel. While this parallelization makes them computationally
efficient, it restricts the model from fully exploiting the sequential nature
of the input. The representation at a given layer can only access
representations from lower layers, rather than the higher level representations
already available. In this work, we propose the Feedback Transformer
architecture that exposes all previous representations to all future
representations, meaning the lowest representation of the current timestep is
formed from the highest-level abstract representation of the past. We
demonstrate on a variety of benchmarks in language modeling, machine
translation, and reinforcement learning that the increased representation
capacity can create small, shallow models with much stronger performance than
comparable Transformers.
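The feedback mechanism described in the abstract can be summarized in a short sketch. The following PyTorch-style code is only a minimal illustration under assumed module choices (the layer type, dimensions, and the `layer_mix` weighting are placeholders, not the authors' implementation): each timestep's representations from every layer are mixed into a single memory vector, and every layer at later timesteps attends to those memory vectors, so even the lowest layer can read the highest-level abstractions of the past.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedbackTransformerSketch(nn.Module):
    """Illustrative sketch: every layer at step t attends to a shared
    per-step memory built from ALL layers of previous steps."""

    def __init__(self, d_model=64, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
             for _ in range(n_layers)]
        )
        # learnable weights that mix all layer outputs into one memory vector
        self.layer_mix = nn.Parameter(torch.zeros(n_layers + 1))

    def forward(self, x):
        # x: (batch, seq, d_model); tokens are processed one step at a time,
        # because the memory for step t already contains top-layer information.
        batch, seq, d = x.shape
        memory = []                       # list of (batch, 1, d) memory vectors
        outputs = []
        for t in range(seq):
            h = x[:, t:t+1, :]            # current token, lowest representation
            states = [h]
            mem = torch.cat(memory, dim=1) if memory else h
            for layer in self.layers:
                # each layer attends to the feedback memory of past steps
                h = layer(h, mem)
                states.append(h)
            outputs.append(h)
            # memory vector = softmax-weighted sum of all layer states at step t
            w = F.softmax(self.layer_mix, dim=0)
            m = sum(w_i * s for w_i, s in zip(w, states))
            memory.append(m)
        return torch.cat(outputs, dim=1)
```

A consequence visible in the sketch is that tokens must be processed sequentially rather than in parallel, since each step's memory already contains top-layer information; the abstract's claim is that the added representational capacity lets small, shallow models outperform comparable Transformers despite this.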
Related papers
- Transformers need glasses! Information over-squashing in language tasks [18.81066657470662]
We study how information propagates in decoder-only Transformers.
We show that certain sequences of inputs to the Transformer can yield arbitrarily close representations in the final token.
We also show that decoder-only Transformer language models can lose sensitivity to specific tokens in the input.
arXiv Detail & Related papers (2024-06-06T17:14:44Z)
- Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals [31.328766460487355]
We show that self-attention layers in transformers minimize a functional which promotes smoothness, thereby causing token uniformity.
We propose a novel regularizer that penalizes the norm of the difference between the smooth output tokens of self-attention and the input tokens, preserving the fidelity of the tokens (see the sketch after this entry).
We empirically demonstrate the advantages of NeuTRENO over the baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations.
arXiv Detail & Related papers (2023-12-01T17:52:47Z)
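The regularizer described in the NeuTRENO entry above can be sketched in a few lines. This is only an illustration of the stated idea, with an assumed function name and coefficient rather than the paper's notation: penalize how far the self-attention outputs drift from their inputs.

```python
import torch

def fidelity_regularizer(attn_output, attn_input, coeff=0.1):
    """Penalize the distance between self-attention outputs and their inputs,
    discouraging token representations from collapsing toward one smooth value.
    `coeff` is an illustrative hyperparameter, not a value from the paper."""
    return coeff * (attn_output - attn_input).pow(2).sum(dim=-1).mean()

# hypothetical usage inside a training step:
#   loss = task_loss + fidelity_regularizer(attn_out, attn_in)
```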
- iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions (see the sketch after this entry).
The iTransformer model achieves state-of-the-art results on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z)
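The "inverted dimensions" idea in the iTransformer entry above amounts to swapping which axis is tokenized. Below is a minimal PyTorch-style sketch under assumed shapes (the sequence length, model width, and module names are illustrative): each variate's full series is embedded as one token, so attention mixes variates while the feed-forward network operates on each variate's series representation.

```python
import torch
import torch.nn as nn

class InvertedEmbeddingSketch(nn.Module):
    """Sketch of the 'inverted' tokenization: each variate's whole series
    becomes one token, so attention mixes variates rather than time steps."""

    def __init__(self, seq_len=96, d_model=64):
        super().__init__()
        self.embed = nn.Linear(seq_len, d_model)   # embed a full series per variate

    def forward(self, x):
        # x: (batch, time, variates) -> (batch, variates, time)
        x = x.transpose(1, 2)
        tokens = self.embed(x)                     # (batch, variates, d_model)
        return tokens                              # fed to a standard encoder

# hypothetical usage:
#   encoder = nn.TransformerEncoder(
#       nn.TransformerEncoderLayer(64, nhead=4, batch_first=True), num_layers=2)
#   variate_tokens = InvertedEmbeddingSketch()(torch.randn(8, 96, 7))
#   mixed = encoder(variate_tokens)                # attention across variates
```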
- Functional Interpolation for Relative Positions Improves Long Context Transformers [86.12843093589]
We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts (see the sketch after this entry).
We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, Alibi, and Kerple.
We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
arXiv Detail & Related papers (2023-10-06T17:59:11Z)
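The FIRE entry above describes a learned functional relative position encoding. The sketch below shows only the general shape of such an approach (the normalization, log transform, and MLP here are illustrative assumptions, not FIRE's exact formulation): a small learned function maps a normalized relative distance to an additive attention bias, so it can be evaluated at distances longer than those seen during training.

```python
import torch
import torch.nn as nn

class FunctionalRelativeBiasSketch(nn.Module):
    """Rough sketch of a functional relative position bias: an MLP maps a
    normalized relative distance to an additive attention bias."""

    def __init__(self, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, seq_len):
        i = torch.arange(seq_len).unsqueeze(1)       # query positions
        j = torch.arange(seq_len).unsqueeze(0)       # key positions
        rel = (i - j).clamp(min=0).float()           # causal relative distances
        # progressive normalization: distances are scaled by the query position,
        # so inputs stay in a bounded range as contexts grow (illustrative choice)
        norm = torch.log1p(rel) / torch.log1p(i.float().clamp(min=1))
        bias = self.mlp(norm.unsqueeze(-1)).squeeze(-1)   # (seq_len, seq_len)
        return bias                                  # added to attention logits
```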
- Temporal Latent Bottleneck: Synthesis of Fast and Slow Processing Mechanisms in Sequence Learning [85.95599675484341]
Recurrent neural networks have a strong inductive bias towards learning temporally compressed representations.
Transformers have little inductive bias towards learning temporally compressed representations.
arXiv Detail & Related papers (2022-05-30T00:12:33Z)
- Incorporating Convolution Designs into Visual Transformers [24.562955955312187]
We propose a new Convolution-enhanced image Transformer (CeiT), which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data and extra CNN teachers.
arXiv Detail & Related papers (2021-03-22T13:16:12Z)
- Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer, which gradually compresses the sequence of hidden states to a shorter one (see the sketch after this entry).
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
arXiv Detail & Related papers (2020-06-05T05:16:23Z)
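The Funnel-Transformer entry above describes gradually compressing the hidden-state sequence. The sketch below illustrates that idea with an assumed pooling choice (strided mean pooling that halves the length between blocks); the actual model also re-expands the compressed sequence for token-level prediction, which is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FunnelBlockSketch(nn.Module):
    """Sketch of funnel-style compression: pool the hidden states to half the
    length after an encoder block so later blocks run on a shorter sequence."""

    def __init__(self, d_model=64):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)

    def forward(self, h):
        h = self.block(h)                              # (batch, seq, d)
        # strided mean pooling halves the sequence length (illustrative choice)
        h = F.avg_pool1d(h.transpose(1, 2), kernel_size=2, stride=2).transpose(1, 2)
        return h                                       # (batch, seq // 2, d)
```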
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.