Transformers are RNNs: Fast Autoregressive Transformers with Linear
Attention
- URL: http://arxiv.org/abs/2006.16236v3
- Date: Mon, 31 Aug 2020 11:09:32 GMT
- Title: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
- Authors: Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas and François Fleuret
- Abstract summary: Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, they are prohibitively slow for very long sequences.
We make use of the associativity property of matrix products to reduce the complexity from $\mathcal{O}\left(N^2\right)$ to $\mathcal{O}\left(N\right)$, where $N$ is the sequence length.
Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000x faster on autoregressive prediction of very long sequences.
- Score: 22.228028613802174
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers achieve remarkable performance in several tasks but due to their
quadratic complexity, with respect to the input's length, they are
prohibitively slow for very long sequences. To address this limitation, we
express the self-attention as a linear dot-product of kernel feature maps and
make use of the associativity property of matrix products to reduce the
complexity from $\mathcal{O}\left(N^2\right)$ to $\mathcal{O}\left(N\right)$,
where $N$ is the sequence length. We show that this formulation permits an
iterative implementation that dramatically accelerates autoregressive
transformers and reveals their relationship to recurrent neural networks. Our
linear transformers achieve similar performance to vanilla transformers and
they are up to 4000x faster on autoregressive prediction of very long
sequences.
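As a concrete illustration, the following is a minimal NumPy sketch of the causal (autoregressive) linear attention described in the abstract: a positive feature map replaces the softmax, and associativity lets two running sums act as the recurrent state, so the cost per token stays constant instead of growing with the prefix length. The elu(x)+1 feature map and all variable names are illustrative; this is a reference sketch, not the paper's optimized implementation.
```python
import numpy as np

def elu_plus_one(x):
    # Feature map phi(x) = elu(x) + 1, which keeps all "attention weights" positive.
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V):
    """Autoregressive linear attention computed as an RNN over the sequence.

    Q, K: (N, d) queries/keys, V: (N, d_v) values.
    The running sums S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j) act as the
    recurrent state, giving O(N * d * d_v) cost instead of O(N^2).
    """
    Q, K = elu_plus_one(Q), elu_plus_one(K)
    N, d = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d, d_v))   # running sum of phi(k_j) v_j^T
    z = np.zeros(d)          # running sum of phi(k_j), for normalization
    out = np.empty((N, d_v))
    for i in range(N):
        S += np.outer(K[i], V[i])
        z += K[i]
        out[i] = Q[i] @ S / (Q[i] @ z + 1e-6)
    return out

# Example: a 1024-token sequence with 64-dimensional heads.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((1024, 64)) for _ in range(3))
print(causal_linear_attention(Q, K, V).shape)  # (1024, 64)
```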
Related papers
- Parallelizing Linear Transformers with the Delta Rule over Sequence Length [49.88826673324244]
This work describes a hardware-efficient algorithm for training linear transformers with the delta rule.
We train a 1.3B model for 100B tokens and find that it outperforms recent linear-time baselines.
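For orientation, a sequential reference sketch of the delta-rule update (in the spirit of DeltaNet) is shown below; the cited paper's contribution is a hardware-efficient algorithm that parallelizes this recurrence over the sequence length, which the naive loop here does not attempt. The variable names, unit-norm keys, and constant write strength beta are illustrative assumptions.
```python
import numpy as np

def delta_rule_attention(Q, K, V, beta):
    """Sequential (reference) form of delta-rule linear attention.

    Instead of the purely additive update S += k v^T, the delta rule first
    removes the value currently predicted for k_t and then writes the new one:
        S_t = S_{t-1} + beta_t * k_t (v_t - k_t^T S_{t-1})^T
    """
    N, d = K.shape
    d_v = V.shape[1]
    S = np.zeros((d, d_v))
    out = np.empty((N, d_v))
    for t in range(N):
        pred = K[t] @ S                               # value currently stored under key k_t
        S = S + beta[t] * np.outer(K[t], V[t] - pred)  # error-correcting write
        out[t] = Q[t] @ S
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
K = K / np.linalg.norm(K, axis=1, keepdims=True)   # unit-norm keys keep the update stable
beta = np.full(256, 0.5)                           # per-token write strength in (0, 1)
print(delta_rule_attention(Q, K, V, beta).shape)   # (256, 64)
```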
arXiv Detail & Related papers (2024-06-10T17:24:42Z)
- Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers [13.355338760884583]
We propose the Memory-efficient Bidirectional Transformer (MeBT) for end-to-end learning of long-term dependency in videos.
Our method learns to decode the entire temporal volume of a video in parallel from partially observed patches.
arXiv Detail & Related papers (2023-03-20T16:35:38Z)
- What Dense Graph Do You Need for Self-Attention? [73.82686008622596]
We present Hypercube Transformer, a sparse Transformer that models token interactions in a hypercube and achieves comparable or even better results than the vanilla Transformer.
Experiments on tasks requiring various sequence lengths validate our graph function.
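As a rough illustration of what a hypercube-structured attention pattern could look like, the sketch below builds a boolean mask in which token i attends to token j when their binary indices differ in exactly one bit (a hypercube edge), plus itself. This reading of "token interactions in a hypercube" is an assumption for illustration; the paper's actual graph construction may differ.
```python
import numpy as np

def hypercube_attention_mask(n_tokens):
    """Boolean attention mask over a hypercube graph on token indices.

    Token i may attend to token j when their binary indices differ in exactly
    one bit, or when i == j. Each token then has about log2(N) neighbors, so
    the mask has O(N log N) nonzeros instead of O(N^2).
    NOTE: illustrative assumption; the paper's exact graph may differ.
    """
    idx = np.arange(n_tokens)
    xor = idx[:, None] ^ idx[None, :]                       # pairwise XOR of token indices
    differ_one_bit = (xor != 0) & ((xor & (xor - 1)) == 0)  # XOR is a power of two
    return differ_one_bit | np.eye(n_tokens, dtype=bool)

mask = hypercube_attention_mask(16)
print(mask.sum(axis=1))  # each of the 16 tokens sees itself plus 4 neighbors
```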
arXiv Detail & Related papers (2022-05-27T14:36:55Z)
- Linearizing Transformer with Key-Value Memory Bank [54.83663647680612]
We propose MemSizer, an approach that projects the source sequence into a lower-dimensional representation.
MemSizer not only achieves linear time complexity but also enjoys efficient recurrent-style autoregressive generation.
We demonstrate that MemSizer provides an improved tradeoff between efficiency and accuracy over the vanilla transformer.
arXiv Detail & Related papers (2022-03-23T18:10:18Z)
- Block-Recurrent Transformers [49.07682696216708]
We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence.
Our recurrent cell operates on blocks of tokens rather than single tokens, and leverages parallel computation within a block in order to make efficient use of accelerator hardware.
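The following toy NumPy sketch illustrates the block-recurrence pattern described above: attention runs in parallel within each block, while a small set of state vectors carries information across blocks. It is a deliberate simplification; the paper's recurrent cell additionally uses learned projections and gating, which are omitted here.
```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    # Plain single-head dot-product attention (no learned projections, for brevity).
    scores = queries @ keys.T / np.sqrt(keys.shape[1])
    return softmax(scores, axis=-1) @ values

def block_recurrent_layer(tokens, state, block_size):
    """Toy block-recurrent layer: parallel attention inside each block,
    recurrence across blocks through a small state of vectors."""
    outputs = []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        context = np.concatenate([state, block], axis=0)
        outputs.append(attend(block, context, context))  # tokens see their block plus carried state
        state = attend(state, block, block)              # state summarizes the block for later blocks
    return np.concatenate(outputs, axis=0), state

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4096, 64))   # long sequence
state = np.zeros((16, 64))                 # 16 recurrent state vectors
out, state = block_recurrent_layer(tokens, state, block_size=512)
print(out.shape, state.shape)              # (4096, 64) (16, 64)
```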
arXiv Detail & Related papers (2022-03-11T23:44:33Z)
- Sparse is Enough in Scaling Transformers [12.561317511514469]
Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study become out of reach.
We propose Scaling Transformers, a family of next generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer.
arXiv Detail & Related papers (2021-11-24T19:53:46Z)
- Towards Incremental Transformers: An Empirical Analysis of Transformer Models for Incremental NLU [19.103130032967663]
Incremental processing allows interactive systems to respond based on partial inputs.
Recent work attempts to apply Transformers incrementally via restart-incrementality.
This approach is computationally costly and does not scale efficiently for long sequences.
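To make the cost argument concrete, restart-incrementality can be sketched as re-running a non-incremental model on every growing prefix: a sequence of length N needs N forward passes covering O(N^2) tokens in total, before even accounting for self-attention's own quadratic cost per pass. The model callable below is a stand-in for illustration.
```python
def restart_incremental_outputs(model, tokens):
    """Restart-incrementality: to produce an output after each new input token,
    re-run the full (non-incremental) model on every growing prefix."""
    return [model(tokens[:t]) for t in range(1, len(tokens) + 1)]

# Toy usage with a stand-in "model" that just averages its input prefix.
prefix_means = restart_incremental_outputs(lambda xs: sum(xs) / len(xs), [3, 1, 4, 1, 5])
print(prefix_means)  # [3.0, 2.0, 2.666..., 2.25, 2.8]
```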
arXiv Detail & Related papers (2021-09-15T15:20:29Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
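The core linear-algebra fact behind this result is that a Toeplitz matrix-vector product can be computed in O(N log N) by embedding the matrix in a circulant one and applying the FFT. The NumPy sketch below demonstrates only this generic Toeplitz trick, not the paper's full kernelized-attention algorithm; the indexing convention for the relative-position entries b is an illustrative assumption.
```python
import numpy as np

def toeplitz_matvec_fft(b, v):
    """Multiply a Toeplitz matrix by a vector in O(N log N) with the FFT.

    b[k + N - 1] holds the entry for offset k = i - j, k in [-(N-1), N-1],
    so the matrix is T[i, j] = b[(i - j) + N - 1]. We embed T in a 2N x 2N
    circulant matrix, whose matvec is a circular convolution computable by FFT.
    """
    N = v.shape[0]
    # First column of the circulant: offsets 0..N-1, one pad entry, offsets -(N-1)..-1.
    col = np.concatenate([b[N - 1:], np.zeros(1), b[:N - 1]])
    prod = np.fft.ifft(np.fft.fft(col) * np.fft.fft(np.concatenate([v, np.zeros(N)])))
    return prod[:N].real

# Check against the dense Toeplitz product.
N = 8
rng = np.random.default_rng(1)
b = rng.standard_normal(2 * N - 1)            # b[k + N - 1] = entry for offset k
v = rng.standard_normal(N)
T = np.array([[b[(i - j) + N - 1] for j in range(N)] for i in range(N)])
assert np.allclose(T @ v, toeplitz_matvec_fft(b, v))
```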
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.