Linformer: Self-Attention with Linear Complexity
- URL: http://arxiv.org/abs/2006.04768v3
- Date: Sun, 14 Jun 2020 08:15:54 GMT
- Title: Linformer: Self-Attention with Linear Complexity
- Authors: Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma
- Abstract summary: Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications.
The standard self-attention mechanism of the Transformer uses $O(n^2)$ time and space with respect to sequence length.
We propose a new self-attention mechanism, which reduces the overall self-attention complexity from $O(n2)$ to $O(n)$ in both time and space.
- Score: 36.5703957318311
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large transformer models have shown extraordinary success in achieving
state-of-the-art results in many natural language processing applications.
However, training and deploying these models can be prohibitively costly for
long sequences, as the standard self-attention mechanism of the Transformer
uses $O(n^2)$ time and space with respect to sequence length. In this paper, we
demonstrate that the self-attention mechanism can be approximated by a low-rank
matrix. We further exploit this finding to propose a new self-attention
mechanism, which reduces the overall self-attention complexity from $O(n^2)$ to
$O(n)$ in both time and space. The resulting linear transformer, the
\textit{Linformer}, performs on par with standard Transformer models, while
being much more memory- and time-efficient.
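To make the low-rank idea concrete, below is a minimal single-head sketch in PyTorch. The class name, the random initialization of the projections E and F, and the single-head, unshared-projection setup are illustrative assumptions; the paper's multi-head formulation and parameter-sharing variants are omitted.

```python
import torch
import torch.nn as nn

class LinformerSelfAttention(nn.Module):
    """Single-head sketch: keys and values are projected from length n down to
    a fixed length k before attention, so the score matrix is n x k rather
    than n x n, giving linear time and memory in the sequence length."""

    def __init__(self, dim, seq_len, k=64):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # Learned low-rank projections E, F of shape (k, n); random init here
        # is an assumption for illustration.
        self.E = nn.Parameter(torch.randn(k, seq_len) / seq_len ** 0.5)
        self.F = nn.Parameter(torch.randn(k, seq_len) / seq_len ** 0.5)
        self.scale = dim ** -0.5

    def forward(self, x):                                          # x: (batch, n, dim)
        q = self.to_q(x)                                           # (b, n, d)
        keys = torch.einsum('kn,bnd->bkd', self.E, self.to_k(x))   # (b, k, d)
        values = torch.einsum('kn,bnd->bkd', self.F, self.to_v(x)) # (b, k, d)
        scores = torch.einsum('bnd,bkd->bnk', q, keys) * self.scale
        attn = scores.softmax(dim=-1)                              # attention over k projected positions
        return torch.einsum('bnk,bkd->bnd', attn, values)          # (b, n, d)
```

Because attention is computed over the $k$ projected positions rather than all $n$ tokens, memory and time grow linearly with sequence length for a fixed $k$.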
Related papers
- SAMformer: Unlocking the Potential of Transformers in Time Series Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention [14.672072173674039]
We show that transformers are incapable of converging to their true solution despite their high expressive power.
We propose a shallow lightweight transformer model that escapes bad local minima when optimized with sharpness-aware optimization.
In particular, SAMformer surpasses current state-of-the-art methods and is on par with the biggest foundation model MOIRAI while having significantly fewer parameters.
arXiv Detail & Related papers (2024-02-15T18:55:05Z)
- Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator [24.690247474891958]
Fourier Transformer is able to significantly reduce computational costs while retaining the ability to inherit from various large pretrained models.
Our model achieves state-of-the-art performance among all transformer-based models on the long-range modeling benchmark LRA.
For generative seq-to-seq tasks including CNN/DailyMail and ELI5, by inheriting the BART weights our model outperforms the standard BART.
arXiv Detail & Related papers (2023-05-24T12:33:06Z)
- Mega: Moving Average Equipped Gated Attention [150.3124713793503]
Mega is a simple, theoretically grounded, single-head gated attention mechanism equipped with (exponential) moving average.
We show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models.
arXiv Detail & Related papers (2022-09-21T20:52:17Z)
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers can inspire the design of such a factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
- Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
arXiv Detail & Related papers (2021-07-05T18:00:14Z)
- Revisiting Linformer with a modified self-attention with linear complexity [0.0]
I propose an alternative method for self-attention with linear complexity in time and space.
Since this method works for long sequences, it can be used for images as well as audio.
arXiv Detail & Related papers (2020-12-16T13:23:29Z)
- Long Range Arena: A Benchmark for Efficient Transformers [115.1654897514089]
The Long Range Arena benchmark is a suite of tasks consisting of sequences ranging from $1K$ to $16K$ tokens.
We systematically evaluate ten well-established long-range Transformer models on our newly proposed benchmark suite.
arXiv Detail & Related papers (2020-11-08T15:53:56Z)
- Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [22.228028613802174]
Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, they are prohibitively slow for very long sequences.
We make use of the associativity property of matrix products to reduce the complexity from $\mathcal{O}\left(N^2\right)$ to $\mathcal{O}\left(N\right)$, where $N$ is the sequence length.
Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000x faster on autoregressive prediction of very long sequences (a minimal sketch of the associativity trick appears after this list).
arXiv Detail & Related papers (2020-06-29T17:55:38Z)
- $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
- Reformer: The Efficient Transformer [21.425616422007543]
We introduce two techniques to improve the efficiency of Transformers.
We replace dot-product attention with one that uses locality-sensitive hashing, changing its complexity from $O(L^2)$ to $O(L \log L)$, where $L$ is the length of the sequence.
The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences (a minimal sketch of the hashing step appears after this list).
arXiv Detail & Related papers (2020-01-13T18:38:28Z)
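The associativity trick from the "Transformers are RNNs" entry above can be sketched as follows (PyTorch, non-causal case; the elu(x)+1 feature map is the one reported in that paper, while the function names and tensor shapes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def feature_map(x):
    # phi(x) = elu(x) + 1 keeps features positive so the normalizer is well defined.
    return F.elu(x) + 1

def linear_attention(q, k, v, eps=1e-6):
    """Non-causal linear attention: softmax(QK^T)V is replaced by
    phi(Q)(phi(K)^T V) normalized by phi(Q)(phi(K)^T 1). Evaluating the
    products right-to-left costs O(N d^2) instead of O(N^2 d)."""
    q, k = feature_map(q), feature_map(k)                          # (b, n, d)
    kv = torch.einsum('bnd,bne->bde', k, v)                        # phi(K)^T V, (b, d, e)
    z = 1.0 / (torch.einsum('bnd,bd->bn', q, k.sum(dim=1)) + eps)  # per-query normalizer
    return torch.einsum('bnd,bde,bn->bne', q, kv, z)               # (b, n, e)
```

Computing phi(K)^T V first yields a d x d summary that is reused for every query, which is what removes the quadratic dependence on sequence length.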
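The locality-sensitive hashing step from the Reformer entry can likewise be sketched. This shows only the angular-LSH bucketing of queries/keys; the function name and single-round default are assumptions, and the chunked within-bucket attention is omitted:

```python
import torch

def lsh_bucket_ids(vectors, n_buckets, n_rounds=1):
    """Hash each vector to a bucket id per hashing round by projecting onto
    random rotations and taking the argmax over [Rx, -Rx]; vectors with a
    small angle between them are likely to receive the same id, so attention
    can later be restricted to within-bucket pairs."""
    b, n, d = vectors.shape
    assert n_buckets % 2 == 0, "buckets come in +/- pairs"
    rotations = torch.randn(n_rounds, d, n_buckets // 2, device=vectors.device)
    rotated = torch.einsum('bnd,rdh->brnh', vectors, rotations)
    rotated = torch.cat([rotated, -rotated], dim=-1)   # (b, rounds, n, n_buckets)
    return rotated.argmax(dim=-1)                      # bucket id per token: (b, rounds, n)
```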