Linformer: Self-Attention with Linear Complexity
- URL: http://arxiv.org/abs/2006.04768v3
- Date: Sun, 14 Jun 2020 08:15:54 GMT
- Title: Linformer: Self-Attention with Linear Complexity
- Authors: Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma
- Abstract summary: Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications.
The standard self-attention mechanism of the Transformer uses $O(n^2)$ time and space with respect to sequence length.
We propose a new self-attention mechanism, which reduces the overall self-attention complexity from $O(n2)$ to $O(n)$ in both time and space.
- Score: 36.5703957318311
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large transformer models have shown extraordinary success in achieving
state-of-the-art results in many natural language processing applications.
However, training and deploying these models can be prohibitively costly for
long sequences, as the standard self-attention mechanism of the Transformer
uses $O(n^2)$ time and space with respect to sequence length. In this paper, we
demonstrate that the self-attention mechanism can be approximated by a low-rank
matrix. We further exploit this finding to propose a new self-attention
mechanism, which reduces the overall self-attention complexity from $O(n^2)$ to
$O(n)$ in both time and space. The resulting linear transformer, the
\textit{Linformer}, performs on par with standard Transformer models, while
being much more memory- and time-efficient.
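To make the low-rank idea concrete, below is a minimal single-head sketch in PyTorch. The class name, the random initialization of the projections E and F, and the single-head, unshared-projection setup are illustrative assumptions; the paper's multi-head formulation and parameter-sharing variants are omitted.

```python
import torch
import torch.nn as nn

class LinformerSelfAttention(nn.Module):
    """Single-head sketch: keys and values are projected from length n down to
    a fixed length k before attention, so the score matrix is n x k rather
    than n x n, giving linear time and memory in the sequence length."""

    def __init__(self, dim, seq_len, k=64):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # Learned low-rank projections E, F of shape (k, n); random init here
        # is an assumption for illustration.
        self.E = nn.Parameter(torch.randn(k, seq_len) / seq_len ** 0.5)
        self.F = nn.Parameter(torch.randn(k, seq_len) / seq_len ** 0.5)
        self.scale = dim ** -0.5

    def forward(self, x):                                          # x: (batch, n, dim)
        q = self.to_q(x)                                           # (b, n, d)
        keys = torch.einsum('kn,bnd->bkd', self.E, self.to_k(x))   # (b, k, d)
        values = torch.einsum('kn,bnd->bkd', self.F, self.to_v(x)) # (b, k, d)
        scores = torch.einsum('bnd,bkd->bnk', q, keys) * self.scale
        attn = scores.softmax(dim=-1)                              # attention over k projected positions
        return torch.einsum('bnk,bkd->bnd', attn, values)          # (b, n, d)
```

Because attention is computed over the $k$ projected positions rather than all $n$ tokens, memory and time grow linearly with sequence length for a fixed $k$.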
Related papers
- SAMformer: Unlocking the Potential of Transformers in Time Series Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention [14.672072173674039]
We show that transformers are incapable of converging to their true solution despite their high expressive power.
We propose a shallow lightweight transformer model that escapes bad local minima when optimized with sharpness-aware optimization.
In particular, SAMformer surpasses current state-of-the-art methods and is on par with the biggest foundation model MOIRAI while having significantly fewer parameters.
arXiv Detail & Related papers (2024-02-15T18:55:05Z)
- Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator [24.690247474891958]
Fourier Transformer is able to significantly reduce computational costs while retaining the ability to inherit from various large pretrained models.
Our model achieves state-of-the-art performance among all transformer-based models on the long-range modeling benchmark LRA.
For generative seq-to-seq tasks including CNN/DailyMail and ELI5, by inheriting the BART weights our model outperforms the standard BART.
arXiv Detail & Related papers (2023-05-24T12:33:06Z)
- Mega: Moving Average Equipped Gated Attention [150.3124713793503]
Mega is a simple, theoretically grounded, single-head gated attention mechanism equipped with (exponential) moving average.
We show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models.
arXiv Detail & Related papers (2022-09-21T20:52:17Z)
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers can inspire the design of such a factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
- Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
arXiv Detail & Related papers (2021-07-05T18:00:14Z)
- Revisiting Linformer with a modified self-attention with linear complexity [0.0]
I propose an alternative method for self-attention with linear complexity in time and space.
Since this method works for long sequences, it can be used for images as well as audio.
arXiv Detail & Related papers (2020-12-16T13:23:29Z)
- Long Range Arena: A Benchmark for Efficient Transformers [115.1654897514089]
The Long Range Arena benchmark is a suite of tasks consisting of sequences ranging from $1K$ to $16K$ tokens.
We systematically evaluate ten well-established long-range Transformer models on our newly proposed benchmark suite.
arXiv Detail & Related papers (2020-11-08T15:53:56Z)
- Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [22.228028613802174]
Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, they are prohibitively slow for very long sequences.
We make use of the associativity property of matrix products to reduce the complexity from $\mathcal{O}\left(N^2\right)$ to $\mathcal{O}\left(N\right)$, where $N$ is the sequence length.
Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000x faster on autoregressive prediction of very long sequences (a minimal sketch of the associativity trick appears after this list).
arXiv Detail & Related papers (2020-06-29T17:55:38Z)
- $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
- Reformer: The Efficient Transformer [21.425616422007543]
We introduce two techniques to improve the efficiency of Transformers.
We replace dot-product attention with one that uses locality-sensitive hashing, changing its complexity from $O(L^2)$ to $O(L \log L)$, where $L$ is the length of the sequence.
The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences (a minimal sketch of the hashing step appears after this list).
arXiv Detail & Related papers (2020-01-13T18:38:28Z)
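The associativity trick from the "Transformers are RNNs" entry above can be sketched as follows (PyTorch, non-causal case; the elu(x)+1 feature map is the one reported in that paper, while the function names and tensor shapes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def feature_map(x):
    # phi(x) = elu(x) + 1 keeps features positive so the normalizer is well defined.
    return F.elu(x) + 1

def linear_attention(q, k, v, eps=1e-6):
    """Non-causal linear attention: softmax(QK^T)V is replaced by
    phi(Q)(phi(K)^T V) normalized by phi(Q)(phi(K)^T 1). Evaluating the
    products right-to-left costs O(N d^2) instead of O(N^2 d)."""
    q, k = feature_map(q), feature_map(k)                          # (b, n, d)
    kv = torch.einsum('bnd,bne->bde', k, v)                        # phi(K)^T V, (b, d, e)
    z = 1.0 / (torch.einsum('bnd,bd->bn', q, k.sum(dim=1)) + eps)  # per-query normalizer
    return torch.einsum('bnd,bde,bn->bne', q, kv, z)               # (b, n, e)
```

Computing phi(K)^T V first yields a d x d summary that is reused for every query, which is what removes the quadratic dependence on sequence length.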
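The locality-sensitive hashing step from the Reformer entry can likewise be sketched. This shows only the angular-LSH bucketing of queries/keys; the function name and single-round default are assumptions, and the chunked within-bucket attention is omitted:

```python
import torch

def lsh_bucket_ids(vectors, n_buckets, n_rounds=1):
    """Hash each vector to a bucket id per hashing round by projecting onto
    random rotations and taking the argmax over [Rx, -Rx]; vectors with a
    small angle between them are likely to receive the same id, so attention
    can later be restricted to within-bucket pairs."""
    b, n, d = vectors.shape
    assert n_buckets % 2 == 0, "buckets come in +/- pairs"
    rotations = torch.randn(n_rounds, d, n_buckets // 2, device=vectors.device)
    rotated = torch.einsum('bnd,rdh->brnh', vectors, rotations)
    rotated = torch.cat([rotated, -rotated], dim=-1)   # (b, rounds, n, n_buckets)
    return rotated.argmax(dim=-1)                      # bucket id per token: (b, rounds, n)
```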