Reformer: The Efficient Transformer
- URL: http://arxiv.org/abs/2001.04451v2
- Date: Tue, 18 Feb 2020 16:01:18 GMT
- Title: Reformer: The Efficient Transformer
- Authors: Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya
- Abstract summary: We introduce two techniques to improve the efficiency of Transformers.
We replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O($L^2$) to O($L\log L$), where $L$ is the length of the sequence.
The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Transformer models routinely achieve state-of-the-art results on a
number of tasks but training these models can be prohibitively costly,
especially on long sequences. We introduce two techniques to improve the
efficiency of Transformers. For one, we replace dot-product attention by one
that uses locality-sensitive hashing, changing its complexity from O($L^2$) to
O($L\log L$), where $L$ is the length of the sequence. Furthermore, we use
reversible residual layers instead of the standard residuals, which allows
storing activations only once in the training process instead of $N$ times,
where $N$ is the number of layers. The resulting model, the Reformer, performs
on par with Transformer models while being much more memory-efficient and much
faster on long sequences.
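
As a concrete illustration of the two techniques, here is a minimal NumPy sketch: `lsh_bucket` implements the angular LSH rule the paper uses to assign queries/keys to buckets (argmax over random projections and their negations), and `reversible_forward`/`reversible_inverse` show the RevNet-style coupling that lets layer inputs be recomputed from outputs instead of being stored. The shapes, the toy `F`/`G` functions, and the demo at the bottom are assumptions for illustration; multi-round hashing, sorting and chunking of buckets, and the attention computation itself are omitted.

```python
import numpy as np

def lsh_bucket(x, n_buckets, rng):
    # Angular LSH: project onto n_buckets // 2 random directions and take the
    # argmax over the concatenated [+proj, -proj] scores. Nearby vectors fall
    # into the same bucket with high probability; attention is then restricted
    # to (sorted, chunked) buckets, giving O(L log L) overall. n_buckets is
    # assumed even here.
    d = x.shape[-1]
    R = rng.standard_normal((d, n_buckets // 2))
    proj = x @ R                                                # (L, n_buckets // 2)
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)  # (L,)

def reversible_forward(x1, x2, F, G):
    # RevNet-style coupling: y1 = x1 + F(x2), y2 = x2 + G(y1).
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def reversible_inverse(y1, y2, F, G):
    # Inputs are recovered exactly from outputs, so activations do not have to
    # be cached for every layer during backpropagation.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

# Toy demo (illustrative values only).
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
print(lsh_bucket(x, n_buckets=4, rng=rng))        # one bucket id per position
F = G = lambda t: np.tanh(t)                      # stand-ins for attention / feed-forward
y1, y2 = reversible_forward(x, x[::-1], F, G)
r1, r2 = reversible_inverse(y1, y2, F, G)
assert np.allclose(r1, x) and np.allclose(r2, x[::-1])
```
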
Related papers
- Transformers Learn Shortcuts to Automata [52.015990420075944]
We find that a low-depth Transformer can represent the computations of any finite-state automaton.
We show that a Transformer with $O(\log T)$ layers can exactly replicate the computation of an automaton on an input sequence of length $T$.
We further investigate the brittleness of these solutions and propose potential mitigations.
arXiv Detail & Related papers (2022-10-19T17:45:48Z)
- Block-Recurrent Transformers [49.07682696216708]
We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence.
Our recurrent cell operates on blocks of tokens rather than single tokens, and leverages parallel computation within a block in order to make efficient use of accelerator hardware.
arXiv Detail & Related papers (2022-03-11T23:44:33Z)
- Sparse is Enough in Scaling Transformers [12.561317511514469]
Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study becomes out of reach.
We propose Scaling Transformers, a family of next generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer.
arXiv Detail & Related papers (2021-11-24T19:53:46Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT); a generic sketch of the Toeplitz-FFT identity appears after this list.
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
- Shortformer: Better Language Modeling using Shorter Inputs [62.51758040848735]
We show that initially training the model on short subsequences, before moving on to longer ones, reduces overall training time.
We then show how to improve the efficiency of recurrence methods in transformers.
arXiv Detail & Related papers (2020-12-31T18:52:59Z)
- Sub-Linear Memory: How to Make Performers SLiM [38.068090269482425]
Vanilla Transformers require $O(L^2)$ in serial time and memory as functions of input length $L$.
Recent works proposed various linear self-attention mechanisms, scaling only as $O(L)$ for serial computation.
We observe a remarkable computational flexibility: forward and backward propagation can be performed with no approximations using sublinear memory.
arXiv Detail & Related papers (2020-12-21T13:56:04Z)
- Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [22.228028613802174]
Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, they are prohibitively slow for very long sequences.
We make use of the associativity property of matrix products to reduce the complexity from $\mathcal{O}\left(N^2\right)$ to $\mathcal{O}\left(N\right)$, where $N$ is the sequence length (see the associativity sketch after this list).
Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000x faster on autoregressive prediction of very long sequences.
arXiv Detail & Related papers (2020-06-29T17:55:38Z)
- $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
- Linformer: Self-Attention with Linear Complexity [36.5703957318311]
Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications.
The standard self-attention mechanism of the Transformer uses $O(n^2)$ time and space with respect to sequence length.
We propose a new self-attention mechanism, which reduces the overall self-attention complexity from $O(n^2)$ to $O(n)$ in both time and space.
arXiv Detail & Related papers (2020-06-08T17:37:52Z)
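
Two of the entries above rest on linear-algebra identities that are easy to verify in code. First, for the Transformers-are-RNNs paper: a minimal NumPy sketch (with an assumed $\mathrm{elu}(x)+1$ feature map) of how associativity removes the $N \times N$ matrix, computing $\phi(Q)\,(\phi(K)^\top V)$ instead of $(\phi(Q)\phi(K)^\top)V$. This illustrates the grouping trick only, not the authors' causal/autoregressive implementation.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a simple positive feature map (an assumption for this sketch).
    return np.where(x > 0, x + 1.0, np.exp(x))

def quadratic_attention(Q, K, V):
    # Materializes the (N, N) similarity matrix: O(N^2) time and memory.
    S = feature_map(Q) @ feature_map(K).T                 # (N, N)
    return (S / S.sum(axis=-1, keepdims=True)) @ V

def linear_attention(Q, K, V):
    # Same result, grouped as phi(Q) @ (phi(K).T @ V): O(N) in sequence length.
    Qp, Kp = feature_map(Q), feature_map(K)
    KV = Kp.T @ V                                         # (d, d_v), independent of N
    Z = Qp @ Kp.sum(axis=0)                               # per-position normalizer, (N,)
    return (Qp @ KV) / Z[:, None]

# Toy check that both groupings agree (illustrative sizes only).
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((32, 8)) for _ in range(3))
assert np.allclose(quadratic_attention(Q, K, V), linear_attention(Q, K, V))
```

Second, for the kernelized attention with RPE paper: the generic Toeplitz-times-vector identity it relies on. A Toeplitz matrix embeds into a circulant of twice the size, and a circulant matvec is a circular convolution, i.e. an elementwise product in the FFT domain, giving $O(n\log n)$. The sketch below shows the bare identity with assumed toy sizes, not the paper's attention kernel.

```python
import numpy as np

def toeplitz_matvec_fft(c, r, x):
    # Multiply the n x n Toeplitz matrix with first column c and first row r
    # (c[0] == r[0]) by x in O(n log n): embed it in a 2n circulant, whose
    # matvec is a circular convolution, computed via FFTs.
    n = len(c)
    col = np.concatenate([c, [0.0], r[:0:-1]])            # first column of the circulant
    y = np.concatenate([x, np.zeros(n)])
    return np.fft.ifft(np.fft.fft(col) * np.fft.fft(y))[:n].real

# Toy check against the dense Toeplitz product (illustrative sizes only).
rng = np.random.default_rng(0)
c, r, x = rng.standard_normal(6), rng.standard_normal(6), rng.standard_normal(6)
r[0] = c[0]
T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(6)] for i in range(6)])
assert np.allclose(T @ x, toeplitz_matvec_fft(c, r, x))
```
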