Rethinking Attention with Performers
- URL: http://arxiv.org/abs/2009.14794v3
- Date: Tue, 9 Mar 2021 16:26:47 GMT
- Title: Rethinking Attention with Performers
- Authors: Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou
Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz
Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller
- Abstract summary: We introduce Performers, Transformer architectures which can estimate full-rank-attention Transformers with provable accuracy.
Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods.
We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing the effectiveness of the novel attention-learning paradigm leveraged by Performers.
- Score: 45.47365397101224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Performers, Transformer architectures which can estimate regular
(softmax) full-rank-attention Transformers with provable accuracy, but using
only linear (as opposed to quadratic) space and time complexity, without
relying on any priors such as sparsity or low-rankness. To approximate softmax
attention-kernels, Performers use a novel Fast Attention Via positive
Orthogonal Random features approach (FAVOR+), which may be of independent
interest for scalable kernel methods. FAVOR+ can also be used to efficiently
model kernelizable attention mechanisms beyond softmax. This representational
power is crucial to accurately compare softmax with other kernels for the first
time on large-scale tasks, beyond the reach of regular Transformers, and
investigate optimal attention-kernels. Performers are linear architectures
fully compatible with regular Transformers and with strong theoretical
guarantees: unbiased or nearly-unbiased estimation of the attention matrix,
uniform convergence and low estimation variance. We tested Performers on a rich
set of tasks stretching from pixel-prediction through text models to protein
sequence modeling. We demonstrate competitive results with other examined
efficient sparse and dense attention methods, showcasing the effectiveness of the
novel attention-learning paradigm leveraged by Performers.
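As a concrete illustration of the mechanism described above, the following is a minimal NumPy sketch of FAVOR+-style attention: queries and keys are mapped through positive (orthogonal) random features whose dot products approximate exp(q·k) in expectation, and attention is then computed in time and space linear in the sequence length, without ever forming the n×n attention matrix. This is a simplified single-head, bidirectional sketch with illustrative function names, not the authors' implementation.

    import numpy as np

    def orthogonal_gaussian(m, d, rng):
        # Stack d x d orthogonal blocks (QR of Gaussian matrices), then rescale
        # each row to the norm of an independent Gaussian vector.
        blocks, rows = [], 0
        while rows < m:
            q, _ = np.linalg.qr(rng.standard_normal((d, d)))
            blocks.append(q)
            rows += d
        w = np.concatenate(blocks, axis=0)[:m]
        norms = np.linalg.norm(rng.standard_normal((m, d)), axis=1, keepdims=True)
        return norms * w

    def favor_features(x, w):
        # Positive random features with E[phi(q) . phi(k)] = exp(q . k).
        m = w.shape[0]
        return np.exp(x @ w.T - 0.5 * np.sum(x * x, axis=-1, keepdims=True)) / np.sqrt(m)

    def performer_attention(q, k, v, m=256, seed=0):
        # Linear-time approximation of softmax attention:
        # out = (phi(Q) (phi(K)^T V)) / (phi(Q) (phi(K)^T 1)).
        n, d = q.shape
        w = orthogonal_gaussian(m, d, np.random.default_rng(seed))
        scale = d ** -0.25                    # folds in the 1/sqrt(d) of softmax attention
        qp, kp = favor_features(q * scale, w), favor_features(k * scale, w)
        kv = kp.T @ v                         # (m, d_v), computed once
        normalizer = qp @ kp.sum(axis=0)      # (n,)
        return (qp @ kv) / normalizer[:, None]

For small inputs the output can be checked directly against exact softmax attention; the approximation improves as the number of random features m grows.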
Related papers
- Theory, Analysis, and Best Practices for Sigmoid Self-Attention [16.73166377436999]
We revisit sigmoid attention and conduct an in-depth theoretical and empirical analysis.
We prove that transformers with sigmoid attention are universal function approximators.
We introduce FLASHSIGMOID, a hardware-aware and memory-efficient implementation of sigmoid attention.
arXiv Detail & Related papers (2024-09-06T17:53:26Z)
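For contrast with softmax attention, the entry above can be illustrated with a short sketch of elementwise sigmoid attention; the default bias of -log(n) is an assumption (a commonly cited choice for keeping row mass comparable to softmax as sequence length grows), and the hardware-aware FLASHSIGMOID kernel itself is not reproduced here.

    import numpy as np

    def softmax_attention(q, k, v):
        scores = q @ k.T / np.sqrt(q.shape[-1])
        a = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return (a / a.sum(axis=-1, keepdims=True)) @ v

    def sigmoid_attention(q, k, v, bias=None):
        # Replace the row-wise softmax with an elementwise sigmoid; there is no
        # normalization across keys. bias defaults to -log(n) (an assumption)
        # so that row mass stays comparable to softmax as n grows.
        n, d = q.shape
        if bias is None:
            bias = -np.log(n)
        sig = 1.0 / (1.0 + np.exp(-(q @ k.T / np.sqrt(d) + bias)))
        return sig @ v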
- FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without compromising on sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z)
- Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost [53.746169882193456]
Recent works have proposed various sparse attention modules to overcome the quadratic cost of self-attention.
We propose a model that resolves both problems by endowing each attention head with a mixed-membership Block Model.
Our model outperforms previous efficient variants as well as the original Transformer with full attention.
arXiv Detail & Related papers (2022-10-27T15:30:52Z)
- Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
- SOFT: Softmax-free Transformer with Linear Complexity [112.9754491864247]
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention.
Various attempts on approximating the self-attention with linear complexity have been made in Natural Language Processing.
We find that their limitations stem from retaining softmax self-attention during approximation.
For the first time, a softmax-free transformer or SOFT is proposed.
arXiv Detail & Related papers (2021-10-22T17:57:29Z)
- Predicting Attention Sparsity in Transformers [0.9786690381850356]
We propose Sparsefinder, a model trained to identify the sparsity pattern of entmax attention before computing it.
Our work provides a new angle to study model efficiency by doing extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph.
arXiv Detail & Related papers (2021-09-24T20:51:21Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
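The FFT argument in the entry above rests on a standard identity: a Toeplitz matrix can be embedded into a circulant matrix of twice the size, and a circulant matrix-vector product is a circular convolution that the FFT computes in O(n log n). The sketch below shows only this generic identity, not the paper's kernelized attention itself.

    import numpy as np

    def toeplitz_matvec(first_col, first_row, x):
        # Multiply an n x n Toeplitz matrix T (T[i, j] = first_col[i - j] for
        # i >= j, first_row[j - i] otherwise) by x in O(n log n), by embedding
        # T in a 2n x 2n circulant matrix; assumes first_col[0] == first_row[0].
        n = len(x)
        c = np.concatenate([first_col, [0.0], first_row[:0:-1]])  # circulant's first column
        y = np.concatenate([x, np.zeros(n)])
        return np.fft.ifft(np.fft.fft(c) * np.fft.fft(y)).real[:n]

    # Quick check against the dense O(n^2) product.
    rng = np.random.default_rng(0)
    n = 8
    col, row = rng.standard_normal(n), rng.standard_normal(n)
    row[0] = col[0]
    T = np.array([[col[i - j] if i >= j else row[j - i] for j in range(n)] for i in range(n)])
    x = rng.standard_normal(n)
    assert np.allclose(T @ x, toeplitz_matvec(col, row, x))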
- Random Feature Attention [69.4671822971207]
We propose RFA, a linear time and space attention that uses random feature methods to approximate the softmax function.
RFA can be used as a drop-in replacement for conventional softmax attention and offers a straightforward way of learning with recency bias through an optional gating mechanism.
Experiments on language modeling and machine translation demonstrate that RFA achieves similar or better performance compared to strong transformer baselines.
arXiv Detail & Related papers (2021-03-03T02:48:56Z)
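In the causal setting, random-feature attention of the kind described above reduces to prefix sums over feature-mapped keys and values, which is what makes it a drop-in linear-time replacement. The sketch below assumes queries and keys have already been passed through a feature map (for example the FAVOR+ sketch after the abstract) and uses a single scalar gate as a deliberately simplified stand-in for RFA's learned, per-step gating.

    import numpy as np

    def causal_linear_attention(qp, kp, v, gate=1.0):
        # qp, kp: (n, m) feature-mapped queries and keys; v: (n, d_v).
        # gate = 1.0 recovers plain causal linear attention; gate < 1.0
        # exponentially down-weights older tokens (a recency bias).
        n, m = qp.shape
        S = np.zeros((m, v.shape[-1]))   # running sum of phi(k_t) v_t^T
        z = np.zeros(m)                  # running sum of phi(k_t)
        out = np.empty_like(v)
        for t in range(n):
            S = gate * S + np.outer(kp[t], v[t])
            z = gate * z + kp[t]
            out[t] = (qp[t] @ S) / (qp[t] @ z)
        return out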
- Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers [42.93754828584075]
We present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR).
Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors.
It provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence.
arXiv Detail & Related papers (2020-06-05T17:09:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.