Stable, Fast and Accurate: Kernelized Attention with Relative Positional
Encoding
- URL: http://arxiv.org/abs/2106.12566v1
- Date: Wed, 23 Jun 2021 17:51:26 GMT
- Title: Stable, Fast and Accurate: Kernelized Attention with Relative Positional
Encoding
- Authors: Shengjie Luo, Shanda Li, Tianle Cai, Di He, Dinglan Peng, Shuxin
Zheng, Guolin Ke, Liwei Wang, Tie-Yan Liu
- Abstract summary: We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using Fast Fourier Transform (FFT).
- Score: 63.539333383965726
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The attention module, which is a crucial component in Transformer, cannot
scale efficiently to long sequences due to its quadratic complexity. Many works
focus on approximating the dot-then-exponentiate softmax function in the
original attention, leading to sub-quadratic or even linear-complexity
Transformer architectures. However, we show that these methods cannot be
applied to more powerful attention modules that go beyond the
dot-then-exponentiate style, e.g., Transformers with relative positional
encoding (RPE). Since relative positional encoding is used by default in many
state-of-the-art models, designing efficient Transformers that can
incorporate RPE is appealing. In this paper, we propose a novel way to
accelerate attention calculation for Transformers with RPE on top of the
kernelized attention. Based upon the observation that relative positional
encoding forms a Toeplitz matrix, we mathematically show that kernelized
attention with RPE can be calculated efficiently using Fast Fourier Transform
(FFT). With FFT, our method achieves $\mathcal{O}(n\log n)$ time complexity.
Interestingly, we further demonstrate that properly using relative positional
encoding can mitigate the training instability problem of vanilla kernelized
attention. On a wide range of tasks, we empirically show that our models can be
trained from scratch without any optimization issues. The learned model
performs better than many efficient Transformer variants and is faster than
the standard Transformer in the long-sequence regime.
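The FFT claim rests on a standard fact: because the RPE bias depends only on the offset i - j, the bias matrix is Toeplitz, and a Toeplitz matrix-vector product can be embedded in a circulant product that the FFT diagonalizes, turning $\mathcal{O}(n^2)$ work into $\mathcal{O}(n\log n)$. Below is a minimal NumPy sketch of that building block only; it is not the authors' implementation, and the bias values and variable names are illustrative.

```python
import numpy as np

def toeplitz_matvec_fft(first_col, first_row, v):
    """Multiply a Toeplitz matrix T (given by its first column and first row)
    by a vector v in O(n log n) time via circulant embedding and the FFT."""
    n = len(v)
    # (2n-1)-dimensional circulant embedding: T's first column, followed by
    # T's first row reversed (dropping the shared T[0, 0] entry).
    c = np.concatenate([first_col, first_row[1:][::-1]])
    # A circulant matrix is diagonalized by the DFT, so multiplying by it is
    # element-wise multiplication in the frequency domain.
    v_pad = np.concatenate([v, np.zeros(n - 1)])
    out = np.fft.ifft(np.fft.fft(c) * np.fft.fft(v_pad))
    return out[:n].real

# Sanity check against the O(n^2) dense product.
n = 8
rel_bias = np.random.randn(2 * n - 1)   # one bias value per relative offset i - j
T = np.array([[rel_bias[i - j + n - 1] for j in range(n)] for i in range(n)])
v = np.random.randn(n)
assert np.allclose(toeplitz_matvec_fft(T[:, 0], T[0, :], v), T @ v)
```

In the paper's setting, products of this form (with the Toeplitz matrix built from the RPE terms) stand in for the explicit n x n attention computation; the sketch above only illustrates the FFT trick itself.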
Related papers
- PRformer: Pyramidal Recurrent Transformer for Multivariate Time Series Forecasting [82.03373838627606]
The self-attention mechanism in the Transformer architecture requires positional embeddings to encode temporal order in time series prediction.
We argue that this reliance on positional embeddings restricts the Transformer's ability to effectively represent temporal sequences.
We present a model integrating PRE with a standard Transformer encoder, demonstrating state-of-the-art performance on various real-world datasets.
arXiv Detail & Related papers (2024-08-20T01:56:07Z) - Functional Interpolation for Relative Positions Improves Long Context
Transformers [86.12843093589]
We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts.
We theoretically prove that FIRE can represent some of the popular relative position encodings, such as T5's RPE, Alibi, and Kerple.
We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
arXiv Detail & Related papers (2023-10-06T17:59:11Z) - Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions.
We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.
We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions needed for universal approximation.
arXiv Detail & Related papers (2022-05-26T14:51:30Z) - FastRPB: a Scalable Relative Positional Encoding for Long Sequence Tasks [0.2538209532048867]
We introduce FastRPB, which efficiently adds positional information to self-attention.
FastRPB has O(N log(N)) computational complexity, requiring O(N) memory w.r.t. input sequence length N.
arXiv Detail & Related papers (2022-02-23T09:12:00Z) - Transformer Acceleration with Dynamic Sparse Attention [20.758709319088865]
We propose the Dynamic Sparse Attention (DSA) that can efficiently exploit the dynamic sparsity in the attention of Transformers.
Our approach can achieve better trade-offs between accuracy and model complexity.
arXiv Detail & Related papers (2021-10-21T17:31:57Z) - Fastformer: Additive Attention Can Be All You Need [51.79399904527525]
We propose Fastformer, which is an efficient Transformer model based on additive attention.
In Fastformer, instead of modeling the pair-wise interactions between tokens, we first use an additive attention mechanism to model global contexts.
In this way, Fastformer can achieve effective context modeling with linear complexity.
arXiv Detail & Related papers (2021-08-20T09:44:44Z) - Relative Positional Encoding for Transformers with Linear Complexity [30.48367640796256]
Relative positional encoding (RPE) was proposed as beneficial for classical Transformers.
RPE is not available for the recent linear-variants of the Transformer, because it requires the explicit computation of the attention matrix.
In this paper, we present a way to generate PE that can be used as a replacement for the classical additive (sinusoidal) PE and provably behaves like RPE.
arXiv Detail & Related papers (2021-05-18T09:52:32Z) - FNet: Mixing Tokens with Fourier Transforms [0.578717214982749]
We show that Transformer encoder architectures can be massively sped up with limited accuracy costs.
We replace the self-attention sublayers with simple linear transformations that "mix" input tokens.
The resulting model, which we name FNet, scales very efficiently to long inputs.
arXiv Detail & Related papers (2021-05-09T03:32:48Z) - Random Feature Attention [69.4671822971207]
We propose RFA, a linear time and space attention that uses random feature methods to approximate the softmax function.
RFA can be used as a drop-in replacement for conventional softmax attention and offers a straightforward way of learning with recency bias through an optional gating mechanism.
Experiments on language modeling and machine translation demonstrate that RFA achieves similar or better performance compared to strong transformer baselines; a minimal sketch of the random-feature idea appears after this list.
arXiv Detail & Related papers (2021-03-03T02:48:56Z)
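For context on the kernelized attention that the paper above accelerates (and that RFA exemplifies), the core idea is to replace exp(q·k) with an inner product of randomized feature maps phi(q)·phi(k), so that attention can be computed as phi(Q)(phi(K)^T V) without ever forming the n x n matrix. The sketch below uses the positive random-feature construction rather than RFA's trigonometric features; it is an illustration only, and all names and parameters are assumptions.

```python
import numpy as np

def positive_random_features(x, W):
    """Feature map with E[phi(q) @ phi(k)] = exp(q @ k) (the softmax kernel).
    W has shape (m, d), rows drawn i.i.d. from N(0, I)."""
    m = W.shape[0]
    return np.exp(x @ W.T - 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)) / np.sqrt(m)

def kernelized_attention(Q, K, V, num_features=256, seed=0):
    """Approximate softmax(Q K^T) V in O(n * m * d) time; the n x n attention
    matrix is never materialized."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_features, Q.shape[-1]))
    phi_q, phi_k = positive_random_features(Q, W), positive_random_features(K, W)
    numer = phi_q @ (phi_k.T @ V)                        # phi(Q) (phi(K)^T V)
    denom = phi_q @ phi_k.sum(axis=0, keepdims=True).T   # row-wise normalizer
    return numer / denom

# Compare against exact softmax attention on a small example.
n, d = 64, 16
rng = np.random.default_rng(0)
Q, K = (rng.standard_normal((n, d)) / d ** 0.5 for _ in range(2))
V = rng.standard_normal((n, d))
scores = np.exp(Q @ K.T)
exact = (scores / scores.sum(axis=1, keepdims=True)) @ V
approx = kernelized_attention(Q, K, V, num_features=4096)
print(np.max(np.abs(exact - approx)))   # shrinks as num_features grows
```

This vanilla kernelized approximation is the one the abstract above reports as prone to training instability, which the paper mitigates by incorporating relative positional encoding.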