FastRPB: a Scalable Relative Positional Encoding for Long Sequence Tasks
- URL: http://arxiv.org/abs/2202.11364v1
- Date: Wed, 23 Feb 2022 09:12:00 GMT
- Title: FastRPB: a Scalable Relative Positional Encoding for Long Sequence Tasks
- Authors: Maksim Zubkov, Daniil Gavrilov
- Abstract summary: We introduce FastRPB, which efficiently adds positional information to self-attention.
FastRPB has O(N log(N)) computational complexity, requiring O(N) memory w.r.t. input sequence length N.
- Score: 0.2538209532048867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers achieve remarkable performance in various domains, including
NLP, CV, audio processing, and graph analysis. However, they do not scale well
on long sequence tasks due to their quadratic complexity w.r.t. the input
length. Linear Transformers were proposed to address this limitation. However,
these models have shown weaker performance on long sequence tasks compared
to the original Transformer. In this paper, we explore Linear Transformer models,
rethinking their two core components. First, we improve the Linear Transformer
with a Shift-Invariant Kernel Function (SIKF), which achieves higher accuracy
without loss in speed. Second, we introduce FastRPB, short for Fast
Relative Positional Bias, which efficiently adds positional information to
self-attention using the Fast Fourier Transform. FastRPB is independent of the
self-attention mechanism and can be combined with the original self-attention
and all its efficient variants. FastRPB has O(N log(N)) computational
complexity, requiring O(N) memory w.r.t. input sequence length N.
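As a concrete illustration of that complexity claim, here is a minimal numpy sketch (illustrative, not the authors' code) of the underlying trick: a relative positional bias B[i, j] = b[i - j] forms a Toeplitz matrix, so B @ V can be computed in O(N log(N)) by embedding B in a circulant matrix and using the FFT. The indexing convention for the bias vector is an assumption made for this sketch.

```python
# Minimal sketch (not the authors' implementation): a relative positional
# bias B[i, j] = b[i - j] is Toeplitz, so B @ V costs O(N log N) via a
# circulant embedding and FFT, without materializing the N x N matrix.
import numpy as np

def toeplitz_matvec_fft(bias, v):
    """Multiply B[i, j] = b[i - j] by v using the FFT.

    bias: length 2N - 1, with bias[k + N - 1] = b[k] for k in [-(N-1), N-1]
          (an assumed indexing convention for this sketch).
    v:    shape (N, d), e.g. the value matrix of one attention head.
    """
    n = v.shape[0]
    # First column of the circulant embedding:
    # [b[0], b[1], ..., b[N-1], b[-(N-1)], ..., b[-1]].
    col = np.concatenate([bias[n - 1:], bias[:n - 1]])
    fc = np.fft.fft(col)
    fv = np.fft.fft(v, n=2 * n - 1, axis=0)  # zero-pad to length 2N - 1
    out = np.fft.ifft(fc[:, None] * fv, axis=0).real
    return out[:n]

# Sanity check against the explicit O(N^2) construction.
rng = np.random.default_rng(0)
N, d = 64, 8
bias = rng.normal(size=2 * N - 1)
V = rng.normal(size=(N, d))
B = np.array([[bias[i - j + N - 1] for j in range(N)] for i in range(N)])
assert np.allclose(toeplitz_matvec_fft(bias, V), B @ V)
```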
Related papers
- A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining [28.336502115532905]
This paper proposes a coherent sequence length adaptive algorithm-hardware co-design for Transformer acceleration.
We develop a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm.
Our design incurs very small accuracy loss and achieves 80.2x and 2.6x speedups over CPU and GPU implementations, respectively.
arXiv Detail & Related papers (2022-08-07T05:48:38Z)
- Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extracts hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z)
- What Dense Graph Do You Need for Self-Attention? [73.82686008622596]
We present Hypercube Transformer, a sparse Transformer that models token interactions in a hypercube and shows comparable or even better results with vanilla Transformer.
Experiments on tasks requiring various sequence lengths validate our graph function.
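One natural reading of the hypercube graph (our illustration, not code from the paper) is that token i attends to token j exactly when their binary indices differ in one bit, giving O(N log N) edges:

```python
# Hedged illustration of a hypercube attention mask: tokens are vertices
# of a hypercube, connected when their indices differ in exactly one bit.
import numpy as np

def hypercube_mask(n):
    """Boolean (n, n) attention mask; n is assumed to be a power of two."""
    idx = np.arange(n)
    diff = idx[:, None] ^ idx[None, :]                   # XOR of indices
    one_bit = (diff != 0) & ((diff & (diff - 1)) == 0)   # exactly one bit set
    return one_bit | np.eye(n, dtype=bool)               # keep self-loops

mask = hypercube_mask(8)
print(mask.sum(axis=1))  # every token sees log2(8) = 3 neighbours + itself
```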
arXiv Detail & Related papers (2022-05-27T14:36:55Z)
- Linearizing Transformer with Key-Value Memory Bank [54.83663647680612]
We propose MemSizer, an approach that projects the source sequence into a lower-dimensional representation.
MemSizer not only achieves the same linear time complexity but also enjoys efficient recurrent-style autoregressive generation.
We demonstrate that MemSizer provides an improved tradeoff between efficiency and accuracy over the vanilla transformer.
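The abstract only names the mechanism; as a loose, generic sketch of a key-value memory bank (hypothetical weights P, Wk, Wv; this is not MemSizer's actual parameterization, which the paper details), attention against k compressed slots costs O(Nk) rather than O(N^2):

```python
# Generic key-value memory-bank sketch (hypothetical parameterization):
# the length-N source is projected down to k memory slots, so each query
# attends over k slots instead of N tokens.
import numpy as np

def memory_bank_attention(q, x, Wk, Wv, P):
    """q: (N, d) queries; x: (N, d) source tokens;
    Wk, Wv: (d, d) key/value maps; P: (k, N) length-compression map."""
    k_mem = P @ (x @ Wk)                        # (k, d) compressed keys
    v_mem = P @ (x @ Wv)                        # (k, d) compressed values
    scores = q @ k_mem.T / np.sqrt(q.shape[1])  # (N, k) instead of (N, N)
    scores -= scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ v_mem                         # (N, d)

rng = np.random.default_rng(1)
N, d, k = 128, 16, 8
out = memory_bank_attention(rng.normal(size=(N, d)), rng.normal(size=(N, d)),
                            rng.normal(size=(d, d)), rng.normal(size=(d, d)),
                            rng.normal(size=(k, N)) / np.sqrt(N))
print(out.shape)  # (128, 16)
```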
arXiv Detail & Related papers (2022-03-23T18:10:18Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
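For context on what "kernelized attention" refers to here, a minimal sketch of the standard linear-attention regrouping: with a feature map phi, (phi(Q) phi(K)^T) V can be computed as phi(Q) (phi(K)^T V) in O(N d^2); an additive N x N RPE matrix breaks this regrouping, which is what the Toeplitz/FFT observation repairs. The feature map below is a generic illustrative choice:

```python
# Minimal sketch of kernelized (linear) attention: regrouping the matrix
# product avoids ever forming the N x N attention matrix.
import numpy as np

def linear_attention(q, k, v, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    qf, kf = phi(q), phi(k)          # (N, d) nonnegative feature maps
    kv = kf.T @ v                    # (d, d) summary, costs O(N d^2)
    z = qf @ kf.sum(axis=0)          # (N,) normalizer
    return (qf @ kv) / z[:, None]    # (N, d)

rng = np.random.default_rng(2)
N, d = 256, 32
q, k, v = (rng.normal(size=(N, d)) for _ in range(3))
print(linear_attention(q, k, v).shape)  # (256, 32)
```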
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
- FNet: Mixing Tokens with Fourier Transforms [0.578717214982749]
We show that Transformer encoder architectures can be massively sped up with limited accuracy costs.
We replace the self-attention sublayers with simple linear transformations that "mix" input tokens.
The resulting model, which we name FNet, scales very efficiently to long inputs.
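The mixing the abstract describes is simple enough to state directly; a minimal parameter-free sketch of the Fourier token-mixing sublayer (the full FNet block adds the usual residual, norm, and feed-forward parts):

```python
# Minimal sketch of FNet's token-mixing sublayer: a Fourier transform over
# the hidden dimension, then over the sequence dimension, keeping the real
# part. No learned parameters are involved.
import numpy as np

def fnet_mixing(x):
    """x: (N, d) token embeddings -> (N, d) mixed tokens."""
    return np.fft.fft(np.fft.fft(x, axis=-1), axis=0).real

x = np.random.default_rng(3).normal(size=(128, 64))
print(fnet_mixing(x).shape)  # (128, 64)
```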
arXiv Detail & Related papers (2021-05-09T03:32:48Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
- Random Feature Attention [69.4671822971207]
We propose RFA, a linear time and space attention that uses random feature methods to approximate the softmax function.
RFA can be used as a drop-in replacement for conventional softmax attention and offers a straightforward way of learning with recency bias through an optional gating mechanism.
Experiments on language modeling and machine translation demonstrate that RFA achieves similar or better performance compared to strong transformer baselines.
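As an illustration of the random-feature machinery RFA builds on (classic random Fourier features for the Gaussian kernel; the paper's exact normalization and optional gating are omitted here):

```python
# Hedged sketch: random Fourier features make phi(q) . phi(k) an unbiased
# estimate of the Gaussian kernel exp(-||q - k||^2 / 2), so attention
# weights can be replaced by feature inner products computed in linear time.
import numpy as np

def random_features(x, W):
    """phi(x) = sqrt(1/D) [sin(xW); cos(xW)] with W ~ N(0, I)."""
    proj = x @ W                                            # (N, D)
    return np.concatenate([np.sin(proj), np.cos(proj)],
                          axis=-1) / np.sqrt(W.shape[1])

rng = np.random.default_rng(4)
N, d, D = 128, 16, 256
W = rng.normal(size=(d, D))
q = rng.normal(size=(N, d)); q /= np.linalg.norm(q, axis=1, keepdims=True)
k = rng.normal(size=(N, d)); k /= np.linalg.norm(k, axis=1, keepdims=True)
approx = random_features(q, W) @ random_features(k, W).T   # (N, N)
exact = np.exp(-np.sum((q[:, None] - k[None, :]) ** 2, axis=-1) / 2)
print(np.abs(approx - exact).max())  # error shrinks as D grows
```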
arXiv Detail & Related papers (2021-03-03T02:48:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.