You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli
Sampling
- URL: http://arxiv.org/abs/2111.09714v1
- Date: Thu, 18 Nov 2021 14:24:34 GMT
- Title: You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli
Sampling
- Authors: Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn
Fung, Vikas Singh
- Abstract summary: We show that a Bernoulli sampling attention mechanism based on Locality Sensitive Hashing (LSH) decreases the quadratic complexity of such models to linear.
We evaluate our algorithm on the GLUE benchmark with standard 512 sequence length where we see favorable performance relative to a standard pretrained Transformer.
- Score: 38.34914626128062
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models are widely used in natural language processing
(NLP). Central to the transformer model is the self-attention mechanism, which
captures the interactions of token pairs in the input sequences and depends
quadratically on the sequence length. Training such models on longer sequences
is expensive. In this paper, we show that a Bernoulli sampling attention
mechanism based on Locality Sensitive Hashing (LSH) decreases the quadratic
complexity of such models to linear. We bypass the quadratic cost by
considering self-attention as a sum of individual tokens associated with
Bernoulli random variables that can, in principle, be sampled at once by a
single hash (although in practice, this number may be a small constant). This
leads to an efficient sampling scheme to estimate self-attention which relies
on specific modifications of LSH (to enable deployment on GPU architectures).
We evaluate our algorithm on the GLUE benchmark with standard 512 sequence
length where we see favorable performance relative to a standard pretrained
Transformer. On the Long Range Arena (LRA) benchmark, for evaluating
performance on long sequences, our method achieves results consistent with
softmax self-attention but with sizable speed-ups and memory savings and often
outperforms other efficient self-attention methods. Our code is available at
https://github.com/mlpen/YOSO
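For intuition, here is a minimal NumPy sketch of the idea described above: each attention weight is treated as a Bernoulli variable whose success probability is an LSH collision probability, so counting hash collisions over a small, fixed number of hashes gives a linear-cost Monte Carlo estimate of the attention output. This is an illustrative sketch only, not the authors' GPU implementation (see the repository above); the function names, the plain SimHash (sign random projection) hashing, and the final self-normalization are assumptions made for exposition.

```python
# Illustrative LSH-based Bernoulli-sampled attention (not the authors' GPU kernel).
import numpy as np

def simhash_codes(x, planes):
    # Sign-random-projection hash: pack the sign pattern of x @ planes into integer bucket ids.
    bits = (x @ planes) > 0                              # (n, tau) booleans
    return bits @ (1 << np.arange(planes.shape[1]))      # (n,) integer codes

def lsh_bernoulli_attention(Q, K, V, num_hashes=16, tau=8, seed=0):
    rng = np.random.default_rng(seed)
    n, d = Q.shape
    # Unit-normalize so the collision probability depends only on the angle
    # between a query and a key (more similar pairs collide more often).
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=1, keepdims=True)
    out = np.zeros((n, V.shape[1]))
    norm = np.zeros(n)
    for _ in range(num_hashes):                          # a small constant in practice
        planes = rng.standard_normal((d, tau))
        q_code = simhash_codes(Qn, planes)
        k_code = simhash_codes(Kn, planes)
        # Linear cost: aggregate values per hash bucket once, then each query
        # reads only the bucket matching its own code. A collision plays the
        # role of a successful Bernoulli draw for that query-key pair.
        bucket_sum = np.zeros((1 << tau, V.shape[1]))
        np.add.at(bucket_sum, k_code, V)
        bucket_count = np.bincount(k_code, minlength=1 << tau)
        out += bucket_sum[q_code]
        norm += bucket_count[q_code]
    return out / np.clip(norm, 1e-9, None)[:, None]
```

With the number of hashes held fixed, time and memory grow linearly with sequence length, which matches the claim in the abstract; the actual method additionally handles normalization and training through the sampling step, which this sketch omits.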
Related papers
- Attention as an RNN [66.5420926480473]
We show that attention can be viewed as a special Recurrent Neural Network (RNN) with the ability to compute its many-to-one RNN output efficiently.
We introduce a new efficient method of computing attention's many-to-many RNN output based on the parallel prefix scan algorithm (a minimal sketch of the recurrence view appears after this list).
We show Aarens achieve comparable performance to Transformers on 38 datasets spread across four popular sequential problem settings.
arXiv Detail & Related papers (2024-05-22T19:45:01Z)
- LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
Self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction as a length-fixed codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps to address the lack of long-range dependency modeling.
arXiv Detail & Related papers (2024-04-17T08:26:34Z)
- SPEED: Speculative Pipelined Execution for Efficient Decoding [35.45955948053644]
We propose SPEED, which improves inference efficiency by speculatively executing multiple future tokens in parallel with the current token.
For Transformer decoders that employ parameter sharing, the memory operations for the tokens executing in parallel can be amortized.
We demonstrate the efficiency of our method in terms of latency reduction relative to model accuracy, and show how speculation allows training deeper decoders with parameter sharing at minimal runtime overhead.
arXiv Detail & Related papers (2023-10-18T16:07:01Z)
- Toeplitz Neural Network for Sequence Modeling [46.04964190407727]
We show that a Toeplitz matrix-vector product trick can reduce the space-time complexity of sequence modeling to log-linear (a minimal FFT-based sketch appears after this list).
A lightweight sub-network called relative position encoder is proposed to generate relative position coefficients with a fixed budget of parameters.
Despite being trained on 512-token sequences, our model can extrapolate input sequence length up to 14K tokens in inference with consistent performance.
arXiv Detail & Related papers (2023-05-08T14:49:01Z)
- Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extracts hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z)
- Pyramid-BERT: Reducing Complexity via Successive Core-set based Token Selection [23.39962989492527]
Transformer-based language models such as BERT have achieved the state-of-the-art on various NLP tasks, but are computationally prohibitive.
We present Pyramid-BERT, where we replace previously used heuristics with a core-set based token selection method justified by theoretical results.
The core-set based token selection technique allows us to avoid expensive pre-training, gives space-efficient fine-tuning, and thus makes it suitable for handling longer sequence lengths.
arXiv Detail & Related papers (2022-03-27T19:52:01Z)
- Sketching as a Tool for Understanding and Accelerating Self-attention for Long Sequences [52.6022911513076]
Transformer-based models are not efficient in processing long sequences due to the quadratic space and time complexity of the self-attention modules.
Linformer and Informer reduce the quadratic complexity to linear (modulo logarithmic factors) via low-dimensional projection and row selection, respectively.
Based on the theoretical analysis, we propose Skeinformer to accelerate self-attention and further improve the accuracy of matrix approximation to self-attention.
arXiv Detail & Related papers (2021-12-10T06:58:05Z)
- Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention [60.043273122786005]
We propose Nyströmformer -- a model that exhibits favorable scalability as a function of sequence length.
The scalability of Nyströmformer enables application to longer sequences with thousands of tokens (a minimal Nyström-attention sketch appears after this list).
We perform evaluations on multiple downstream tasks on the GLUE benchmark and IMDB reviews with standard sequence length, and find that our Nyströmformer performs comparably, or in a few cases, even slightly better, than a standard Transformer.
arXiv Detail & Related papers (2021-02-07T20:06:59Z)
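For the "Attention as an RNN" entry above, the sketch below shows softmax attention for a single query computed as a numerically stable recurrence (the many-to-one RNN view). The paper's many-to-many parallel prefix scan is not reproduced here, and the function name and random data are illustrative assumptions.

```python
# Softmax attention for one query as an RNN-style (online softmax) recurrence.
import numpy as np

def attention_as_recurrence(q, K, V):
    m = -np.inf                    # running max of scores (numerical stability)
    c = 0.0                        # running softmax normalizer
    num = np.zeros(V.shape[1])     # running weighted sum of values
    for k_t, v_t in zip(K, V):
        s = float(q @ k_t)
        m_new = max(m, s)
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        c = c * scale + np.exp(s - m_new)
        num = num * scale + np.exp(s - m_new) * v_t
        m = m_new
    return num / c

# Agrees with dense softmax attention on random data.
rng = np.random.default_rng(0)
q, K, V = rng.standard_normal(8), rng.standard_normal((32, 8)), rng.standard_normal((32, 4))
scores = q @ K.T
w = np.exp(scores - scores.max()); w /= w.sum()
assert np.allclose(attention_as_recurrence(q, K, V), w @ V)
```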
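For the Toeplitz Neural Network entry, the sketch below shows the FFT-based Toeplitz matrix-vector product that yields the log-linear cost; the relative position encoder sub-network is not covered, and the function name and the dense check are illustrative assumptions.

```python
# O(n log n) Toeplitz matrix-vector product via circulant embedding and the FFT.
import numpy as np

def toeplitz_matvec(c, r, x):
    # c: first column, r: first row (r[0] must equal c[0]), x: input vector.
    n = len(c)
    assert np.isclose(c[0], r[0])
    col = np.concatenate([c, r[:0:-1]])          # first column of a (2n-1)-dim circulant
    x_pad = np.concatenate([x, np.zeros(n - 1)])
    y = np.fft.ifft(np.fft.fft(col) * np.fft.fft(x_pad))
    return y[:n].real

# Check against an explicit dense Toeplitz matrix (illustrative random data).
n = 6
rng = np.random.default_rng(0)
c, r, x = rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(n)
r[0] = c[0]
T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(n)] for i in range(n)])
assert np.allclose(T @ x, toeplitz_matvec(c, r, x))
```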
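For the Nyströmformer entry, the sketch below shows a Nyström-style approximation of softmax attention using segment-mean landmarks; np.linalg.pinv stands in for the paper's iterative pseudoinverse, and the landmark count (and the assumption that it divides the sequence length) are illustrative choices.

```python
# Nystrom-style approximation of softmax attention with landmark (segment-mean)
# queries and keys; cost is O(n * m * d) for m landmarks instead of O(n^2 * d).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def nystrom_attention(Q, K, V, num_landmarks=16):
    n, d = Q.shape
    m = num_landmarks                              # assumes m divides n
    Q_l = Q.reshape(m, n // m, d).mean(axis=1)     # landmark queries
    K_l = K.reshape(m, n // m, d).mean(axis=1)     # landmark keys
    scale = 1.0 / np.sqrt(d)
    F = softmax(Q @ K_l.T * scale)                 # (n, m)
    A = softmax(Q_l @ K_l.T * scale)               # (m, m)
    B = softmax(Q_l @ K.T * scale)                 # (m, n)
    return F @ (np.linalg.pinv(A) @ (B @ V))       # approximates softmax(QK^T/sqrt(d)) @ V
```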