Nystr\"omformer: A Nystr\"om-Based Algorithm for Approximating
Self-Attention
- URL: http://arxiv.org/abs/2102.03902v1
- Date: Sun, 7 Feb 2021 20:06:59 GMT
- Title: Nystr\"omformer: A Nystr\"om-Based Algorithm for Approximating
Self-Attention
- Authors: Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan,
Glenn Fung, Yin Li, Vikas Singh
- Abstract summary: We propose Nyströmformer -- a model that exhibits favorable scalability as a function of sequence length.
The scalability of Nyströmformer enables application to longer sequences with thousands of tokens.
We perform evaluations on multiple downstream tasks on the GLUE benchmark and IMDB reviews with standard sequence length, and find that our Nyströmformer performs comparably, or in a few cases, even slightly better, than the standard Transformer.
- Score: 60.043273122786005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have emerged as a powerful tool for a broad range of natural
language processing tasks. A key component that drives the impressive
performance of Transformers is the self-attention mechanism that encodes the
influence or dependence of other tokens on each specific token. While
beneficial, the quadratic complexity of self-attention on the input sequence
length has limited its application to longer sequences -- a topic being
actively studied in the community. To address this limitation, we propose
Nystr\"omformer -- a model that exhibits favorable scalability as a function of
sequence length. Our idea is based on adapting the Nystr\"om method to
approximate standard self-attention with $O(n)$ complexity. The scalability of
Nystr\"omformer enables application to longer sequences with thousands of
tokens. We perform evaluations on multiple downstream tasks on the GLUE
benchmark and IMDB reviews with standard sequence length, and find that our
Nystr\"omformer performs comparably, or in a few cases, even slightly better,
than standard Transformer. Our code is at
https://github.com/mlpen/Nystromformer.
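To make the approximation concrete, the following is a minimal NumPy sketch of Nyström-style attention: landmark queries and keys are formed as segment means, three small softmax kernels are combined through a pseudo-inverse, and their product replaces the full n-by-n attention map. The segment-mean landmarks and the call to np.linalg.pinv are simplifications assumed for illustration (the paper approximates the Moore-Penrose pseudo-inverse iteratively); this is a sketch of the idea, not the released implementation.

```python
# Minimal sketch of a Nystrom-style approximation of softmax self-attention.
# Assumptions for illustration: landmarks are segment means and the
# pseudo-inverse is computed exactly with np.linalg.pinv.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def nystrom_attention(Q, K, V, num_landmarks=64):
    """Approximate softmax(Q K^T / sqrt(d)) V; cost is linear in n for fixed
    num_landmarks and head dimension d (assumes n divisible by num_landmarks)."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    m = num_landmarks
    # Landmarks: means of Q and K over m contiguous segments.
    Q_lm = Q.reshape(m, n // m, d).mean(axis=1)
    K_lm = K.reshape(m, n // m, d).mean(axis=1)
    F = softmax(Q @ K_lm.T * scale)      # (n, m): queries vs. landmark keys
    A = softmax(Q_lm @ K_lm.T * scale)   # (m, m): landmark queries vs. landmark keys
    B = softmax(Q_lm @ K.T * scale)      # (m, n): landmark queries vs. keys
    return F @ np.linalg.pinv(A) @ (B @ V)   # (n, d)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 1024, 64
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    exact = softmax(Q @ K.T / np.sqrt(d)) @ V
    approx = nystrom_attention(Q, K, V)
    print("mean abs error:", np.abs(exact - approx).mean())
```

With m landmarks, the three kernels are n-by-m, m-by-m, and m-by-n, so time and memory grow linearly in the sequence length n rather than quadratically.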
Related papers
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces [31.985243136674146]
Foundation models are almost universally based on the Transformer architecture and its core attention module.
We identify that a key weakness of such models is their inability to perform content-based reasoning.
We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba).
As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
arXiv Detail & Related papers (2023-12-01T18:01:34Z) - Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers.
arXiv Detail & Related papers (2023-10-03T08:44:50Z) - LongNet: Scaling Transformers to 1,000,000,000 Tokens [146.4077038371075]
LongNet is a Transformer variant that can scale sequence length to more than 1 billion tokens.
Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.
arXiv Detail & Related papers (2023-07-05T17:59:38Z) - Toeplitz Neural Network for Sequence Modeling [46.04964190407727]
We show that a Toeplitz matrix-vector product trick can reduce the space-time complexity of sequence modeling to log-linear (see the FFT-based sketch after this list).
A lightweight sub-network called relative position encoder is proposed to generate relative position coefficients with a fixed budget of parameters.
Despite being trained on 512-token sequences, our model can extrapolate input sequence length up to 14K tokens in inference with consistent performance.
arXiv Detail & Related papers (2023-05-08T14:49:01Z) - Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing
Important Tokens [65.4435926060951]
We propose to significantly improve the efficiency of Transformers for ultra long sequences, by compressing the sequence into a much smaller representation at each layer.
Our algorithm is not only efficient (achieving more than $3\times$ efficiency gain compared to baselines on 4K and 16K lengths) but also offers competitive/better performance on a large number of tasks.
arXiv Detail & Related papers (2023-05-07T10:32:18Z) - You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli
Sampling [38.34914626128062]
We show that a Bernoulli sampling attention mechanism based on Locality Sensitive Hashing (LSH) decreases the quadratic complexity of such models to linear.
We evaluate our algorithm on the GLUE benchmark with the standard sequence length of 512, where we see favorable performance relative to a standard pretrained Transformer.
arXiv Detail & Related papers (2021-11-18T14:24:34Z) - $O(n)$ Connections are Expressive Enough: Universal Approximability of
Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z) - Funnel-Transformer: Filtering out Sequential Redundancy for Efficient
Language Processing [112.2208052057002]
We propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
arXiv Detail & Related papers (2020-06-05T05:16:23Z)