Efficient Content-Based Sparse Attention with Routing Transformers
- URL: http://arxiv.org/abs/2003.05997v5
- Date: Sat, 24 Oct 2020 19:41:17 GMT
- Title: Efficient Content-Based Sparse Attention with Routing Transformers
- Authors: Aurko Roy, Mohammad Saffar, Ashish Vaswani and David Grangier
- Abstract summary: Self-attention suffers from quadratic compute and memory requirements with respect to sequence length.
Our work proposes to learn dynamic sparse attention patterns that avoid allocating computation and memory to content unrelated to the query of interest.
- Score: 34.83683983648021
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-attention has recently been adopted for a wide range of sequence
modeling problems. Despite its effectiveness, self-attention suffers from
quadratic compute and memory requirements with respect to sequence length.
Successful approaches to reduce this complexity focused on attending to local
sliding windows or a small set of locations independent of content. Our work
proposes to learn dynamic sparse attention patterns that avoid allocating
computation and memory to attend to content unrelated to the query of interest.
This work builds upon two lines of research: it combines the modeling
flexibility of prior work on content-based sparse attention with the efficiency
gains from approaches based on local, temporal sparse attention. Our model, the
Routing Transformer, endows self-attention with a sparse routing module based
on online k-means while reducing the overall complexity of attention to
$O\left(n^{1.5}d\right)$ from $O\left(n^2d\right)$ for sequence length $n$ and
hidden dimension $d$. We show that our model outperforms comparable sparse
attention models on language modeling on Wikitext-103 (15.8 vs 18.3 perplexity)
as well as on image generation on ImageNet-64 (3.43 vs 3.44 bits/dim) while
using fewer self-attention layers. Additionally, we set a new state-of-the-art
on the newly released PG-19 data-set, obtaining a test perplexity of 33.2 with
a 22 layer Routing Transformer model trained on sequences of length 8192.
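As a rough illustration of the routing idea described in the abstract, the sketch below implements content-based sparse attention with nearest-centroid cluster assignments and an EMA centroid update. It is a non-causal, unbatched simplification with hypothetical shapes, not the authors' implementation (which uses shared query/key routing and balanced cluster sizes).

```python
import numpy as np

def routing_attention(q, k, v, centroids, ema=0.999):
    """Content-based sparse attention sketch: each query attends only to
    keys routed to the same k-means cluster. Shapes: q, k, v are (n, d),
    centroids is (c, d)."""
    n, d = q.shape
    # Assign queries and keys to their nearest centroid (the "routing" step).
    q_assign = np.argmin(((q[:, None, :] - centroids[None]) ** 2).sum(-1), axis=-1)
    k_assign = np.argmin(((k[:, None, :] - centroids[None]) ** 2).sum(-1), axis=-1)
    out = np.zeros_like(v)
    for c in range(centroids.shape[0]):
        qi = np.where(q_assign == c)[0]
        ki = np.where(k_assign == c)[0]
        if len(qi) == 0 or len(ki) == 0:
            continue
        # Dense attention restricted to the members of one cluster.
        logits = q[qi] @ k[ki].T / np.sqrt(d)
        w = np.exp(logits - logits.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        out[qi] = w @ v[ki]
        # Online (EMA) centroid update, loosely mimicking online k-means.
        centroids[c] = ema * centroids[c] + (1 - ema) * k[ki].mean(0)
    return out, centroids

# Usage: with roughly sqrt(n) clusters of size sqrt(n), attention inside each
# cluster costs O(n * d), giving the O(n^{1.5} d) total the abstract refers to.
n, d, c = 64, 16, 8
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, n, d))
centroids = rng.normal(size=(c, d))
out, centroids = routing_attention(q, k, v, centroids)
print(out.shape)  # (64, 16)
```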
Related papers
- LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba [54.85262314960038]
Local Attentional Mamba blocks capture both global contexts and local details with linear complexity.
Our model exhibits exceptional scalability and surpasses the performance of DiT across various model scales on ImageNet at 256x256 resolution.
Compared to state-of-the-art diffusion models on ImageNet 256x256 and 512x512, our largest model offers notable advantages, such as a reduction of up to 62% in GFLOPs.
arXiv Detail & Related papers (2024-08-05T16:39:39Z)
- MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression [22.038650467915176]
We propose the Mixture of Attention (MoA), which automatically tailors distinct sparse attention configurations to different heads and layers.
MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer sequences, while other heads consistently concentrate on fixed-length local contexts.
arXiv Detail & Related papers (2024-06-21T06:58:37Z)
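A minimal sketch of the per-head heterogeneity MoA exploits, assuming a simple causal sliding-window pattern and an illustrative per-head window list (neither is MoA's actual search space): some heads cover long ranges while others stay local.

```python
import numpy as np

def local_attention_head(q, k, v, window):
    """Sliding-window attention for one head: each position attends to the
    previous `window` positions (inclusive of itself)."""
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    idx = np.arange(n)
    mask = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    logits = np.where(mask, logits, -1e9)
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

# Illustrative per-head budget: some heads look far back, others stay local.
n, d = 128, 16
rng = np.random.default_rng(0)
windows = [8, 8, 32, 128]  # hypothetical per-head configuration
heads = []
for win in windows:
    q, k, v = rng.normal(size=(3, n, d))
    heads.append(local_attention_head(q, k, v, window=win))
out = np.concatenate(heads, axis=-1)
print(out.shape)  # (128, 64)
```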
- HartleyMHA: Self-Attention in Frequency Domain for Resolution-Robust and Parameter-Efficient 3D Image Segmentation [4.48473804240016]
We introduce the HartleyMHA model, which is robust to training image resolution and uses efficient self-attention.
We modify the FNO by using the Hartley transform with shared parameters to reduce the model size by orders of magnitude.
When tested on the BraTS'19 dataset, it achieved better robustness to training image resolution than the other tested models while using less than 1% of their parameters.
arXiv Detail & Related papers (2023-10-05T18:44:41Z)
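For reference, a minimal sketch of the discrete Hartley transform computed from an FFT; its real-valued output is what makes frequency-domain mixing with shared real parameters possible. This is independent of the HartleyMHA code and uses arbitrary shapes.

```python
import numpy as np

def hartley(x, axis=-1):
    """Discrete Hartley transform via FFT: H(k) = Re(X(k)) - Im(X(k)),
    since cas(t) = cos(t) + sin(t) and the DFT kernel is cos(t) - i*sin(t)."""
    X = np.fft.fft(x, axis=axis)
    return X.real - X.imag

def inverse_hartley(h, axis=-1):
    """The DHT is (up to a factor of 1/N) its own inverse."""
    return hartley(h, axis=axis) / h.shape[axis]

x = np.random.default_rng(0).normal(size=(4, 32))
assert np.allclose(inverse_hartley(hartley(x)), x)
```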
- Toeplitz Neural Network for Sequence Modeling [46.04964190407727]
We show that a Toeplitz matrix-vector product trick can reduce the space-time complexity of sequence modeling to log-linear.
A lightweight sub-network called relative position encoder is proposed to generate relative position coefficients with a fixed budget of parameters.
Despite being trained on 512-token sequences, our model extrapolates to input sequences of up to 14K tokens at inference time with consistent performance.
arXiv Detail & Related papers (2023-05-08T14:49:01Z)
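The log-linear complexity comes from the standard circulant-embedding trick for Toeplitz matrix-vector products; below is a minimal NumPy sketch (not the paper's code; SciPy is used only for the dense correctness check).

```python
import numpy as np

def toeplitz_matvec(first_col, first_row, x):
    """Multiply an n x n Toeplitz matrix by a vector in O(n log n) by
    embedding it in a 2n-point circulant matrix and using the FFT."""
    n = len(x)
    # First column of the 2n x 2n circulant matrix that contains T as its
    # top-left block (the padding entry between the two halves is arbitrary).
    c = np.concatenate([first_col, [0.0], first_row[:0:-1]])
    x_pad = np.concatenate([x, np.zeros(n)])
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x_pad))
    return y[:n].real

# Check against the dense product.
from scipy.linalg import toeplitz
rng = np.random.default_rng(0)
col, row, x = rng.normal(size=(3, 6))
row[0] = col[0]  # Toeplitz convention: first entries must agree
T = toeplitz(col, row)
assert np.allclose(T @ x, toeplitz_matvec(col, row, x))
```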
- Efficient Long Sequence Modeling via State Space Augmented Transformer [92.74707853711374]
We propose SPADE, short for State sPace AugmenteD TransformEr.
We augment the bottom layer of SPADE with an SSM and employ efficient local attention methods for the other layers.
Experimental results on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-15T20:51:27Z)
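A loose sketch of the hybrid layout described above, assuming a toy diagonal SSM recurrence and plain causal sliding-window attention (both are stand-ins, not SPADE's actual components).

```python
import numpy as np

def ssm_layer(x, a=0.9, b=1.0, c=1.0):
    """Toy diagonal SSM: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t, O(n) in length.
    Stands in for the global-context state-space bottom layer."""
    h = np.zeros(x.shape[1])
    y = np.zeros_like(x)
    for t in range(len(x)):
        h = a * h + b * x[t]
        y[t] = c * h
    return y

def local_attn_layer(x, window=16):
    """Causal sliding-window self-attention with x used as q, k and v."""
    n, d = x.shape
    i = np.arange(n)
    mask = (i[None] <= i[:, None]) & (i[None] > i[:, None] - window)
    logits = np.where(mask, x @ x.T / np.sqrt(d), -1e9)
    w = np.exp(logits - logits.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ x

def spade_like_stack(x, n_layers=4):
    """Bottom layer captures global context cheaply; the rest stay local."""
    x = x + ssm_layer(x)
    for _ in range(n_layers - 1):
        x = x + local_attn_layer(x)
    return x

x = np.random.default_rng(0).normal(size=(64, 16))
print(spade_like_stack(x).shape)  # (64, 16)
```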
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
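A rough sketch of the cluster-then-aggregate idea, assuming plain k-means on the keys and mean-pooling per cluster (ClusTR's actual clustering and aggregation differ): queries attend to the much shorter aggregated sequence.

```python
import numpy as np

def clustered_kv_attention(q, k, v, n_clusters=16, iters=5):
    """Pool keys/values into cluster representatives, then run dense attention
    of all queries against the shorter aggregated sequence."""
    n, d = k.shape
    rng = np.random.default_rng(0)
    centers = k[rng.choice(n, n_clusters, replace=False)]
    for _ in range(iters):  # plain k-means on the keys
        assign = np.argmin(((k[:, None] - centers[None]) ** 2).sum(-1), -1)
        for c in range(n_clusters):
            if np.any(assign == c):
                centers[c] = k[assign == c].mean(0)
    assign = np.argmin(((k[:, None] - centers[None]) ** 2).sum(-1), -1)
    # Aggregate keys and values by their final cluster assignment.
    k_agg = np.stack([k[assign == c].mean(0) if np.any(assign == c) else np.zeros(d)
                      for c in range(n_clusters)])
    v_agg = np.stack([v[assign == c].mean(0) if np.any(assign == c) else np.zeros(d)
                      for c in range(n_clusters)])
    logits = q @ k_agg.T / np.sqrt(d)   # (n, n_clusters): cost O(n * c * d)
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v_agg

n, d = 256, 32
rng = np.random.default_rng(1)
q, k, v = rng.normal(size=(3, n, d))
print(clustered_kv_attention(q, k, v).shape)  # (256, 32)
```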
- Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
arXiv Detail & Related papers (2021-07-05T18:00:14Z)
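A hedged sketch of the two branches, assuming a random projection matrix in place of the learned dynamic projection and a simple sum of the two branch outputs (the paper's aggregation and normalization differ).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis, keepdims=True))
    return e / e.sum(axis, keepdims=True)

def long_short_attention(q, k, v, w_p, window=8):
    """Long branch: keys/values are compressed to r landmark positions by a
    projection computed from the input itself. Short branch: causal window."""
    n, d = q.shape
    # Dynamic projection p has shape (r, n) and depends on k.
    p = softmax(k @ w_p, axis=0).T            # w_p (d, r) would be learned
    k_long, v_long = p @ k, p @ v             # (r, d): length-independent size
    long_out = softmax(q @ k_long.T / np.sqrt(d)) @ v_long
    # Short branch: plain causal sliding-window attention.
    i = np.arange(n)
    mask = (i[None] <= i[:, None]) & (i[None] > i[:, None] - window)
    logits = np.where(mask, q @ k.T / np.sqrt(d), -1e9)
    short_out = softmax(logits) @ v
    return long_out + short_out               # summed here as a simplification

n, d, r = 128, 16, 8
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, n, d))
w_p = rng.normal(size=(d, r))
print(long_short_attention(q, k, v, w_p).shape)  # (128, 16)
```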
- $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)