Combiner: Full Attention Transformer with Sparse Computation Cost
- URL: http://arxiv.org/abs/2107.05768v1
- Date: Mon, 12 Jul 2021 22:43:11 GMT
- Title: Combiner: Full Attention Transformer with Sparse Computation Cost
- Authors: Hongyu Ren, Hanjun Dai, Zihang Dai, Mengjiao Yang, Jure Leskovec, Dale
Schuurmans, Bo Dai
- Abstract summary: We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers are able to inspire the design of such factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
- Score: 142.10203598824964
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers provide a class of expressive architectures that are extremely
effective for sequence modeling. However, the key limitation of transformers is
their quadratic memory and time complexity $\mathcal{O}(L^2)$ with respect to
the sequence length in attention layers, which restricts application in
extremely long sequences. Most existing approaches leverage sparsity or
low-rank assumptions in the attention matrix to reduce cost, but sacrifice
expressiveness. Instead, we propose Combiner, which provides full attention
capability in each attention head while maintaining low computation and memory
complexity. The key idea is to treat the self-attention mechanism as a
conditional expectation over embeddings at each location, and approximate the
conditional distribution with a structured factorization. Each location can
attend to all other locations, either via direct attention, or through indirect
attention to abstractions, which are again conditional expectations of
embeddings from corresponding local regions. We show that most sparse attention
patterns used in existing sparse transformers are able to inspire the design of
such factorization for full attention, resulting in the same sub-quadratic cost
($\mathcal{O}(L\log(L))$ or $\mathcal{O}(L\sqrt{L})$). Combiner is a drop-in
replacement for attention layers in existing transformers and can be easily
implemented in common frameworks. An experimental evaluation on both
autoregressive and bidirectional sequence tasks demonstrates the effectiveness
of this approach, yielding state-of-the-art results on several image and text
modeling tasks.
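The conditional-expectation view in the abstract can be made concrete with a small sketch. Treating the attention output at position $i$ as $\mathbb{E}_{p(j|i)}[v_j]$, Combiner factorizes $p(j|i)$ for non-direct locations as $p(r_j|i)\,p(j|r_j)$, where $r_j$ is the region containing $j$ and the region contributes through an abstraction (a pooled key plus a conditional expectation of its values). The code below is a minimal illustration of this factorization, not the paper's implementation: it assumes a non-causal setting, fixed contiguous blocks, mean-pooled abstraction keys, and a uniform within-block distribution $p(j|r)$ (so each block's value expectation is simply its mean). With block size near $\sqrt{L}$ this matches the $\mathcal{O}(L\sqrt{L})$ cost regime mentioned above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def combiner_style_attention(q, k, v, block_size):
    """Sketch of Combiner-style factorized full attention (non-causal).

    Simplifying assumptions (not from the paper): fixed contiguous blocks,
    mean-pooled abstraction keys, uniform p(j | block) so the block-level
    value expectation is the block mean.
    q, k, v: (L, d) arrays; L must be divisible by block_size.
    Cost is O(L * (block_size + L / block_size)), i.e. O(L * sqrt(L))
    when block_size ~ sqrt(L).
    """
    L, d = q.shape
    nb = L // block_size
    kb = k.reshape(nb, block_size, d)
    vb = v.reshape(nb, block_size, d)

    k_abs = kb.mean(axis=1)   # abstraction key per block, (nb, d)
    v_exp = vb.mean(axis=1)   # E[v | j in block r] under uniform p(j|r)

    out = np.empty_like(q)
    scale = 1.0 / np.sqrt(d)
    for r in range(nb):
        qi = q[r * block_size:(r + 1) * block_size]      # queries in block r
        # Direct scores against tokens of the same block.
        direct = qi @ kb[r].T * scale                    # (bs, bs)
        # Indirect scores against the abstractions of the other blocks.
        other = np.delete(np.arange(nb), r)
        indirect = qi @ k_abs[other].T * scale           # (bs, nb - 1)
        # One softmax over the union: direct tokens + block abstractions,
        # so the factorized p(j|i) still sums to 1 over all L positions.
        p = softmax(np.concatenate([direct, indirect], axis=-1), axis=-1)
        p_direct, p_blocks = p[:, :block_size], p[:, block_size:]
        # Output = sum_j p(j|i) v_j  +  sum_r p(r|i) E[v | block r].
        out[r * block_size:(r + 1) * block_size] = (
            p_direct @ vb[r] + p_blocks @ v_exp[other]
        )
    return out
```

With `block_size` set near `sqrt(L)`, each query scores roughly $2\sqrt{L}$ entries instead of $L$, yet the combined softmax still assigns (factorized) probability mass to every position, which is the "full attention at sparse cost" property the abstract describes.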
Related papers
- Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z) - Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model [10.473819332984005]
We propose a segmented recurrent transformer (SRformer) that combines segmented (local) attention with recurrent attention.
The proposed model achieves 6-22% higher ROUGE-1 scores than a segmented transformer and outperforms other recurrent transformer approaches.
arXiv Detail & Related papers (2023-05-24T03:47:22Z) - Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for
Long Sequences [16.066338004414092]
Diffuser is a new efficient Transformer for sequence-to-sequence modeling.
It incorporates all token interactions within one attention layer while maintaining low computation and memory costs.
We show its ability to approximate full-attention by analyzing the graph expander property from the spectral perspective.
arXiv Detail & Related papers (2022-10-21T08:13:34Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision
Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z) - CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point
Cloud Learning [81.85951026033787]
In this work, we adopt transformers and incorporate them into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z) - Predicting Attention Sparsity in Transformers [0.9786690381850356]
We propose Sparsefinder, a model trained to identify the sparsity pattern of entmax attention before computing it.
Our work provides a new angle to study model efficiency by doing extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph.
arXiv Detail & Related papers (2021-09-24T20:51:21Z) - Cluster-Former: Clustering-based Sparse Transformer for Long-Range
Dependency Encoding [90.77031668988661]
Cluster-Former is a novel clustering-based sparse Transformer to perform attention across chunked sequences.
The proposed framework is pivoted on two unique types of Transformer layer: Sliding-Window Layer and Cluster-Former Layer.
Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.
arXiv Detail & Related papers (2020-09-13T22:09:30Z) - $O(n)$ Connections are Expressive Enough: Universal Approximability of
Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)