Smart Bird: Learnable Sparse Attention for Efficient and Effective
Transformer
- URL: http://arxiv.org/abs/2108.09193v1
- Date: Fri, 20 Aug 2021 14:22:00 GMT
- Title: Smart Bird: Learnable Sparse Attention for Efficient and Effective
Transformer
- Authors: Chuhan Wu, Fangzhao Wu, Tao Qi, Yongfeng Huang
- Abstract summary: We propose Smart Bird, which is an efficient and effective Transformer with learnable sparse attention.
In Smart Bird, we first compute a sketched attention matrix with a single-head low-dimensional Transformer.
We then sample token pairs based on their probability scores derived from the sketched attention matrix to generate different sparse attention index matrices for different attention heads.
- Score: 51.79399904527525
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Transformer has achieved great success in NLP. However, the quadratic
complexity of the self-attention mechanism in Transformer makes it inefficient
in handling long sequences. Many existing works explore accelerating
Transformers by computing sparse self-attention instead of a dense one, which
usually attends to tokens at certain positions or randomly selected tokens.
However, manually selected or random tokens may be uninformative for context
modeling. In this paper, we propose Smart Bird, which is an efficient and
effective Transformer with learnable sparse attention. In Smart Bird, we first
compute a sketched attention matrix with a single-head low-dimensional
Transformer, which aims to find potentially important interactions between
tokens. We then sample token pairs based on their probability scores derived
from the sketched attention matrix to generate different sparse attention index
matrices for different attention heads. Finally, we select token embeddings
according to the index matrices to form the input of sparse attention networks.
Extensive experiments on six benchmark datasets for different tasks validate
the efficiency and effectiveness of Smart Bird in text modeling.
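As a rough illustration of the three steps above, here is a minimal PyTorch-style sketch. The sketch dimension, the number of sampled pairs per query, and all projection layers are assumptions for demonstration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def smart_bird_attention_sketch(x, num_heads=4, sketch_dim=16, pairs_per_token=8):
    """Illustrative sketch of learnable sparse attention in the Smart Bird style.

    x: (batch, seq_len, hidden_dim). Layer sizes and the sampling scheme are
    assumed for demonstration, not the paper's exact configuration.
    """
    batch, seq_len, hidden_dim = x.shape
    head_dim = hidden_dim // num_heads

    # 1) Sketched attention: a single-head, low-dimensional attention matrix
    #    that cheaply scores potentially important token-token interactions.
    q_s = torch.nn.Linear(hidden_dim, sketch_dim)(x)
    k_s = torch.nn.Linear(hidden_dim, sketch_dim)(x)
    sketch_probs = F.softmax(q_s @ k_s.transpose(-1, -2) / sketch_dim ** 0.5, dim=-1)

    # 2) For each head, sample key indices per query token in proportion to the
    #    sketched attention probabilities, giving head-specific index matrices.
    flat_probs = sketch_probs.reshape(batch * seq_len, seq_len)
    index_matrices = [
        torch.multinomial(flat_probs, pairs_per_token).reshape(batch, seq_len, -1)
        for _ in range(num_heads)
    ]

    # 3) Gather the selected key/value embeddings and attend only over them, so
    #    each query touches pairs_per_token keys instead of all seq_len keys.
    q = torch.nn.Linear(hidden_dim, hidden_dim)(x).view(batch, seq_len, num_heads, head_dim)
    k = torch.nn.Linear(hidden_dim, hidden_dim)(x).view(batch, seq_len, num_heads, head_dim)
    v = torch.nn.Linear(hidden_dim, hidden_dim)(x).view(batch, seq_len, num_heads, head_dim)
    b_idx = torch.arange(batch).view(batch, 1, 1)             # broadcast batch index
    outputs = []
    for h, idx in enumerate(index_matrices):
        k_sel = k[:, :, h][b_idx, idx]                        # (batch, seq, pairs, head_dim)
        v_sel = v[:, :, h][b_idx, idx]
        scores = (q[:, :, h].unsqueeze(2) * k_sel).sum(-1) / head_dim ** 0.5
        attn = F.softmax(scores, dim=-1)                      # (batch, seq, pairs)
        outputs.append((attn.unsqueeze(-1) * v_sel).sum(2))   # (batch, seq, head_dim)
    return torch.cat(outputs, dim=-1)                         # (batch, seq, hidden_dim)
```

Because each query attends to only pairs_per_token sampled keys, the per-head attention cost grows linearly with sequence length rather than quadratically.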
Related papers
- Manifold-Preserving Transformers are Effective for Short-Long Range Encoding [39.14128923434994]
Multi-head self-attention-based Transformers have shown promise in different learning tasks.
We propose TransJect, an encoder model that guarantees a theoretical bound for layer-wise distance preservation between a pair of tokens.
arXiv Detail & Related papers (2023-10-22T06:58:28Z)
- Spike-driven Transformer [31.931401322707995]
Spiking Neural Networks (SNNs) provide an energy-efficient deep learning option due to their unique spike-based event-driven (i.e., spike-driven) paradigm.
In this paper, we incorporate the spike-driven paradigm into Transformer by the proposed Spike-driven Transformer with four unique properties.
It is shown that the Spike-driven Transformer can achieve 77.1% top-1 accuracy on ImageNet-1K, which is the state-of-the-art result in the SNN field.
arXiv Detail & Related papers (2023-07-04T13:00:18Z)
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
- How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers [59.57128476584361]
We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones.
We find that without any input-dependent attention, all models achieve competitive performance.
We show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success.
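As a toy illustration of this probing idea (not the paper's exact protocol), a self-attention head can be run with a fixed, input-independent weight matrix; the uniform-averaging choice below is purely an assumed stand-in for whatever constant matrix is used.

```python
import torch

def constant_attention_head(x, v_proj):
    """PAPA-style probe sketch: the attention weights are a constant matrix
    instead of softmax(QK^T). The uniform matrix here is an illustrative
    assumption. x: (batch, seq_len, dim); v_proj: e.g. torch.nn.Linear(dim, dim).
    """
    seq_len = x.size(1)
    # Every query attends equally to all keys; no query/key projections are used.
    const_attn = torch.full((seq_len, seq_len), 1.0 / seq_len)
    return const_attn @ v_proj(x)        # (batch, seq_len, dim)
```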
arXiv Detail & Related papers (2022-11-07T12:37:54Z)
- Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences [16.066338004414092]
Diffuser is a new efficient Transformer for sequence-to-sequence modeling.
It incorporates all token interactions within one attention layer while maintaining low computation and memory costs.
We show its ability to approximate full-attention by analyzing the graph expander property from the spectral perspective.
arXiv Detail & Related papers (2022-10-21T08:13:34Z)
- Bird-Eye Transformers for Text Generation Models [49.47825106383972]
We propose a new architecture, called bird-eye transformer (BET), which goes one step further to improve the performance of transformers.
Our proposed model achieves better performance than the baseline transformer architectures on all datasets.
arXiv Detail & Related papers (2022-10-08T09:51:15Z)
- Multi-Tailed Vision Transformer for Efficient Inference [44.43126137573205]
Vision Transformer (ViT) has achieved promising performance in image recognition.
In this paper, we propose a Multi-Tailed Vision Transformer (MT-ViT).
MT-ViT adopts multiple tails to produce visual sequences of different lengths for the following Transformer encoder.
arXiv Detail & Related papers (2022-03-03T09:30:55Z)
- Fastformer: Additive Attention Can Be All You Need [51.79399904527525]
We propose Fastformer, which is an efficient Transformer model based on additive attention.
In Fastformer, instead of modeling the pair-wise interactions between tokens, we first use an additive attention mechanism to model global contexts.
In this way, Fastformer can achieve effective context modeling with linear complexity.
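The following sketch shows one way additive attention can model global contexts in linear time, in the spirit of Fastformer; the layer names and the output combination are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn.functional as F

class AdditiveGlobalAttention(torch.nn.Module):
    """Sketch of additive attention for global context modeling."""

    def __init__(self, dim):
        super().__init__()
        self.q_proj = torch.nn.Linear(dim, dim)
        self.k_proj = torch.nn.Linear(dim, dim)
        self.v_proj = torch.nn.Linear(dim, dim)
        self.w_q = torch.nn.Linear(dim, 1)   # scalar importance score per query vector
        self.w_k = torch.nn.Linear(dim, 1)   # scalar importance score per key vector
        self.out = torch.nn.Linear(dim, dim)

    def forward(self, x):                                    # x: (batch, seq, dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Pool all query vectors into one global query via additive attention (O(n)).
        alpha = F.softmax(self.w_q(q), dim=1)                # (batch, seq, 1)
        global_q = (alpha * q).sum(dim=1, keepdim=True)      # (batch, 1, dim)
        # Interact the global query with every key element-wise, then pool again.
        p = global_q * k
        beta = F.softmax(self.w_k(p), dim=1)
        global_k = (beta * p).sum(dim=1, keepdim=True)       # (batch, 1, dim)
        # Broadcast the pooled global context to every value vector.
        return self.out(global_k * v) + q                    # linear in sequence length
```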
arXiv Detail & Related papers (2021-08-20T09:44:44Z)
- DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input.
By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-37% and improves throughput by over 40%.
DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet.
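A compressed view of input-dependent token pruning follows, with a hard top-k selection standing in for the paper's differentiable training-time selection; the scoring head and keep ratio are illustrative assumptions.

```python
import torch

def prune_tokens(x, scorer, keep_ratio=0.34):
    """Keep only the highest-scoring tokens before the next Transformer block.

    x: (batch, seq_len, dim); scorer: e.g. torch.nn.Linear(dim, 1), an assumed
    prediction head that rates how informative each token is.
    """
    batch, seq_len, dim = x.shape
    scores = scorer(x).squeeze(-1)                           # (batch, seq_len)
    keep = max(1, int(seq_len * keep_ratio))
    idx = scores.topk(keep, dim=1).indices                   # indices of kept tokens
    return torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, dim))
```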
arXiv Detail & Related papers (2021-06-03T17:57:41Z)