Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient
Vision Transformers
- URL: http://arxiv.org/abs/2303.13755v1
- Date: Fri, 24 Mar 2023 02:12:28 GMT
- Title: Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient
Vision Transformers
- Authors: Cong Wei and Brendan Duke and Ruowei Jiang and Parham Aarabi and
Graham W. Taylor and Florian Shkurti
- Abstract summary: Vision Transformers (ViT) have shown competitive performance compared to convolutional neural networks (CNNs), though often at high computational cost.
We propose a novel approach to learning instance-dependent attention patterns by devising a lightweight connectivity predictor module.
We show that our method reduces MHSA FLOPs by 48% to 69% while keeping the accuracy drop within 0.4%.
- Score: 34.19166698049552
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViT) have shown competitive performance compared to
convolutional neural networks (CNNs), though they often come with high computational
costs. To this end, previous methods explore different attention patterns by restricting
each token to a fixed number of spatially nearby tokens to accelerate the ViT's
multi-head self-attention (MHSA) operations. However, such structured attention patterns
limit token-to-token connections to spatially relevant ones, disregarding the learned
semantic connections of a full attention mask. In this work, we propose a novel approach
to learning instance-dependent attention patterns by devising a lightweight connectivity
predictor module that estimates the connectivity score of each pair of tokens.
Intuitively, two tokens have a high connectivity score if their features are relevant
either spatially or semantically. Because each token attends to only a small number of
other tokens, the binarized connectivity masks are very sparse by nature and therefore
provide an opportunity to accelerate the network via sparse computations. Equipped with
the learned unstructured attention pattern, sparse attention ViT (Sparsifiner) produces a
superior Pareto-optimal trade-off between FLOPs and top-1 accuracy on ImageNet compared
to token sparsity methods. Our method reduces MHSA FLOPs by 48% to 69% while keeping the
accuracy drop within 0.4%. We also show that combining attention and token sparsity
reduces ViT FLOPs by over 60%.
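To make the mechanism concrete, below is a minimal PyTorch sketch of the core idea described in the abstract: a lightweight connectivity predictor scores every token pair, the scores are binarized into a sparse mask (here via a per-token top-k, which is an assumption), and MHSA is restricted to the masked connections. Module names, the low-rank predictor design, and all hyperparameters are illustrative assumptions rather than the paper's exact architecture.

```python
# Minimal sketch of instance-dependent sparse attention with a lightweight
# connectivity predictor. The low-rank predictor, top-k binarization, and all
# names/shapes are assumptions for illustration, not the paper's exact design.
import torch
import torch.nn as nn


class ConnectivityPredictor(nn.Module):
    """Predicts an N x N connectivity score matrix from token features."""

    def __init__(self, dim, rank=32):
        super().__init__()
        # Low-rank projections keep the predictor lightweight (assumption).
        self.proj_q = nn.Linear(dim, rank)
        self.proj_k = nn.Linear(dim, rank)

    def forward(self, x):  # x: (B, N, dim)
        q = self.proj_q(x)                        # (B, N, rank)
        k = self.proj_k(x)                        # (B, N, rank)
        return q @ k.transpose(-2, -1)            # (B, N, N) connectivity scores


class SparseAttention(nn.Module):
    """MHSA restricted to each token's top-k predicted connections."""

    def __init__(self, dim, num_heads=8, rank=32, topk=16):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.predictor = ConnectivityPredictor(dim, rank)
        self.topk = topk

    def forward(self, x):  # x: (B, N, dim)
        B, N, C = x.shape
        # Binarize connectivity scores: keep only the top-k connections per token.
        scores = self.predictor(x)                                # (B, N, N)
        topk_idx = scores.topk(self.topk, dim=-1).indices         # (B, N, topk)
        mask = torch.zeros_like(scores).scatter_(-1, topk_idx, 1.0)

        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                      # each (B, H, N, hd)
        attn = (q @ k.transpose(-2, -1)) * self.scale             # (B, H, N, N)
        # Dense masking shown for clarity; the reported FLOP savings require
        # sparse kernels that skip the masked-out entries entirely.
        attn = attn.masked_fill(mask.unsqueeze(1) == 0, float('-inf'))
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

In this dense formulation the mask only zeroes out attention weights; the acceleration described in the abstract comes from computing only the unmasked entries with sparse kernels, and the same mask could in principle be combined with token pruning, in the spirit of the paper's combined attention-and-token-sparsity result.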
Related papers
- ToSA: Token Selective Attention for Efficient Vision Transformers [50.13756218204456]
ToSA is a token selective attention approach that identifies tokens that need to be attended to, as well as those that can skip a transformer layer.
We show that ToSA can significantly reduce computation costs while maintaining accuracy on the ImageNet classification benchmark.
arXiv Detail & Related papers (2024-06-13T05:17:21Z)
- AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution to different regions of the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z)
- Learning Feature Matching via Matchable Keypoint-Assisted Graph Neural Network [52.29330138835208]
Accurately matching local features between a pair of images is a challenging computer vision task.
Previous studies typically use attention-based graph neural networks (GNNs) with fully connected graphs over keypoints within/across images.
We propose MaKeGNN, a sparse attention-based GNN architecture which bypasses non-repeatable keypoints and leverages matchable ones to guide message passing.
arXiv Detail & Related papers (2023-07-04T02:50:44Z)
- Multi-Scale And Token Mergence: Make Your ViT More Efficient [3.087140219508349]
Vision Transformer (ViT) has emerged as a prevalent model in the computer vision domain.
We propose a novel token pruning method that retains information from non-crucial tokens by merging them with more crucial tokens.
Our method achieves a remarkable 33% reduction in computational costs while only incurring a 0.1% decrease in accuracy on DeiT-S.
arXiv Detail & Related papers (2023-06-08T02:58:15Z)
- Robustifying Token Attention for Vision Transformers [72.07710236246285]
Vision transformers (ViTs) still suffer from significant drops in accuracy in the presence of common corruptions.
We propose two general techniques to make attention more stable.
First, our Token-aware Average Pooling (TAP) module encourages the local neighborhood of each token to take part in the attention mechanism.
Second, we force the output tokens to aggregate information from a diverse set of input tokens rather than focusing on just a few.
arXiv Detail & Related papers (2023-03-20T14:04:40Z)
- Breaking BERT: Evaluating and Optimizing Sparsified Attention [13.529939025511242]
We evaluate the impact of sparsification patterns with a series of ablation experiments.
We find that attention that is at least 78% sparse can have little effect on performance when applied in later transformer layers.
arXiv Detail & Related papers (2022-10-07T22:32:27Z)
- Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with a minimal cost.
Our method improves the throughput of DeiT-S by 50% with only a 0.2% drop in top-1 accuracy.
arXiv Detail & Related papers (2022-09-28T03:07:32Z)
- PSViT: Better Vision Transformer via Token Pooling and Attention Sharing [114.8051035856023]
We propose PSViT, a ViT with token Pooling and attention Sharing to reduce redundancy.
Experimental results show that the proposed scheme can achieve up to 6.6% accuracy improvement in ImageNet classification.
arXiv Detail & Related papers (2021-08-07T11:30:54Z)
- DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input.
By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31% to 37% and improves throughput by over 40%.
DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet.
arXiv Detail & Related papers (2021-06-03T17:57:41Z)
- KVT: k-NN Attention for Boosting Vision Transformers [44.189475770152185]
We propose a sparse attention scheme, dubbed k-NN attention, for boosting vision transformers.
The proposed k-NN attention naturally inherits the local bias of CNNs without introducing convolutional operations.
We verify, both theoretically and empirically, that k-NN attention is powerful in distilling noise from input tokens and in speeding up training; a minimal sketch of this top-k attention pattern follows this list.
arXiv Detail & Related papers (2021-05-28T06:49:10Z)
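As a brief illustration of the k-NN attention idea from the KVT entry above, the sketch below keeps, for each query, only the top-k attention scores before the softmax. The function name, tensor shapes, and the default k are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch of k-NN attention: each query attends only to its top-k
# keys by dot-product score. Name, shapes, and default k are assumptions.
import torch


def knn_attention(q, k, v, topk=16):
    """q, k, v: (B, H, N, head_dim); returns (B, H, N, head_dim)."""
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale          # (B, H, N, N)
    # Keep only the top-k scores per query; mask out the rest before softmax.
    kth = attn.topk(topk, dim=-1).values[..., -1:]    # k-th largest score per query
    attn = attn.masked_fill(attn < kth, float('-inf'))
    attn = attn.softmax(dim=-1)
    return attn @ v
```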
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.