Accelerating Vision Transformers Based on Heterogeneous Attention
Patterns
- URL: http://arxiv.org/abs/2310.07664v1
- Date: Wed, 11 Oct 2023 17:09:19 GMT
- Title: Accelerating Vision Transformers Based on Heterogeneous Attention
Patterns
- Authors: Deli Yu, Teng Xi, Jianwei Li, Baopu Li, Gang Zhang, Haocheng Feng,
Junyu Han, Jingtuo Liu, Errui Ding, Jingdong Wang
- Abstract summary: Vision Transformers (ViTs) have attracted a lot of attention in the field of computer vision.
We propose an integrated compression pipeline based on observed heterogeneous attention patterns across layers.
Experimentally, the integrated compression pipeline of DGSSA and GLAD improves run-time throughput by up to 121%.
- Score: 89.86293867174324
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, Vision Transformers (ViTs) have attracted a lot of attention in the
field of computer vision. Generally, the powerful representational capacity of
ViTs mainly benefits from the self-attention mechanism, which has high
computational complexity. To accelerate ViTs, we propose an integrated
compression pipeline based on observed heterogeneous attention patterns across
layers. On one hand, different images share more similar attention patterns in
early layers than in later layers, indicating that the dynamic query-by-key
self-attention matrix may be replaced with a static self-attention matrix in
early layers. We therefore propose a dynamic-guided static self-attention (DGSSA)
method, in which the static matrix inherits self-attention information from the
dynamic self-attention it replaces, to effectively improve the feature
representation ability of ViTs. On the other hand, the attention maps exhibit
more low-rank patterns, which reflect token redundancy, in later layers than in
early layers. From the perspective of linear dimension reduction, we further
propose a global aggregation pyramid (GLAD) method to reduce the number of
tokens in the later layers of ViTs such as DeiT. Experimentally, the integrated
compression pipeline of DGSSA and GLAD improves run-time throughput by up to
121% compared with DeiT, surpassing all SOTA approaches.
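The abstract describes DGSSA and GLAD only at a high level, so the following is a minimal, hedged PyTorch sketch of the general idea rather than the authors' implementation: a learned static attention matrix (optionally initialized from a trained dynamic layer's attention map) stands in for query-by-key attention in early layers, and a pooling-based aggregation reduces tokens in later layers. The names `StaticAttentionBlock` and `pyramid_aggregate`, the initialization scheme, and the pooling operator are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional


class StaticAttentionBlock(nn.Module):
    """Early-layer attention with a static, learned attention matrix.

    Sketch of the DGSSA idea: instead of computing query-by-key attention
    per image, a single (num_tokens x num_tokens) matrix is shared across
    images. Initializing it from the attention map of the dynamic layer it
    replaces is one way to make it "dynamic-guided"; the exact transfer
    scheme here is an assumption, not the paper's procedure.
    """

    def __init__(self, num_tokens: int, dim: int,
                 init_attn: Optional[torch.Tensor] = None):
        super().__init__()
        attn = init_attn if init_attn is not None else torch.eye(num_tokens)
        self.static_attn = nn.Parameter(attn.clone())  # learned jointly with the rest
        self.value = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim). No Q/K products are computed, so the
        # quadratic query-by-key cost is avoided at inference time.
        attn = torch.softmax(self.static_attn, dim=-1)
        return self.proj(attn @ self.value(x))


def pyramid_aggregate(x: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Later-layer token reduction in the spirit of a global aggregation
    pyramid: tokens are linearly aggregated into a smaller set, here by
    adaptive average pooling along the token axis. GLAD's actual
    aggregation operator may differ.
    """
    b, n, d = x.shape
    out_n = max(1, int(n * keep_ratio))
    return F.adaptive_avg_pool1d(x.transpose(1, 2), out_n).transpose(1, 2)


if __name__ == "__main__":
    tokens = torch.randn(2, 197, 384)                 # a DeiT-S-like token sequence
    block = StaticAttentionBlock(num_tokens=197, dim=384)
    early = block(tokens)                             # static attention, early layer
    late = pyramid_aggregate(early, keep_ratio=0.5)   # fewer tokens, later layer
    print(early.shape, late.shape)                    # (2, 197, 384) (2, 98, 384)
```

In a full pipeline, early blocks would use the static attention while later blocks keep dynamic attention interleaved with token aggregation, which is where the reported throughput gains over DeiT would come from.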
Related papers
- You Only Need Less Attention at Each Stage in Vision Transformers [19.660385306028047]
Vision Transformers (ViTs) capture the global information of images through self-attention modules.
We propose the Less-Attention Vision Transformer (LaViT), which computes only a few attention operations at each stage.
Our architecture demonstrates exceptional performance across various vision tasks including classification, detection and segmentation.
arXiv Detail & Related papers (2024-06-01T12:49:16Z)
- Multi-Dimensional Hyena for Spatial Inductive Bias [69.3021852589771]
We present a data-efficient vision transformer that does not rely on self-attention.
Instead, it employs a novel generalization to multiple axes of the very recent Hyena layer.
We show that a hybrid approach that is based on Hyena N-D for the first layers in ViT, followed by layers that incorporate conventional attention, consistently boosts the performance of various vision transformer architectures.
arXiv Detail & Related papers (2023-09-24T10:22:35Z)
- Laplacian-Former: Overcoming the Limitations of Vision Transformers in Local Texture Detection [3.784298636620067]
Vision Transformer (ViT) models have demonstrated a breakthrough in a wide range of computer vision tasks.
These models struggle to capture high-frequency components of images, which can limit their ability to detect local textures and edge information.
We propose a new technique, Laplacian-Former, that enhances the self-attention map by adaptively re-calibrating the frequency information in a Laplacian pyramid.
arXiv Detail & Related papers (2023-08-31T19:56:14Z)
- A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With less than 14M parameters, our FCViT-S12 outperforms related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z)
- ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention [23.874485033096917]
Vision Transformer (ViT) has emerged as a competitive alternative to convolutional neural networks for various computer vision applications.
We propose a first-of-its-kind algorithm-hardware co-designed framework, dubbed ViTALiTy, for boosting the inference efficiency of ViTs.
ViTALiTy unifies both low-rank and sparse components of the attention in ViTs.
arXiv Detail & Related papers (2022-11-09T18:58:21Z)
- On Improving Adversarial Transferability of Vision Transformers [97.17154635766578]
Vision transformers (ViTs) process input images as sequences of patches via self-attention.
We study the adversarial feature space of ViT models and their transferability.
We introduce two novel strategies specific to the architecture of ViT models.
arXiv Detail & Related papers (2021-06-08T08:20:38Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps and increase their diversity (a minimal sketch of this idea follows the list).
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
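The DeepViT entry above only states the Re-attention idea at a high level; the sketch below illustrates one plausible reading, assuming a learnable H x H matrix that blends the per-head attention maps before they are applied to the values. The normalization and initialization details are assumptions and do not reproduce the DeepViT implementation.

```python
import torch
import torch.nn as nn


class ReAttention(nn.Module):
    """Sketch: mix per-head attention maps with a learnable H x H matrix
    to increase their diversity across heads (the Re-attention idea).
    Hyper-parameters and normalization choices here are assumptions."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.head_mix = nn.Parameter(torch.eye(num_heads))  # Theta: H x H
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        h = self.num_heads
        # Split into per-head queries, keys, and values: each (b, h, n, d // h).
        q, k, v = self.qkv(x).reshape(b, n, 3, h, d // h).permute(2, 0, 3, 1, 4)
        attn = torch.softmax((q @ k.transpose(-2, -1)) * self.scale, dim=-1)
        # Blend the attention maps across heads before applying them to V.
        attn = torch.einsum("gh,bhij->bgij", self.head_mix, attn)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)
```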
This list is automatically generated from the titles and abstracts of the papers in this site.