Training-Free Acceleration of ViTs with Delayed Spatial Merging
- URL: http://arxiv.org/abs/2303.02331v2
- Date: Mon, 1 Jul 2024 10:16:38 GMT
- Title: Training-Free Acceleration of ViTs with Delayed Spatial Merging
- Authors: Jung Hwan Heo, Seyedarmin Azizi, Arash Fayyazi, Massoud Pedram
- Abstract summary: Token merging has emerged as a new paradigm that can accelerate the inference of Vision Transformers (ViTs) without any retraining or fine-tuning.
We improve token merging by adding the perspectives of 1) activation outliers and 2) hierarchical representations.
We build a unified inference framework called DSM: Delayed Spatial Merging.
- Score: 4.523939613157408
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Token merging has emerged as a new paradigm that can accelerate the inference of Vision Transformers (ViTs) without any retraining or fine-tuning. To push the frontier of training-free acceleration in ViTs, we improve token merging by adding the perspectives of 1) activation outliers and 2) hierarchical representations. Through a careful analysis of the attention behavior in ViTs, we characterize a delayed onset of the convergent attention phenomenon, which makes token merging undesirable in the bottom blocks of ViTs. Moreover, we augment token merging with a hierarchical processing scheme to capture multi-scale redundancy between visual tokens. Combining these two insights, we build a unified inference framework called DSM: Delayed Spatial Merging. We extensively evaluate DSM on various ViT model scales (Tiny to Huge) and tasks (ImageNet-1k and transfer learning), achieving up to 1.8$\times$ FLOP reduction and 1.6$\times$ throughput speedup at a negligible loss while being two orders of magnitude faster than existing methods.
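The idea of delaying token merging past the bottom blocks can be illustrated with a toy sketch. It is not the authors' DSM implementation: the function names, the `delay` and `r` parameters, and the simplified bipartite matching with plain averaging are all assumptions introduced here, and the transformer block computation and the hierarchical (multi-scale) scheme mentioned in the abstract are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two token vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_tokens(tokens, r):
    """Merge up to r tokens with a simplified bipartite matching (assumed scheme).

    Tokens are split alternately into sets A and B. Each A token is matched to
    its most similar B token, and the r most similar matches are merged by
    plain averaging. Token order is not preserved in this toy version.
    """
    set_a, set_b = tokens[0::2], tokens[1::2]
    r = min(r, len(set_a))
    if r <= 0 or not set_b:
        return list(tokens)

    # For every token in A, record its best partner in B and the similarity.
    best = []
    for i, a in enumerate(set_a):
        sims = [cosine(a, b) for b in set_b]
        j = max(range(len(set_b)), key=sims.__getitem__)
        best.append((sims[j], i, j))

    # Keep only the r most similar (A, B) matches.
    best.sort(reverse=True)
    merged_away = {i: j for _, i, j in best[:r]}

    # Average each B token with the A tokens merged into it.
    groups = [[b] for b in set_b]
    for i, j in merged_away.items():
        groups[j].append(set_a[i])
    merged_b = [[sum(col) / len(g) for col in zip(*g)] for g in groups]

    kept_a = [a for i, a in enumerate(set_a) if i not in merged_away]
    return kept_a + merged_b


def run_with_delay(tokens, num_blocks, r, delay):
    """Apply merging only from block index `delay` onward (hypothetical schedule).

    Returns the final tokens and the token count entering each block.
    The block computation itself is omitted.
    """
    counts = []
    for block in range(num_blocks):
        counts.append(len(tokens))
        if block >= delay:
            tokens = merge_tokens(tokens, r)
    return tokens, counts
```

With 8 input tokens, 4 blocks, `r=2`, and `delay=2`, the first two blocks see all 8 tokens and merging only starts at the third block, so later blocks process progressively fewer tokens.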
Related papers
- Improving Interpretation Faithfulness for Vision Transformers [42.86486715574245]
Vision Transformers (ViTs) have achieved state-of-the-art performance for various vision tasks.
ViTs suffer from issues with explanation faithfulness, as their focal points are fragile to adversarial attacks.
We propose a rigorous approach to mitigate these issues by introducing Faithful ViTs (FViTs).
arXiv Detail & Related papers (2023-11-29T18:51:21Z)
- I&S-ViT: An Inclusive & Stable Method for Pushing the Limit of Post-Training ViTs Quantization [49.17407185195788]
We introduce I&S-ViT, a novel method that regulates the PTQ of ViTs in an inclusive and stable fashion.
I&S-ViT elevates the performance of 3-bit ViT-B by an impressive 50.68%.
arXiv Detail & Related papers (2023-11-16T13:07:47Z)
- Accelerating Vision Transformers Based on Heterogeneous Attention Patterns [89.86293867174324]
Vision Transformers (ViTs) have attracted a lot of attention in the field of computer vision.
We propose an integrated compression pipeline based on observed heterogeneous attention patterns across layers.
Experimentally, the integrated compression pipeline of DGSSA and GLAD can accelerate run-time throughput by up to 121%.
arXiv Detail & Related papers (2023-10-11T17:09:19Z)
- CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs [79.54107547233625]
Vision Transformers (ViTs) have emerged as state-of-the-art models for various vision tasks.
We propose a joint compression method for ViTs that offers both high accuracy and fast inference speed.
Our proposed method can achieve state-of-the-art performance across various ViTs.
arXiv Detail & Related papers (2023-09-27T16:12:07Z)
- Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice [111.47461527901318]
Vision Transformer (ViT) has recently demonstrated promise in computer vision problems.
ViT performance saturates quickly as depth increases, due to the observed attention collapse or patch uniformity.
We propose two techniques to mitigate the undesirable low-pass limitation.
arXiv Detail & Related papers (2022-03-09T23:55:24Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
- On Improving Adversarial Transferability of Vision Transformers [97.17154635766578]
Vision transformers (ViTs) process input images as sequences of patches via self-attention.
We study the adversarial feature space of ViT models and their transferability.
We introduce two novel strategies specific to the architecture of ViT models.
arXiv Detail & Related papers (2021-06-08T08:20:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.