Training-Free Acceleration of ViTs with Delayed Spatial Merging
- URL: http://arxiv.org/abs/2303.02331v2
- Date: Mon, 1 Jul 2024 10:16:38 GMT
- Title: Training-Free Acceleration of ViTs with Delayed Spatial Merging
- Authors: Jung Hwan Heo, Seyedarmin Azizi, Arash Fayyazi, Massoud Pedram
- Abstract summary: Token merging has emerged as a new paradigm that can accelerate the inference of Vision Transformers (ViTs) without any retraining or fine-tuning.
We improve token merging by adding the perspectives of 1) activation outliers and 2) hierarchical representations.
We build a unified inference framework called DSM: Delayed Spatial Merging.
- Score: 4.523939613157408
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Token merging has emerged as a new paradigm that can accelerate the inference of Vision Transformers (ViTs) without any retraining or fine-tuning. To push the frontier of training-free acceleration in ViTs, we improve token merging by adding the perspectives of 1) activation outliers and 2) hierarchical representations. Through a careful analysis of the attention behavior in ViTs, we characterize a delayed onset of the convergent attention phenomenon, which makes token merging undesirable in the bottom blocks of ViTs. Moreover, we augment token merging with a hierarchical processing scheme to capture multi-scale redundancy between visual tokens. Combining these two insights, we build a unified inference framework called DSM: Delayed Spatial Merging. We extensively evaluate DSM on various ViT model scales (Tiny to Huge) and tasks (ImageNet-1k and transfer learning), achieving up to 1.8$\times$ FLOP reduction and 1.6$\times$ throughput speedup at a negligible loss while being two orders of magnitude faster than existing methods.
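The abstract describes delayed merging only at a high level, so below is a minimal, self-contained sketch of the idea as it reads from the abstract: token merging is disabled in the bottom blocks, where attention has not yet converged, and a coarse spatial merge is applied at one or more later blocks. The toy block stack, the merge positions (merge_at), and the 2x2 average-pooling merge operator are illustrative assumptions, not the authors' DSM implementation or schedule.

```python
# Sketch only: the merge schedule and operator below are assumptions for
# illustration, not the DSM implementation from the paper.
import torch
import torch.nn as nn


def spatial_merge_2x2(tokens: torch.Tensor, grid: int) -> tuple[torch.Tensor, int]:
    """Average-pool a (B, 1 + grid*grid, D) token sequence over 2x2 windows.

    The class token at index 0 is kept; the spatial grid shrinks by 2x per side.
    """
    cls_tok, spatial = tokens[:, :1], tokens[:, 1:]
    b, _, d = spatial.shape
    spatial = spatial.transpose(1, 2).reshape(b, d, grid, grid)   # (B, D, H, W)
    spatial = nn.functional.avg_pool2d(spatial, kernel_size=2)    # (B, D, H/2, W/2)
    spatial = spatial.flatten(2).transpose(1, 2)                  # (B, H*W/4, D)
    return torch.cat([cls_tok, spatial], dim=1), grid // 2


class DelayedMergeViT(nn.Module):
    """Toy ViT trunk: merging fires only at the (later) blocks in merge_at."""

    def __init__(self, depth: int = 12, dim: int = 192, merge_at: tuple[int, ...] = (6, 9)):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=3, batch_first=True)
            for _ in range(depth)
        )
        self.merge_at = set(merge_at)  # no merging in the blocks before the first entry

    def forward(self, tokens: torch.Tensor, grid: int) -> torch.Tensor:
        for i, blk in enumerate(self.blocks):
            tokens = blk(tokens)
            if i in self.merge_at and grid % 2 == 0:
                # Every block after this point runs on 4x fewer spatial tokens.
                tokens, grid = spatial_merge_2x2(tokens, grid)
        return tokens


if __name__ == "__main__":
    x = torch.randn(2, 1 + 16 * 16, 192)      # CLS token + a 16x16 patch grid
    out = DelayedMergeViT()(x, grid=16)
    print(out.shape)                          # torch.Size([2, 17, 192])
```

With merges at blocks 6 and 9, the 16x16 grid shrinks to 8x8 and then 4x4, so the upper blocks attend over far fewer tokens; that is where the FLOP and throughput savings reported in the abstract come from, although the paper's actual merge criterion (which accounts for activation outliers) and hierarchical schedule differ from this toy.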
Related papers
- Improving Interpretation Faithfulness for Vision Transformers [42.86486715574245]
Vision Transformers (ViTs) have achieved state-of-the-art performance for various vision tasks.
ViTs suffer from issues with explanation faithfulness, as their focal points are fragile under adversarial attacks.
We propose a rigorous approach to mitigate these issues by introducing Faithful ViTs (FViTs)
arXiv Detail & Related papers (2023-11-29T18:51:21Z)
- I&S-ViT: An Inclusive & Stable Method for Pushing the Limit of Post-Training ViTs Quantization [63.07712842509526]
We introduce I&S-ViT, a novel method that regulates the PTQ of ViTs in an inclusive and stable fashion.
I&S-ViT elevates the performance of 3-bit ViT-B by an impressive 50.68%.
arXiv Detail & Related papers (2023-11-16T13:07:47Z)
- Accelerating Vision Transformers Based on Heterogeneous Attention Patterns [89.86293867174324]
Vision Transformers (ViTs) have attracted a lot of attention in the field of computer vision.
We propose an integrated compression pipeline based on observed heterogeneous attention patterns across layers.
Experimentally, the integrated compression pipeline of DGSSA and GLAD can accelerate run-time throughput by up to 121%.
arXiv Detail & Related papers (2023-10-11T17:09:19Z)
- CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs [79.54107547233625]
Vision Transformers (ViTs) have emerged as state-of-the-art models for various vision tasks.
We propose a joint compression method for ViTs that offers both high accuracy and fast inference speed.
Our proposed method can achieve state-of-the-art performance across various ViTs.
arXiv Detail & Related papers (2023-09-27T16:12:07Z)
- Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice [111.47461527901318]
Vision Transformer (ViT) has recently demonstrated promise in computer vision problems.
ViT performance saturates quickly as depth increases, due to the observed attention collapse or patch uniformity.
We propose two techniques to mitigate the undesirable low-pass limitation.
arXiv Detail & Related papers (2022-03-09T23:55:24Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
- On Improving Adversarial Transferability of Vision Transformers [97.17154635766578]
Vision transformers (ViTs) process input images as sequences of patches via self-attention.
We study the adversarial feature space of ViT models and their transferability.
We introduce two novel strategies specific to the architecture of ViT models.
arXiv Detail & Related papers (2021-06-08T08:20:38Z)