Speed-up of Vision Transformer Models by Attention-aware Token Filtering
- URL: http://arxiv.org/abs/2506.01519v1
- Date: Mon, 02 Jun 2025 10:34:55 GMT
- Title: Speed-up of Vision Transformer Models by Attention-aware Token Filtering
- Authors: Takahiro Naruko, Hiroaki Akutsu
- Abstract summary: We propose a novel speed-up method for ViT models called Attention-aware Token Filtering (ATF). ATF consists of two main ideas: a novel token filtering module and a filtering strategy. ATF provides a $2.8\times$ speed-up to a ViT model, SigLIP, while maintaining the retrieval recall rate.
- Score: 6.061938153713551
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformer (ViT) models have made breakthroughs in image embedding extraction, which provides state-of-the-art performance in tasks such as zero-shot image classification. However, the models suffer from a high computational burden. In this paper, we propose a novel speed-up method for ViT models called Attention-aware Token Filtering (ATF). ATF consists of two main ideas: a novel token filtering module and a filtering strategy. The token filtering module is introduced between the tokenizer and the transformer encoder of the ViT model, without modifying or fine-tuning the transformer encoder. The module filters the tokens fed to the encoder, dynamically keeping tokens in regions of specific object types and statically keeping tokens in regions that receive high attention in the transformer encoder. This filtering strategy maintains task accuracy while reducing the number of tokens fed to the transformer encoder. Evaluation results on retrieval tasks show that ATF provides a $2.8\times$ speed-up to a ViT model, SigLIP, while maintaining the retrieval recall rate.
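The abstract describes the architecture only at a high level, so the following is a minimal, hypothetical PyTorch sketch of an ATF-style filtering module placed between the tokenizer and a frozen transformer encoder. Everything concrete in it (the TokenFilter class, the learned dynamic_scorer, the static_prior tensor, and the keep ratio) is an assumption for illustration, not the authors' implementation.

```python
# Hypothetical sketch of ATF-style token filtering (not the authors' code).
# It combines a dynamic, content-based score with a static attention prior
# and keeps only the highest-scoring patch tokens before the encoder runs.
import torch
import torch.nn as nn

class TokenFilter(nn.Module):
    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio  # fraction of tokens kept (assumed value)
        # Lightweight per-token scorer standing in for the paper's dynamic,
        # object-type-aware filtering; the real criterion is not shown here.
        self.dynamic_scorer = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, tokens: torch.Tensor, static_prior: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) patch embeddings from the tokenizer
        # static_prior: (N,) per-position score for regions that statically
        # receive high attention in the encoder (assumed to be precomputed)
        b, n, _ = tokens.shape
        dynamic = self.dynamic_scorer(tokens).squeeze(-1)       # (B, N)
        score = dynamic + static_prior.unsqueeze(0)             # combine both cues
        k = max(1, int(n * self.keep_ratio))
        keep = score.topk(k, dim=1).indices.sort(dim=1).values  # keep spatial order
        batch = torch.arange(b, device=tokens.device).unsqueeze(1)
        return tokens[batch, keep]                               # (B, k, D)

# The reduced token set is then passed to the unmodified, frozen transformer encoder.
tokens = torch.randn(2, 196, 768)      # e.g. 14x14 patches at ViT-B width
static_prior = torch.rand(196)         # stand-in for the static attention prior
filtered = TokenFilter(dim=768)(tokens, static_prior)   # (2, 98, 768)
```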
Related papers
- Frequency-Dynamic Attention Modulation for Dense Prediction [14.066404173580864]
We propose a circuit-theory-inspired strategy called Frequency-Dynamic Attention Modulation (FDAM). FDAM directly modulates the overall frequency response of Vision Transformers (ViTs).
arXiv Detail & Related papers (2025-07-16T07:59:54Z)
- IoT Botnet Detection: Application of Vision Transformer to Classification of Network Flow Traffic [0.0]
This work introduces a novel preprocessing method to adapt transformer models, the vision transformer (ViT) in particular, for IoT botnet attack detection using network flow packets. The approach involves feature extraction from .pcap files and transforming each instance into a 1-channel 2D image, enabling ViT-based classification.
arXiv Detail & Related papers (2025-04-26T03:19:19Z)
- A temporal scale transformer framework for precise remaining useful life prediction in fuel cells [10.899223392837936]
Temporal Scale Transformer (TSTransformer) is an enhanced version of the inverted Transformer (iTransformer). Unlike traditional Transformers that treat each timestep as an input token, TSTransformer maps sequences of varying lengths into tokens at different stages for inter-sequence modeling. It improves local feature extraction, captures temporal scale characteristics, and reduces token count and computational costs.
arXiv Detail & Related papers (2025-04-08T23:42:54Z)
- Your ViT is Secretly an Image Segmentation Model [50.71238842539735]
Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. We show that inductive biases introduced by task-specific components can instead be learned by the ViT itself. We introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation.
arXiv Detail & Related papers (2025-03-24T19:56:02Z)
- Identity-Preserving Text-to-Video Generation by Frequency Decomposition [52.19475797580653]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature. We propose ConsisID, a tuning-free DiT-based controllable IPT2V model that keeps human identity consistent in the generated video.
arXiv Detail & Related papers (2024-11-26T13:58:24Z)
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
When running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a $1.9\times$ speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice [111.47461527901318]
Vision Transformer (ViT) has recently demonstrated promise in computer vision problems.
ViT saturates quickly as depth increases, due to the observed attention collapse or patch uniformity.
We propose two techniques to mitigate the undesirable low-pass limitation.
arXiv Detail & Related papers (2022-03-09T23:55:24Z)
- ATS: Adaptive Token Sampling For Efficient Vision Transformers [33.297806854292155]
We introduce a differentiable parameter-free Adaptive Token Sampling (ATS) module, which can be plugged into any existing vision transformer architecture.
ATS empowers vision transformers by scoring and adaptively sampling significant tokens.
Our evaluations show that the proposed module improves the state-of-the-art by reducing the computational cost (GFLOPs) by 37% while preserving the accuracy.
arXiv Detail & Related papers (2021-11-30T18:56:57Z)
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
- DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input.
By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-37% and improves the throughput by over 40% (see the token-pruning sketch after this list).
DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet.
arXiv Detail & Related papers (2021-06-03T17:57:41Z)
- Do We Really Need Explicit Position Encodings for Vision Transformers? [29.7662570764424]
We propose a conditional position encoding scheme, which is conditioned on the local neighborhood of the input token.
Our new model with PEG is named Conditional Position encoding Vision Transformer (CPVT) and can naturally process input sequences of arbitrary length.
We demonstrate that CPVT can result in visually similar attention maps and even better performance than those with predefined positional encodings.
arXiv Detail & Related papers (2021-02-22T10:29:55Z)
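Several of the entries above, ATS and DynamicViT in particular, share one mechanism: score tokens inside the encoder and drop the least significant ones between blocks. The sketch below is a minimal illustration of that shared idea under assumed details (CLS-attention scoring and a fixed keep ratio); it is not the code of either paper.

```python
# Minimal sketch of score-based token pruning between encoder blocks
# (an illustration of the ATS/DynamicViT idea, not either paper's code).
import torch

def prune_tokens(tokens: torch.Tensor, cls_attention: torch.Tensor,
                 keep_ratio: float = 0.7) -> torch.Tensor:
    """tokens: (B, 1 + N, D) with a CLS token at index 0.
    cls_attention: (B, N) attention weights from CLS to the N patch tokens,
    e.g. averaged over heads in the preceding block (an assumed scoring rule)."""
    b, n_plus_1, _ = tokens.shape
    n = n_plus_1 - 1
    k = max(1, int(n * keep_ratio))
    idx = cls_attention.topk(k, dim=1).indices.sort(dim=1).values   # (B, k)
    batch = torch.arange(b, device=tokens.device).unsqueeze(1)
    kept_patches = tokens[:, 1:][batch, idx]                         # (B, k, D)
    return torch.cat([tokens[:, :1], kept_patches], dim=1)           # (B, 1 + k, D)

# Applied between encoder blocks, each later block sees fewer tokens,
# which is where the reported FLOP and throughput savings come from.
x = torch.randn(2, 197, 768)      # CLS + 196 patch tokens
attn = torch.rand(2, 196)         # stand-in for CLS attention weights
x = prune_tokens(x, attn)         # (2, 138, 768) with keep_ratio=0.7
```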
This list is automatically generated from the titles and abstracts of the papers on this site.