AiluRus: A Scalable ViT Framework for Dense Prediction
- URL: http://arxiv.org/abs/2311.01197v1
- Date: Thu, 2 Nov 2023 12:48:43 GMT
- Title: AiluRus: A Scalable ViT Framework for Dense Prediction
- Authors: Jin Li, Yaoming Wang, Xiaopeng Zhang, Bowen Shi, Dongsheng Jiang,
Chenglin Li, Wenrui Dai, Hongkai Xiong, Qi Tian
- Abstract summary: Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
- Score: 95.1313839257891
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers (ViTs) have emerged as a prevalent architecture for
vision tasks owing to their impressive performance. However, when it comes to
handling long token sequences, especially in dense prediction tasks that
require high-resolution input, the complexity of ViTs increases significantly.
Notably, dense prediction tasks, such as semantic segmentation or object
detection, rely more on the contours or shapes of objects, while the texture
inside objects is less informative. Motivated by this observation, we propose
to apply adaptive resolution to different regions of the image according to
their importance. Specifically, at an intermediate layer of the
ViT, we utilize a spatial-aware density-based clustering algorithm to select
representative tokens from the token sequence. Once the representative tokens
are determined, we proceed to merge other tokens into their closest
representative token. Consequently, semantically similar tokens are merged to
form low-resolution regions, while semantically dissimilar tokens are preserved
independently as high-resolution regions. This strategy effectively reduces the
number of tokens, allowing subsequent layers to handle a reduced token sequence
and achieve acceleration. We evaluate our proposed method on three different
datasets and observe promising performance. For example, the "Segmenter ViT-L"
model can be accelerated by 48% in FPS without fine-tuning while maintaining
its performance. Additionally, our method can be applied to accelerate
fine-tuning as well. Experimental results demonstrate that we can save 52% of
the training time and achieve a 2.46x speedup in FPS with only a 0.09%
performance drop. The code
is available at https://github.com/caddyless/ailurus/tree/main.
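The reduction step described in the abstract (select representative tokens with a spatial-aware, density-based criterion, then merge every remaining token into its nearest representative) can be illustrated with a short PyTorch sketch. This is a minimal sketch under assumed details: the density-peak-style score, plain L2 distances, mean-pooled merging, and the `merge_tokens` function name are illustrative choices, not the AiluRus implementation; see the linked repository for the actual code.
```python
import torch


def merge_tokens(x, coords, num_keep, alpha=1.0):
    """Reduce a token sequence (B, N, C) to (B, num_keep, C).

    x: token features (B, N, C); coords: normalized 2-D token positions (B, N, 2);
    alpha: assumed weight of the spatial term in the distance.
    """
    B, N, C = x.shape
    # Spatial-aware distance: feature distance plus weighted spatial distance.
    dist = torch.cdist(x, x) + alpha * torch.cdist(coords, coords)        # (B, N, N)

    # Density-peak-style scoring: density = soft count of close neighbours,
    # delta = distance to the nearest token of higher density.
    sigma = dist.mean(dim=(1, 2), keepdim=True)
    density = torch.exp(-((dist / sigma) ** 2)).sum(-1)                   # (B, N)
    higher = density.unsqueeze(1) > density.unsqueeze(2)                  # [b, i, j]: j denser than i?
    delta = dist.masked_fill(~higher, float("inf")).amin(-1)              # (B, N)
    max_d = dist.flatten(1).amax(-1, keepdim=True)                        # (B, 1)
    delta = torch.where(torch.isinf(delta), max_d, delta)                 # densest token gets max delta

    # Keep the top-scoring tokens as representatives.
    keep_idx = (density * delta).topk(num_keep, dim=-1).indices           # (B, K)
    rep_feat = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, C))
    rep_pos = torch.gather(coords, 1, keep_idx.unsqueeze(-1).expand(-1, -1, 2))

    # Assign every token to its nearest representative and average the features.
    assign = (torch.cdist(x, rep_feat) + alpha * torch.cdist(coords, rep_pos)).argmin(-1)  # (B, N)
    one_hot = torch.nn.functional.one_hot(assign, num_keep).type_as(x)    # (B, N, K)
    merged = one_hot.transpose(1, 2) @ x                                  # (B, K, C) summed features
    counts = one_hot.sum(1).clamp(min=1).unsqueeze(-1)                    # (B, K, 1)
    return merged / counts, keep_idx


if __name__ == "__main__":
    B, N, C, K = 2, 196, 384, 64                      # e.g. 14x14 patch tokens
    tokens = torch.randn(B, N, C)
    ys, xs = torch.meshgrid(torch.linspace(0, 1, 14),
                            torch.linspace(0, 1, 14), indexing="ij")
    coords = torch.stack([ys, xs], -1).reshape(1, N, 2).expand(B, -1, -1).contiguous()
    reduced, idx = merge_tokens(tokens, coords, K)
    print(reduced.shape)                              # torch.Size([2, 64, 384])
```
Subsequent transformer layers would then operate on the reduced sequence; the kept indices allow the merged regions to be scattered back to the full resolution for dense prediction.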
Related papers
- ToSA: Token Selective Attention for Efficient Vision Transformers [50.13756218204456]
ToSA is a token-selective attention approach that identifies tokens that need to be attended to, as well as those that can skip a transformer layer.
We show that ToSA can significantly reduce computation costs while maintaining accuracy on the ImageNet classification benchmark.
arXiv Detail & Related papers (2024-06-13T05:17:21Z)
- Accelerating Transformers with Spectrum-Preserving Token Merging [43.463808781808645]
PiToMe prioritizes the preservation of informative tokens using an additional metric termed the energy score.
Experimental findings demonstrate that PiToMe saves 40-60% of the FLOPs of the base models.
arXiv Detail & Related papers (2024-05-25T09:37:01Z)
- No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling [55.203866875294516]
Vision Transformers (ViTs) have demonstrated outstanding performance in computer vision tasks.
Various token pruning techniques have been introduced to alleviate the high computational burden of ViTs.
We propose IdleViT, a dynamic token-idle-based method that achieves an excellent trade-off between performance and efficiency.
arXiv Detail & Related papers (2023-10-09T12:10:41Z)
- Revisiting Token Pruning for Object Detection and Instance Segmentation [25.3324628669201]
We investigate token pruning to accelerate inference for object detection and instance segmentation.
We show a reduction in performance decline from 1.5 mAP to 0.3 mAP in both boxes and masks, compared to existing token pruning methods.
arXiv Detail & Related papers (2023-06-12T11:55:33Z)
- Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers [34.19166698049552]
Vision Transformers (ViTs) have shown competitive performance advantages over convolutional neural networks (CNNs).
We propose a novel approach to learning instance-dependent attention patterns by devising a lightweight connectivity predictor module.
We show that our method reduces the FLOPs of MHSA by 48% to 69% while keeping the accuracy drop within 0.4%.
arXiv Detail & Related papers (2023-03-24T02:12:28Z)
- Making Vision Transformers Efficient from A Token Sparsification View [26.42498120556985]
We propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers.
Our method can achieve competitive results compared to the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for the backbone.
In addition, we design an STViT-R(ecover) network to restore detailed spatial information based on STViT, making it work for downstream tasks.
arXiv Detail & Related papers (2023-03-15T15:12:36Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
We adopt transformers in this work and incorporate them into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input.
By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-37% and improves the throughput by over 40%.
DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet.
arXiv Detail & Related papers (2021-06-03T17:57:41Z)
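For comparison with the clustering-and-merging strategy above, the input-dependent pruning summarized for DynamicViT can be sketched as follows. This is a minimal sketch, assuming a lightweight scoring MLP, a fixed keep ratio, and simple top-k selection at inference; the class-token handling and the differentiable training of the keep mask described in the paper are simplified or omitted here.
```python
import torch
import torch.nn as nn


class TokenPruner(nn.Module):
    """Illustrative input-dependent token pruning: score tokens, keep the top-k."""

    def __init__(self, dim, keep_ratio=0.34):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Lightweight predictor over concatenated local and global features.
        self.scorer = nn.Sequential(
            nn.Linear(dim * 2, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, 1),
        )

    def forward(self, x):
        """x: (B, 1 + N, C) with a class token at index 0."""
        cls_tok, patches = x[:, :1], x[:, 1:]
        B, N, C = patches.shape
        global_feat = patches.mean(dim=1, keepdim=True).expand(-1, N, -1)
        scores = self.scorer(torch.cat([patches, global_feat], dim=-1)).squeeze(-1)  # (B, N)
        num_keep = max(1, int(N * self.keep_ratio))
        keep_idx = scores.topk(num_keep, dim=-1).indices                              # (B, K)
        kept = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, C))
        return torch.cat([cls_tok, kept], dim=1)                                      # (B, 1 + K, C)


if __name__ == "__main__":
    pruner = TokenPruner(dim=384, keep_ratio=0.34)   # roughly matching "66% pruned"
    out = pruner(torch.randn(2, 197, 384))
    print(out.shape)                                 # torch.Size([2, 67, 384])
```
The main contrast with the AiluRus sketch is that pruned tokens are discarded outright rather than merged into representatives.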