Dynamic Token-Pass Transformers for Semantic Segmentation
- URL: http://arxiv.org/abs/2308.01944v1
- Date: Thu, 3 Aug 2023 06:14:24 GMT
- Title: Dynamic Token-Pass Transformers for Semantic Segmentation
- Authors: Yuang Liu, Qiang Zhou, Jing Wang, Fan Wang, Jun Wang, Wei Zhang
- Abstract summary: We introduce dynamic token-pass vision transformers (DoViT) for semantic segmentation.
DoViT gradually stops the easier tokens from participating in the self-attention calculation and keeps forwarding the harder tokens until a stopping criterion is met.
Our method reduces FLOPs by about 40% $\sim$ 60% while keeping the mIoU drop within 0.8% for various segmentation transformers.
- Score: 22.673910995773262
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformers (ViT) usually extract features by forwarding all the tokens through the self-attention layers from the first layer to the last. In this paper, we introduce dynamic token-pass vision transformers (DoViT) for semantic segmentation, which can adaptively reduce the inference cost for images of different complexity. DoViT gradually stops the easier tokens from participating in the self-attention calculation and keeps forwarding the harder tokens until a stopping criterion is met. We employ lightweight auxiliary heads to make the token-pass decision and divide the tokens into kept and stopped parts. By computing the kept tokens separately, the self-attention layers are sped up by the sparser token set while remaining hardware-friendly. A token reconstruction module collects the grouped tokens and resets them to their original positions in the sequence, which is necessary to predict correct semantic masks. We conduct extensive experiments on two common semantic segmentation tasks and demonstrate that our method reduces FLOPs by about 40% $\sim$ 60% while the mIoU drop stays within 0.8% for various segmentation transformers. The throughput and inference speed of ViT-L/B are increased by more than 2$\times$ on Cityscapes.
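The abstract describes three moving parts: a lightweight auxiliary head that makes the token-pass (keep/stop) decision, self-attention computed only over the kept tokens, and a reconstruction module that returns tokens to their original positions in the sequence. Below is a minimal PyTorch sketch of that general flow; the single-linear scoring head, the sigmoid/0.5 stopping rule, and the batch-averaged keep mask are illustrative assumptions, not DoViT's actual design.

```python
import torch
import torch.nn as nn

class TokenPassLayer(nn.Module):
    """Illustrative token-pass transformer layer: self-attention runs only over
    the still-active ("hard") tokens, updated tokens are scattered back to their
    original positions, and a lightweight auxiliary head decides which tokens stop."""

    def __init__(self, dim, num_heads=8, threshold=0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.aux_head = nn.Linear(dim, 1)  # assumed form of the lightweight decision head
        self.threshold = threshold

    def forward(self, tokens, active_idx):
        # tokens:     (B, N, C) full token sequence, kept around for reconstruction
        # active_idx: (M,) indices of tokens that are still forwarded
        active = tokens[:, active_idx, :]                # gather the hard tokens
        attn_out, _ = self.attn(active, active, active)  # sparse self-attention on kept tokens only
        active = active + attn_out

        # Token reconstruction: write updated tokens back to their original positions
        tokens = tokens.clone()
        tokens[:, active_idx, :] = active

        # Token-pass decision: tokens judged "easy" stop here, the rest keep forwarding
        halt = torch.sigmoid(self.aux_head(active)).squeeze(-1)  # (B, M) stop probabilities
        keep = halt.mean(dim=0) < self.threshold                 # simple batch-averaged rule (assumption)
        return tokens, active_idx[keep]


# Usage: the active token set shrinks as depth increases
layers = nn.ModuleList([TokenPassLayer(dim=256) for _ in range(4)])
x = torch.randn(2, 1024, 256)          # (batch, tokens, channels)
idx = torch.arange(1024)
for layer in layers:
    if idx.numel() == 0:               # all tokens stopped early
        break
    x, idx = layer(x, idx)
print("tokens still active after the stack:", idx.numel())
```

The sketch only shows how a shrinking active index set keeps the attention cost proportional to the number of hard tokens; the actual stopping criterion, the placement of the auxiliary heads, and the treatment of stopped tokens would follow the paper.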
Related papers
- LookupViT: Compressing visual information to a limited number of tokens [36.83826969693139]
Vision Transformers (ViT) have emerged as the de-facto choice for numerous industry grade vision solutions.
But their inference cost can be prohibitive for many settings, as they compute self-attention in each layer, which suffers from quadratic complexity in the number of tokens.
In this work, we introduce LookupViT, which exploits this information sparsity to reduce ViT inference cost.
arXiv Detail & Related papers (2024-07-17T17:22:43Z) - LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation [37.72775203647514]
This paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information and improve inference speed.
By employing Dual Cross-Attention (DCA) in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT with various sizes.
Experimental results in classification and dense prediction tasks show that LeMeViT has a significant $1.7\times$ speedup, fewer parameters, and competitive performance compared to the baseline models.
arXiv Detail & Related papers (2024-05-16T03:26:06Z) - AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z) - Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation [18.168932826183024]
This work introduces a Dynamic Token Pruning (DToP) method based on the early exit of tokens for semantic segmentation.
Experiments suggest that the proposed DToP architecture reduces on average 20% $\sim$ 35% of the computational cost for current semantic segmentation methods.
arXiv Detail & Related papers (2023-08-02T09:40:02Z) - Making Vision Transformers Efficient from A Token Sparsification View [26.42498120556985]
We propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers.
Our method can achieve competitive results compared to the original networks in object detection and instance segmentation, with over a 30% FLOPs reduction for the backbone.
In addition, we design an STViT-R(ecover) network to restore the detailed spatial information on top of STViT, making it work for downstream tasks.
arXiv Detail & Related papers (2023-03-15T15:12:36Z) - Token-Label Alignment for Vision Transformers [93.58540411138164]
Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs).
We identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies.
We propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token.
arXiv Detail & Related papers (2022-10-12T17:54:32Z) - SegViT: Semantic Segmentation with Plain Vision Transformers [91.50075506561598]
We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation.
We propose the Attention-to-Mask (ATM) module, in which similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to the segmentation masks.
Experiments show that our proposed SegViT using the ATM module outperforms its counterparts using the plain ViT backbone.
arXiv Detail & Related papers (2022-10-12T00:30:26Z) - Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and outperform convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z) - Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer [63.99222215387881]
We propose Evo-ViT, a self-motivated slow-fast token evolution method for vision transformers.
Our method can significantly reduce the computational costs of vision transformers while maintaining comparable performance on image classification.
arXiv Detail & Related papers (2021-08-03T09:56:07Z) - DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input.
By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31% $\sim$ 37% and improves the throughput by over 40% (see the illustrative sketch after this list).
DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet.
arXiv Detail & Related papers (2021-06-03T17:57:41Z)
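Several of the papers above (DToP, Evo-ViT, DynamicViT) revolve around the same lever: score tokens with a small prediction module and drop the low-scoring ones. The sketch below illustrates the fixed-ratio, hierarchical variant in the spirit of DynamicViT; the LayerNorm+Linear scorer, the 0.7 keep ratio, and the three-stage schedule are assumptions for illustration rather than the paper's exact prediction module.

```python
import torch
import torch.nn as nn

class FixedRatioPruning(nn.Module):
    """Illustrative token-pruning stage: score tokens with a small MLP and keep
    only a fixed fraction of the highest-scoring ones (a sketch of the general
    idea, not the paper's exact prediction module)."""

    def __init__(self, dim, keep_ratio=0.7):
        super().__init__()
        self.score = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))
        self.keep_ratio = keep_ratio

    def forward(self, x):
        # x: (B, N, C) token sequence
        scores = self.score(x).squeeze(-1)                   # (B, N) per-token importance
        num_keep = max(1, int(x.shape[1] * self.keep_ratio))
        top_idx = scores.topk(num_keep, dim=1).indices       # kept token indices per sample
        top_idx = top_idx.unsqueeze(-1).expand(-1, -1, x.shape[-1])
        return torch.gather(x, 1, top_idx)                   # (B, num_keep, C)


# Applying the stage three times with keep_ratio 0.7 leaves roughly a third of
# the tokens, i.e. about 66% pruned, in line with the figure quoted above.
x = torch.randn(2, 196, 384)            # (batch, tokens, channels)
stage = FixedRatioPruning(dim=384)
for _ in range(3):
    x = stage(x)
print(x.shape)                          # torch.Size([2, 66, 384])
```

Compared with the DoViT sketch earlier, the pruning ratio here is fixed per stage rather than decided adaptively per image; in training, a differentiable selection mechanism would be needed, which this sketch omits.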
This list is automatically generated from the titles and abstracts of the papers in this site.