Expediting Large-Scale Vision Transformer for Dense Prediction without
Fine-tuning
- URL: http://arxiv.org/abs/2210.01035v1
- Date: Mon, 3 Oct 2022 15:49:48 GMT
- Title: Expediting Large-Scale Vision Transformer for Dense Prediction without
Fine-tuning
- Authors: Weicong Liang and Yuhui Yuan and Henghui Ding and Xiao Luo and Weihong
Lin and Ding Jia and Zheng Zhang and Chao Zhang and Han Hu
- Abstract summary: Many advanced approaches have been developed to reduce the total number of tokens in large-scale vision transformers.
We present two non-parametric operators, a token clustering layer to decrease the number of tokens and a token reconstruction layer to increase the number of tokens.
Results are promising on five dense prediction tasks, including object detection, semantic segmentation, panoptic segmentation, instance segmentation, and depth estimation.
- Score: 28.180891300826165
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformers have recently achieved competitive results across various
vision tasks but still suffer from heavy computation costs when processing a
large number of tokens. Many advanced approaches have been developed to reduce
the total number of tokens in large-scale vision transformers, especially for
image classification tasks. Typically, they select a small group of essential
tokens according to their relevance with the class token, then fine-tune the
weights of the vision transformer. Such fine-tuning is less practical for dense
prediction due to the much heavier computation and GPU memory cost than image
classification. In this paper, we focus on a more challenging problem, i.e.,
accelerating large-scale vision transformers for dense prediction without any
additional re-training or fine-tuning. In response to the fact that
high-resolution representations are necessary for dense prediction, we present
two non-parametric operators, a token clustering layer to decrease the number
of tokens and a token reconstruction layer to increase the number of tokens.
The following steps are performed to achieve this: (i) we use the token
clustering layer to cluster the neighboring tokens together, resulting in
low-resolution representations that maintain the spatial structures; (ii) we
apply the following transformer layers only to these low-resolution
representations or clustered tokens; and (iii) we use the token reconstruction
layer to re-create the high-resolution representations from the refined
low-resolution representations. The results obtained by our method are
promising on five dense prediction tasks, including object detection, semantic
segmentation, panoptic segmentation, instance segmentation, and depth
estimation.
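To make steps (i)-(iii) concrete, below is a minimal PyTorch-style sketch of how the two non-parametric operators could be realized. It assumes clustering via spatial average pooling of neighboring tokens and reconstruction via similarity-weighted interpolation; the paper's exact non-parametric formulation may differ, and the function names and the temperature parameter are illustrative only.

# Hedged sketch of the two non-parametric operators described in the abstract:
# a token clustering layer (fewer tokens, spatial structure preserved) and a
# token reconstruction layer (high-resolution tokens recovered from the refined
# low-resolution ones). Clustering is approximated here by average pooling over
# neighboring tokens; reconstruction by similarity-weighted interpolation.
import torch
import torch.nn.functional as F


def token_clustering(x, h, w, out_h, out_w):
    """Cluster neighboring tokens into a coarser grid (non-parametric).

    x: (B, N, C) tokens arranged on an h x w grid, with N = h * w.
    Returns (B, out_h * out_w, C) clustered (low-resolution) tokens.
    """
    B, N, C = x.shape
    grid = x.transpose(1, 2).reshape(B, C, h, w)
    clustered = F.adaptive_avg_pool2d(grid, (out_h, out_w))      # average neighboring tokens
    return clustered.flatten(2).transpose(1, 2)                  # (B, out_h * out_w, C)


def token_reconstruction(x_hi, z_lo_refined, z_lo, tau=0.05):
    """Re-create high-resolution tokens from refined low-resolution ones.

    x_hi:         (B, N, C) original high-resolution tokens (before clustering).
    z_lo_refined: (B, M, C) clustered tokens after the remaining transformer layers.
    z_lo:         (B, M, C) clustered tokens right after clustering.
    Each high-resolution token is rebuilt as a softmax-weighted sum of the
    refined clustered tokens, weighted by its similarity to the un-refined ones.
    """
    q = F.normalize(x_hi, dim=-1)
    k = F.normalize(z_lo, dim=-1)
    weights = torch.softmax(q @ k.transpose(1, 2) / tau, dim=-1)  # (B, N, M)
    return weights @ z_lo_refined                                 # (B, N, C)

Under these assumptions, a 64x64 token grid (4096 tokens) can be clustered down to 32x32 (1024 tokens), the remaining transformer layers can then be applied only to the clustered tokens, and the reconstruction layer restores the 4096-token feature map for the dense-prediction head, all without touching the pretrained weights.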
Related papers
- Enhancing 3D Transformer Segmentation Model for Medical Image with Token-level Representation Learning [9.896550384001348]
This work proposes a token-level representation learning loss that maximizes agreement between token embeddings from different augmented views individually.
We also invent a simple "rotate-and-restore" mechanism, which rotates and flips one augmented view of the input volume, and later restores the order of tokens in the feature maps.
We test our pre-training scheme on two public medical segmentation datasets, and the results on the downstream segmentation task show larger improvements from our method than from other state-of-the-art pre-training methods.
arXiv Detail & Related papers (2024-08-12T01:49:13Z)
- ToSA: Token Selective Attention for Efficient Vision Transformers [50.13756218204456]
ToSA is a token selective attention approach that can identify the tokens that need to be attended to as well as those that can skip a transformer layer.
We show that ToSA can significantly reduce computation costs while maintaining accuracy on the ImageNet classification benchmark.
arXiv Detail & Related papers (2024-06-13T05:17:21Z)
- Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation [18.168932826183024]
This work introduces a Dynamic Token Pruning (DToP) method based on the early exit of tokens for semantic segmentation.
Experiments suggest that the proposed DToP architecture reduces the computational cost of current semantic segmentation methods by 20%-35% on average.
arXiv Detail & Related papers (2023-08-02T09:40:02Z)
- Learned Thresholds Token Merging and Pruning for Vision Transformers [5.141687309207561]
This paper introduces Learned Thresholds token Merging and Pruning (LTMP), a novel approach that leverages the strengths of both token merging and token pruning.
We demonstrate our approach with extensive experiments on vision transformers on the ImageNet classification task.
arXiv Detail & Related papers (2023-07-20T11:30:12Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
In this work, we adopt transformers and incorporate them into a hierarchical framework for shape classification as well as part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- PSViT: Better Vision Transformer via Token Pooling and Attention Sharing [114.8051035856023]
We propose PSViT: a ViT with token Pooling and attention Sharing to reduce redundancy.
Experimental results show that the proposed scheme can achieve up to 6.6% accuracy improvement in ImageNet classification.
arXiv Detail & Related papers (2021-08-07T11:30:54Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
It brings a great benefit by scaling dimensions of depth/width/resolution/patch size without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z)