Make A Long Image Short: Adaptive Token Length for Vision Transformers
- URL: http://arxiv.org/abs/2307.02092v1
- Date: Wed, 5 Jul 2023 08:10:17 GMT
- Title: Make A Long Image Short: Adaptive Token Length for Vision Transformers
- Authors: Qiqi Zhou and Yichen Zhu
- Abstract summary: We propose an innovative approach to accelerate the ViT model by shortening long images.
Specifically, we introduce a method for adaptively assigning token length for each image at test time to accelerate inference speed.
- Score: 5.723085628967456
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The vision transformer is a model that breaks down each image into a sequence
of tokens with a fixed length and processes them similarly to words in natural
language processing. Although increasing the number of tokens typically results
in better performance, it also leads to a considerable increase in
computational cost. Motivated by the saying "A picture is worth a thousand
words," we propose an innovative approach to accelerate the ViT model by
shortening long images. Specifically, we introduce a method for adaptively
assigning token length for each image at test time to accelerate inference
speed. First, we train a Resizable-ViT (ReViT) model capable of processing
input with diverse token lengths. Next, we extract token-length labels from
ReViT that indicate the minimum number of tokens required to achieve accurate
predictions. We then use these labels to train a lightweight Token-Length
Assigner (TLA) that allocates the optimal token length for each image during
inference. The TLA enables ReViT to process images with the minimum sufficient
number of tokens, reducing token numbers in the ViT model and improving
inference speed. Our approach is general and compatible with modern vision
transformer architectures, significantly reducing computational costs. We
verified the effectiveness of our methods on multiple representative ViT models
on image classification and action recognition.
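The pipeline described in the abstract can be pictured with a short sketch. Everything below is illustrative only: the class names (ReViT, TokenLengthAssigner), the candidate token lengths, the toy backbone, and the label-extraction rule are assumptions made for exposition, not the authors' released code.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Candidate token lengths the Resizable-ViT is assumed to support
# (14x14, 10x10, and 7x7 patch grids).
CANDIDATE_TOKEN_LENGTHS = [196, 100, 49]


class ReViT(nn.Module):
    """Toy stand-in for a Resizable-ViT: it resizes the input so the patch grid
    matches the requested token length (positional embeddings and other details
    of the real model are omitted)."""

    def __init__(self, dim=192, num_classes=1000, patch=16):
        super().__init__()
        self.patch = patch
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x, token_length=196):
        side = int(token_length ** 0.5)                    # e.g. 196 -> 14x14 grid
        x = F.interpolate(x, size=(side * self.patch,) * 2,
                          mode="bilinear", align_corners=False)
        tokens = self.embed(x).flatten(2).transpose(1, 2)  # (B, token_length, dim)
        return self.head(self.encoder(tokens).mean(dim=1))


class TokenLengthAssigner(nn.Module):
    """Lightweight classifier that predicts which candidate token length to use."""

    def __init__(self, num_choices):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, num_choices),
        )

    def forward(self, x):
        # Operate on a cheap low-resolution thumbnail of the image.
        thumb = F.interpolate(x, size=(56, 56), mode="bilinear", align_corners=False)
        return self.net(thumb)


@torch.no_grad()
def token_length_label(image, label, revit):
    """Smallest candidate length with which ReViT still predicts correctly;
    this is the target used to train the TLA (the fallback rule is an assumption)."""
    for token_len in sorted(CANDIDATE_TOKEN_LENGTHS):
        pred = revit(image, token_length=token_len).argmax(dim=-1)
        if int(pred[0]) == int(label):
            return token_len
    return max(CANDIDATE_TOKEN_LENGTHS)


@torch.no_grad()
def adaptive_inference(image, tla, revit):
    """At test time the TLA picks a token length, then ReViT runs with it."""
    choice = int(tla(image).argmax(dim=-1)[0])
    return revit(image, token_length=CANDIDATE_TOKEN_LENGTHS[choice])


if __name__ == "__main__":
    image = torch.randn(1, 3, 224, 224)
    revit, tla = ReViT(), TokenLengthAssigner(len(CANDIDATE_TOKEN_LENGTHS))
    print(adaptive_inference(image, tla, revit).shape)  # torch.Size([1, 1000])
```
In this sketch a single set of ReViT weights serves every candidate length, mirroring the abstract's description of one model processing inputs with diverse token lengths.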
Related papers
- ImageFolder: Autoregressive Image Generation with Folded Tokens [51.815319504939396]
Increasing token length is a common approach to improve the image reconstruction quality.
There exists a trade-off between reconstruction and generation quality regarding token length.
We propose ImageFolder, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling.
arXiv Detail & Related papers (2024-10-02T17:06:39Z) - SA$^2$VP: Spatially Aligned-and-Adapted Visual Prompt [59.280491260635266]
Methods for visual prompt tuning follow the sequential modeling paradigm stemming from NLP.
The SA$^2$VP model learns a two-dimensional prompt token map with equal (or scaled) size to the image token map.
Our model can conduct individual prompting for different image tokens in a fine-grained manner.
arXiv Detail & Related papers (2023-12-16T08:23:43Z) - CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z) - Multi-Tailed Vision Transformer for Efficient Inference [44.43126137573205]
Vision Transformer (ViT) has achieved promising performance in image recognition.
We propose a Multi-Tailed Vision Transformer (MT-ViT) in the paper.
MT-ViT adopts multiple tails to produce visual sequences of different lengths for the following Transformer encoder.
arXiv Detail & Related papers (2022-03-03T09:30:55Z) - Not All Patches are What You Need: Expediting Vision Transformers via
Token Reorganizations [37.11387992603467]
Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head self-attention (MHSA) among them.
Not all tokens contribute equally to the prediction; examples include tokens containing semantically meaningless or distractive image backgrounds.
We propose to reorganize image tokens during the feed-forward process of ViT models, which is integrated into ViT during training.
arXiv Detail & Related papers (2022-02-16T00:19:42Z) - Make A Long Image Short: Adaptive Token Length for Vision Transformers [17.21663067385715]
Vision transformer splits each image into a sequence of tokens with fixed length and processes the tokens in the same way as words in natural language processing.
We propose a novel approach to assign token length adaptively during inference.
arXiv Detail & Related papers (2021-12-03T02:48:51Z) - DynamicViT: Efficient Vision Transformers with Dynamic Token
Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input (a rough sketch of this idea appears after this list).
By hierarchically pruning 66% of the input tokens, our method greatly reduces FLOPs by 31%-37% and improves the throughput by over 40%.
DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet.
arXiv Detail & Related papers (2021-06-03T17:57:41Z) - Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with
Adaptive Sequence Length [40.35853878334764]
Vision Transformers (ViT) have achieved remarkable success in large-scale image recognition.
To achieve a decent trade-off between accuracy and speed, the number of tokens is empirically set to 16x16.
We propose a Dynamic Transformer to automatically configure a proper number of tokens for each input image.
arXiv Detail & Related papers (2021-05-31T16:04:10Z) - Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length (a minimal pooling sketch also appears after this list).
It brings a great benefit by scaling dimensions of depth/width/resolution/patch size without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z) - Tokens-to-Token ViT: Training Vision Transformers from Scratch on
ImageNet [128.96032932640364]
We propose a new Tokens-To-Token Vision Transformer (T2T-ViT) to solve vision tasks.
T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 2.5% improvement when trained from scratch on ImageNet.
For example, a T2T-ViT of size comparable to ResNet50 can achieve 80.7% top-1 accuracy on ImageNet.
arXiv Detail & Related papers (2021-01-28T13:25:28Z)
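For contrast with assigning a shorter token length up front, the DynamicViT entry above prunes tokens inside the network. Below is a minimal sketch of input-dependent top-k token pruning; the scoring head and the keep ratio are illustrative assumptions, and the actual method learns differentiable masks during training rather than a hard top-k.
```python
import torch
import torch.nn as nn


class TokenPruner(nn.Module):
    """Scores tokens and keeps only the highest-scoring fraction, per input."""

    def __init__(self, dim: int, keep_ratio: float = 0.34):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # per-token keep score (illustrative)
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim); keep the top keep_ratio * N tokens by score.
        scores = self.scorer(tokens).squeeze(-1)                 # (B, N)
        k = max(1, int(tokens.size(1) * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices                      # (B, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(1, idx)                             # (B, k, dim)


# Example: prune 196 tokens down to 66 before the remaining encoder blocks.
tokens = torch.randn(2, 196, 192)
pruned = TokenPruner(dim=192)(tokens)
print(pruned.shape)  # torch.Size([2, 66, 192])
```
At a keep ratio of roughly one third, 196 tokens shrink to 66, echoing the 66% pruning figure quoted in the entry above.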
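The Hierarchical Visual Transformer entry shrinks the sequence in yet another way, by pooling tokens between stages. A minimal illustration follows; the pooling operator and schedule here are assumptions and do not reproduce HVT's actual design.
```python
import torch
import torch.nn as nn


class TokenPool(nn.Module):
    """Halves the token sequence between transformer stages via 1D max pooling."""

    def __init__(self):
        super().__init__()
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) -> (B, N // 2, dim)
        return self.pool(tokens.transpose(1, 2)).transpose(1, 2)


tokens = torch.randn(2, 196, 192)
print(TokenPool()(tokens).shape)  # torch.Size([2, 98, 192])
```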
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.