Make A Long Image Short: Adaptive Token Length for Vision Transformers
- URL: http://arxiv.org/abs/2112.01686v2
- Date: Mon, 6 Dec 2021 03:24:53 GMT
- Title: Make A Long Image Short: Adaptive Token Length for Vision Transformers
- Authors: Yichen Zhu, Yuqin Zhu, Jie Du, Yi Wang, Zhicai Ou, Feifei Feng and
Jian Tang
- Abstract summary: The vision transformer splits each image into a fixed-length sequence of tokens and processes them in the same way as words in natural language processing.
We propose a novel approach to assign token length adaptively during inference.
- Score: 17.21663067385715
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The vision transformer splits each image into a fixed-length sequence
of tokens and processes the tokens in the same way as words in natural language
processing. More tokens normally lead to better performance but at a considerably
higher computational cost. Motivated by the proverb "A picture is worth a
thousand words," we aim to accelerate the ViT model by making a long image
short. To this end, we propose a novel approach to assign token length
adaptively during inference. Specifically, we first train a ViT model, called
Resizable-ViT (ReViT), that can process any given input with diverse token
lengths. Then, we retrieve the "token-length label" from ReViT and use it to
train a lightweight Token-Length Assigner (TLA). A token-length label is the
smallest number of tokens into which an image can be split while ReViT still
makes the correct prediction, and the TLA learns to allocate the optimal token
length based on these labels. The TLA enables ReViT to process each image with
the minimum sufficient number of tokens during inference, so inference is
accelerated by reducing the number of tokens in the ViT model. Our approach is
general and compatible with modern vision transformer architectures and can
significantly reduce computational expense. We verified the effectiveness of
our method on multiple representative ViT models (DeiT, LV-ViT, and
TimeSformer) across two tasks (image classification and action recognition).
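To make the two-stage recipe above concrete, here is a minimal sketch in PyTorch. The `revit(image, num_tokens=...)` call signature, the candidate token lengths, the toy TLA architecture, and the cross-entropy training loop are all illustrative assumptions made for this sketch, not details taken from the paper.

```python
# Minimal sketch of the ReViT + Token-Length Assigner (TLA) pipeline from the
# abstract. The ReViT call signature, the candidate token lengths, and the toy
# TLA architecture below are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

TOKEN_LENGTHS = [196, 100, 49]  # assumed candidates, ordered longest -> shortest


class TokenLengthAssigner(nn.Module):
    """Lightweight classifier that picks one of the candidate token lengths."""

    def __init__(self, num_choices: int = len(TOKEN_LENGTHS)):
        super().__init__()
        self.net = nn.Sequential(
            nn.AdaptiveAvgPool2d(8),        # coarse spatial summary of the image
            nn.Flatten(),
            nn.Linear(3 * 8 * 8, 64),
            nn.ReLU(),
            nn.Linear(64, num_choices),     # logits over TOKEN_LENGTHS
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.net(images)


@torch.no_grad()
def token_length_label(revit: nn.Module, image: torch.Tensor, target: int) -> int:
    """Index of the smallest token length at which ReViT still predicts `target`.

    Mirrors the "token-length label" definition in the abstract; falls back to
    the longest length when no shorter one yields a correct prediction.
    """
    for idx in reversed(range(len(TOKEN_LENGTHS))):   # shortest -> longest
        logits = revit(image.unsqueeze(0), num_tokens=TOKEN_LENGTHS[idx])
        if logits.argmax(dim=-1).item() == target:
            return idx
    return 0                                          # longest token length


def train_tla(tla: TokenLengthAssigner, images: torch.Tensor,
              labels: torch.Tensor, epochs: int = 10) -> None:
    """Fit the TLA on (image, token-length label) pairs with cross-entropy."""
    opt = torch.optim.Adam(tla.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(tla(images), labels)
        loss.backward()
        opt.step()


@torch.no_grad()
def adaptive_inference(revit: nn.Module, tla: TokenLengthAssigner,
                       image: torch.Tensor) -> torch.Tensor:
    """At test time the TLA picks a token length, then ReViT runs once with it."""
    choice = tla(image.unsqueeze(0)).argmax(dim=-1).item()
    return revit(image.unsqueeze(0), num_tokens=TOKEN_LENGTHS[choice])
```

In this sketch, token-length labels would be collected by running token_length_label over the training set with a frozen, already-trained ReViT; adaptive_inference then costs one cheap TLA pass plus a single ReViT pass at the chosen token length. The paper only states that the TLA is learned from these labels, so the plain cross-entropy objective here is an assumption.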
Related papers
- VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation [18.9885501527331]
Vision Transformers (ViTs) have emerged as the backbone of many segmentation models, consistently achieving state-of-the-art (SOTA) performance.
Image token pruning is one of the most effective strategies to address the computational complexity of ViTs.
This work introduces Vision-Language Guided Token Pruning (VLTP), a novel token pruning mechanism that can accelerate ViT-based segmentation models.
arXiv Detail & Related papers (2024-09-13T01:30:24Z)
- Matryoshka Query Transformer for Large Vision-Language Models [103.84600181927884]
We introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into a flexible number m of visual tokens during inference.
We train a single model once and can flexibly and drastically reduce the number of inference-time visual tokens.
Our model, MQT-LLaVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA's fixed 576.
arXiv Detail & Related papers (2024-05-29T17:39:42Z)
- No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling [55.203866875294516]
Vision Transformers (ViTs) have demonstrated outstanding performance in computer vision tasks.
Various token pruning techniques have been introduced to alleviate the high computational burden of ViTs.
We propose IdleViT, a dynamic token-idle-based method that achieves an excellent trade-off between performance and efficiency.
arXiv Detail & Related papers (2023-10-09T12:10:41Z)
- Make A Long Image Short: Adaptive Token Length for Vision Transformers [5.723085628967456]
We propose an innovative approach to accelerate the ViT model by shortening long images.
Specifically, we introduce a method for adaptively assigning token length for each image at test time to accelerate inference speed.
arXiv Detail & Related papers (2023-07-05T08:10:17Z)
- Multi-Tailed Vision Transformer for Efficient Inference [44.43126137573205]
Vision Transformer (ViT) has achieved promising performance in image recognition.
In this paper, we propose a Multi-Tailed Vision Transformer (MT-ViT).
MT-ViT adopts multiple tails to produce visual sequences of different lengths for the following Transformer encoder.
arXiv Detail & Related papers (2022-03-03T09:30:55Z)
- Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations [37.11387992603467]
Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head self-attention (MHSA) among them.
Not all tokens are equally informative; examples include tokens containing semantically meaningless or distracting image backgrounds.
We propose to reorganize image tokens during the feed-forward process of ViT models, and this reorganization is integrated into ViT training.
arXiv Detail & Related papers (2022-02-16T00:19:42Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
- On Improving Adversarial Transferability of Vision Transformers [97.17154635766578]
Vision transformers (ViTs) process input images as sequences of patches via self-attention.
We study the adversarial feature space of ViT models and their transferability.
We introduce two novel strategies specific to the architecture of ViT models.
arXiv Detail & Related papers (2021-06-08T08:20:38Z)
- Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
This brings a great benefit: depth, width, resolution, and patch size can be scaled without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z)
- Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [128.96032932640364]
We propose a new Tokens-To-Token Vision Transformer (T2T-ViT) to solve vision tasks.
T2T-ViT roughly halves the parameter count and MACs of vanilla ViT, while achieving more than 2.5% improvement when trained from scratch on ImageNet.
For example, a T2T-ViT comparable in size to ResNet50 can achieve 80.7% top-1 accuracy on ImageNet.
arXiv Detail & Related papers (2021-01-28T13:25:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.