Make A Long Image Short: Adaptive Token Length for Vision Transformers
- URL: http://arxiv.org/abs/2112.01686v2
- Date: Mon, 6 Dec 2021 03:24:53 GMT
- Title: Make A Long Image Short: Adaptive Token Length for Vision Transformers
- Authors: Yichen Zhu, Yuqin Zhu, Jie Du, Yi Wang, Zhicai Ou, Feifei Feng and
Jian Tang
- Abstract summary: The vision transformer splits each image into a fixed-length sequence of tokens and processes them in the same way as words in natural language processing.
We propose a novel approach to assign token length adaptively during inference.
- Score: 17.21663067385715
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The vision transformer splits each image into a fixed-length sequence
of tokens and processes the tokens in the same way as words in natural language
processing. More tokens normally lead to better performance but at a considerably
higher computational cost. Motivated by the proverb "A picture is worth a
thousand words," we aim to accelerate the ViT model by making a long image
short. To this end, we propose a novel approach to assign token length
adaptively during inference. Specifically, we first train a ViT model, called
Resizable-ViT (ReViT), that can process any given input with diverse token
lengths. Then, we retrieve the "token-length label" from ReViT and use it to
train a lightweight Token-Length Assigner (TLA). A token-length label is the
smallest number of tokens into which an image can be split while ReViT still
makes the correct prediction, and the TLA learns to allocate the optimal token
length based on these labels. The TLA enables ReViT to process each image with
the minimum sufficient number of tokens during inference, so inference is
accelerated by reducing the number of tokens in the ViT model. Our approach is
general and compatible with modern vision transformer architectures and can
significantly reduce computational expense. We verified the effectiveness of
our method on multiple representative ViT models (DeiT, LV-ViT, and
TimeSformer) across two tasks (image classification and action recognition).
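To make the two-stage recipe above concrete, here is a minimal sketch in PyTorch. The `revit(image, num_tokens=...)` call signature, the candidate token lengths, the toy TLA architecture, and the cross-entropy training loop are all illustrative assumptions made for this sketch, not details taken from the paper.

```python
# Minimal sketch of the ReViT + Token-Length Assigner (TLA) pipeline from the
# abstract. The ReViT call signature, the candidate token lengths, and the toy
# TLA architecture below are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

TOKEN_LENGTHS = [196, 100, 49]  # assumed candidates, ordered longest -> shortest


class TokenLengthAssigner(nn.Module):
    """Lightweight classifier that picks one of the candidate token lengths."""

    def __init__(self, num_choices: int = len(TOKEN_LENGTHS)):
        super().__init__()
        self.net = nn.Sequential(
            nn.AdaptiveAvgPool2d(8),        # coarse spatial summary of the image
            nn.Flatten(),
            nn.Linear(3 * 8 * 8, 64),
            nn.ReLU(),
            nn.Linear(64, num_choices),     # logits over TOKEN_LENGTHS
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.net(images)


@torch.no_grad()
def token_length_label(revit: nn.Module, image: torch.Tensor, target: int) -> int:
    """Index of the smallest token length at which ReViT still predicts `target`.

    Mirrors the "token-length label" definition in the abstract; falls back to
    the longest length when no shorter one yields a correct prediction.
    """
    for idx in reversed(range(len(TOKEN_LENGTHS))):   # shortest -> longest
        logits = revit(image.unsqueeze(0), num_tokens=TOKEN_LENGTHS[idx])
        if logits.argmax(dim=-1).item() == target:
            return idx
    return 0                                          # longest token length


def train_tla(tla: TokenLengthAssigner, images: torch.Tensor,
              labels: torch.Tensor, epochs: int = 10) -> None:
    """Fit the TLA on (image, token-length label) pairs with cross-entropy."""
    opt = torch.optim.Adam(tla.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(tla(images), labels)
        loss.backward()
        opt.step()


@torch.no_grad()
def adaptive_inference(revit: nn.Module, tla: TokenLengthAssigner,
                       image: torch.Tensor) -> torch.Tensor:
    """At test time the TLA picks a token length, then ReViT runs once with it."""
    choice = tla(image.unsqueeze(0)).argmax(dim=-1).item()
    return revit(image.unsqueeze(0), num_tokens=TOKEN_LENGTHS[choice])
```

In this sketch, token-length labels would be collected by running token_length_label over the training set with a frozen, already-trained ReViT; adaptive_inference then costs one cheap TLA pass plus a single ReViT pass at the chosen token length. The paper only states that the TLA is learned from these labels, so the plain cross-entropy objective here is an assumption.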
Related papers
- VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation [18.9885501527331]
Vision Transformers (ViTs) have emerged as the backbone of many segmentation models, consistently achieving state-of-the-art (SOTA) performance.
Image token pruning is one of the most effective strategies to address the computational complexity of ViTs.
This work introduces Vision-Language Guided Token Pruning (VLTP), a novel token pruning mechanism that can accelerate ViT-based segmentation models.
arXiv Detail & Related papers (2024-09-13T01:30:24Z)
- Matryoshka Query Transformer for Large Vision-Language Models [103.84600181927884]
We introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into a flexible number m of visual tokens during inference.
We train a single model once and can flexibly and drastically reduce the number of inference-time visual tokens.
Our model, MQT-LLaVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA's fixed 576.
arXiv Detail & Related papers (2024-05-29T17:39:42Z)
- No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling [55.203866875294516]
Vision Transformers (ViTs) have demonstrated outstanding performance in computer vision tasks.
Various token pruning techniques have been introduced to alleviate the high computational burden of ViTs.
We propose IdleViT, a dynamic token-idle-based method that achieves an excellent trade-off between performance and efficiency.
arXiv Detail & Related papers (2023-10-09T12:10:41Z)
- Make A Long Image Short: Adaptive Token Length for Vision Transformers [5.723085628967456]
We propose an innovative approach to accelerate the ViT model by shortening long images.
Specifically, we introduce a method for adaptively assigning token length for each image at test time to accelerate inference speed.
arXiv Detail & Related papers (2023-07-05T08:10:17Z)
- Multi-Tailed Vision Transformer for Efficient Inference [44.43126137573205]
Vision Transformer (ViT) has achieved promising performance in image recognition.
In this paper, we propose a Multi-Tailed Vision Transformer (MT-ViT).
MT-ViT adopts multiple tails to produce visual sequences of different lengths for the following Transformer encoder.
arXiv Detail & Related papers (2022-03-03T09:30:55Z)
- Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations [37.11387992603467]
Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head self-attention (MHSA) among them.
Not all tokens are equally informative; examples include tokens containing semantically meaningless or distracting image backgrounds.
We propose to reorganize image tokens during the feed-forward process of ViT models, and this reorganization is integrated into ViT training.
arXiv Detail & Related papers (2022-02-16T00:19:42Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
- On Improving Adversarial Transferability of Vision Transformers [97.17154635766578]
Vision transformers (ViTs) process input images as sequences of patches via self-attention.
We study the adversarial feature space of ViT models and their transferability.
We introduce two novel strategies specific to the architecture of ViT models.
arXiv Detail & Related papers (2021-06-08T08:20:38Z)
- Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
This brings a great benefit: depth, width, resolution, and patch size can be scaled without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z)
- Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [128.96032932640364]
We propose a new Tokens-To-Token Vision Transformer (T2T-ViT) to solve vision tasks.
T2T-ViT roughly halves the parameter count and MACs of vanilla ViT, while achieving more than 2.5% improvement when trained from scratch on ImageNet.
For example, a T2T-ViT comparable in size to ResNet50 can achieve 80.7% top-1 accuracy on ImageNet.
arXiv Detail & Related papers (2021-01-28T13:25:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.