Make A Long Image Short: Adaptive Token Length for Vision Transformers
- URL: http://arxiv.org/abs/2307.02092v1
- Date: Wed, 5 Jul 2023 08:10:17 GMT
- Title: Make A Long Image Short: Adaptive Token Length for Vision Transformers
- Authors: Qiqi Zhou and Yichen Zhu
- Abstract summary: We propose an innovative approach to accelerate the ViT model by shortening long images.
Specifically, we introduce a method for adaptively assigning token length for each image at test time to accelerate inference speed.
- Score: 5.723085628967456
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The vision transformer is a model that breaks down each image into a sequence
of tokens with a fixed length and processes them similarly to words in natural
language processing. Although increasing the number of tokens typically results
in better performance, it also leads to a considerable increase in
computational cost. Motivated by the saying "A picture is worth a thousand
words," we propose an innovative approach to accelerate the ViT model by
shortening long images. Specifically, we introduce a method for adaptively
assigning token length for each image at test time to accelerate inference
speed. First, we train a Resizable-ViT (ReViT) model capable of processing
input with diverse token lengths. Next, we extract token-length labels from
ReViT that indicate the minimum number of tokens required to achieve accurate
predictions. We then use these labels to train a lightweight Token-Length
Assigner (TLA) that allocates the optimal token length for each image during
inference. The TLA enables ReViT to process images with the minimum sufficient
number of tokens, reducing token numbers in the ViT model and improving
inference speed. Our approach is general and compatible with modern vision
transformer architectures, significantly reducing computational costs. We
verified the effectiveness of our methods on multiple representative ViT models
on image classification and action recognition.
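The pipeline described in the abstract can be pictured with a short sketch. Everything below is illustrative only: the class names (ReViT, TokenLengthAssigner), the candidate token lengths, the toy backbone, and the label-extraction rule are assumptions made for exposition, not the authors' released code.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Candidate token lengths the Resizable-ViT is assumed to support
# (14x14, 10x10, and 7x7 patch grids).
CANDIDATE_TOKEN_LENGTHS = [196, 100, 49]


class ReViT(nn.Module):
    """Toy stand-in for a Resizable-ViT: it resizes the input so the patch grid
    matches the requested token length (positional embeddings and other details
    of the real model are omitted)."""

    def __init__(self, dim=192, num_classes=1000, patch=16):
        super().__init__()
        self.patch = patch
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x, token_length=196):
        side = int(token_length ** 0.5)                    # e.g. 196 -> 14x14 grid
        x = F.interpolate(x, size=(side * self.patch,) * 2,
                          mode="bilinear", align_corners=False)
        tokens = self.embed(x).flatten(2).transpose(1, 2)  # (B, token_length, dim)
        return self.head(self.encoder(tokens).mean(dim=1))


class TokenLengthAssigner(nn.Module):
    """Lightweight classifier that predicts which candidate token length to use."""

    def __init__(self, num_choices):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, num_choices),
        )

    def forward(self, x):
        # Operate on a cheap low-resolution thumbnail of the image.
        thumb = F.interpolate(x, size=(56, 56), mode="bilinear", align_corners=False)
        return self.net(thumb)


@torch.no_grad()
def token_length_label(image, label, revit):
    """Smallest candidate length with which ReViT still predicts correctly;
    this is the target used to train the TLA (the fallback rule is an assumption)."""
    for token_len in sorted(CANDIDATE_TOKEN_LENGTHS):
        pred = revit(image, token_length=token_len).argmax(dim=-1)
        if int(pred[0]) == int(label):
            return token_len
    return max(CANDIDATE_TOKEN_LENGTHS)


@torch.no_grad()
def adaptive_inference(image, tla, revit):
    """At test time the TLA picks a token length, then ReViT runs with it."""
    choice = int(tla(image).argmax(dim=-1)[0])
    return revit(image, token_length=CANDIDATE_TOKEN_LENGTHS[choice])


if __name__ == "__main__":
    image = torch.randn(1, 3, 224, 224)
    revit, tla = ReViT(), TokenLengthAssigner(len(CANDIDATE_TOKEN_LENGTHS))
    print(adaptive_inference(image, tla, revit).shape)  # torch.Size([1, 1000])
```
In this sketch a single set of ReViT weights serves every candidate length, mirroring the abstract's description of one model processing inputs with diverse token lengths.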
Related papers
- ImageFolder: Autoregressive Image Generation with Folded Tokens [51.815319504939396]
Increasing token length is a common approach to improve the image reconstruction quality.
There exists a trade-off between reconstruction and generation quality regarding token length.
We propose ImageFolder, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling.
arXiv Detail & Related papers (2024-10-02T17:06:39Z) - SA$^2$VP: Spatially Aligned-and-Adapted Visual Prompt [59.280491260635266]
Methods for visual prompt tuning follow the sequential modeling paradigm stemming from NLP.
The SA$^2$VP model learns a two-dimensional prompt token map with equal (or scaled) size to the image token map.
Our model can conduct individual prompting for different image tokens in a fine-grained manner.
arXiv Detail & Related papers (2023-12-16T08:23:43Z) - CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z) - Multi-Tailed Vision Transformer for Efficient Inference [44.43126137573205]
Vision Transformer (ViT) has achieved promising performance in image recognition.
We propose a Multi-Tailed Vision Transformer (MT-ViT) in the paper.
MT-ViT adopts multiple tails to produce visual sequences of different lengths for the following Transformer encoder.
arXiv Detail & Related papers (2022-03-03T09:30:55Z) - Not All Patches are What You Need: Expediting Vision Transformers via
Token Reorganizations [37.11387992603467]
Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head self-attention (MHSA) among them.
Not all tokens contribute equally to the prediction; examples include tokens containing semantically meaningless or distractive image backgrounds.
We propose to reorganize image tokens during the feed-forward process of ViT models, which is integrated into ViT during training.
arXiv Detail & Related papers (2022-02-16T00:19:42Z) - Make A Long Image Short: Adaptive Token Length for Vision Transformers [17.21663067385715]
Vision transformer splits each image into a sequence of tokens with fixed length and processes the tokens in the same way as words in natural language processing.
We propose a novel approach to assign token length adaptively during inference.
arXiv Detail & Related papers (2021-12-03T02:48:51Z) - DynamicViT: Efficient Vision Transformers with Dynamic Token
Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input (a rough sketch of this idea appears after this list).
By hierarchically pruning 66% of the input tokens, our method greatly reduces FLOPs by 31%-37% and improves the throughput by over 40%.
DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet.
arXiv Detail & Related papers (2021-06-03T17:57:41Z) - Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with
Adaptive Sequence Length [40.35853878334764]
Vision Transformers (ViT) have achieved remarkable success in large-scale image recognition.
To achieve a decent trade-off between accuracy and speed, the number of tokens is empirically set to 16x16.
We propose a Dynamic Transformer to automatically configure a proper number of tokens for each input image.
arXiv Detail & Related papers (2021-05-31T16:04:10Z) - Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length (a minimal pooling sketch also appears after this list).
It brings a great benefit by scaling dimensions of depth/width/resolution/patch size without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z) - Tokens-to-Token ViT: Training Vision Transformers from Scratch on
ImageNet [128.96032932640364]
We propose a new Tokens-To-Token Vision Transformer (T2T-ViT) to solve vision tasks.
T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 2.5% improvement when trained from scratch on ImageNet.
For example, a T2T-ViT of size comparable to ResNet50 can achieve 80.7% top-1 accuracy on ImageNet.
arXiv Detail & Related papers (2021-01-28T13:25:28Z)
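For contrast with assigning a shorter token length up front, the DynamicViT entry above prunes tokens inside the network. Below is a minimal sketch of input-dependent top-k token pruning; the scoring head and the keep ratio are illustrative assumptions, and the actual method learns differentiable masks during training rather than a hard top-k.
```python
import torch
import torch.nn as nn


class TokenPruner(nn.Module):
    """Scores tokens and keeps only the highest-scoring fraction, per input."""

    def __init__(self, dim: int, keep_ratio: float = 0.34):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # per-token keep score (illustrative)
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim); keep the top keep_ratio * N tokens by score.
        scores = self.scorer(tokens).squeeze(-1)                 # (B, N)
        k = max(1, int(tokens.size(1) * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices                      # (B, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(1, idx)                             # (B, k, dim)


# Example: prune 196 tokens down to 66 before the remaining encoder blocks.
tokens = torch.randn(2, 196, 192)
pruned = TokenPruner(dim=192)(tokens)
print(pruned.shape)  # torch.Size([2, 66, 192])
```
At a keep ratio of roughly one third, 196 tokens shrink to 66, echoing the 66% pruning figure quoted in the entry above.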
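The Hierarchical Visual Transformer entry shrinks the sequence in yet another way, by pooling tokens between stages. A minimal illustration follows; the pooling operator and schedule here are assumptions and do not reproduce HVT's actual design.
```python
import torch
import torch.nn as nn


class TokenPool(nn.Module):
    """Halves the token sequence between transformer stages via 1D max pooling."""

    def __init__(self):
        super().__init__()
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) -> (B, N // 2, dim)
        return self.pool(tokens.transpose(1, 2)).transpose(1, 2)


tokens = torch.randn(2, 196, 192)
print(TokenPool()(tokens).shape)  # torch.Size([2, 98, 192])
```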
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.