Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with
Adaptive Sequence Length
- URL: http://arxiv.org/abs/2105.15075v1
- Date: Mon, 31 May 2021 16:04:10 GMT
- Title: Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with
Adaptive Sequence Length
- Authors: Yulin Wang, Rui Huang, Shiji Song, Zeyi Huang, Gao Huang
- Abstract summary: Vision Transformers (ViT) have achieved remarkable success in large-scale image recognition.
To achieve a decent trade-off between accuracy and speed, the number of tokens is empirically set to 16x16.
We propose a Dynamic Transformer to automatically configure a proper number of tokens for each input image.
- Score: 40.35853878334764
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViT) have achieved remarkable success in large-scale
image recognition. They split every 2D image into a fixed number of patches,
each of which is treated as a token. Generally, representing an image with more
tokens would lead to higher prediction accuracy, while it also results in
drastically increased computational cost. To achieve a decent trade-off between
accuracy and speed, the number of tokens is empirically set to 16x16. In this
paper, we argue that every image has its own characteristics, and ideally the
token number should be conditioned on each individual input. In fact, we have
observed that there exist a considerable number of "easy" images that can be
accurately predicted with as few as 4x4 tokens, while only a small
fraction of "hard" ones need a finer representation. Inspired by this
phenomenon, we propose a Dynamic Transformer to automatically configure a
proper number of tokens for each input image. This is achieved by cascading
multiple Transformers with increasing numbers of tokens, which are sequentially
activated in an adaptive fashion at test time, i.e., the inference is
terminated once a sufficiently confident prediction is produced. We further
design efficient feature reuse and relationship reuse mechanisms across
different components of the Dynamic Transformer to reduce redundant
computations. Extensive empirical results on ImageNet, CIFAR-10, and CIFAR-100
demonstrate that our method significantly outperforms the competitive baselines
in terms of both theoretical computational efficiency and practical inference
speed.
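
Below is a minimal PyTorch-style sketch of the adaptive inference procedure described in the abstract: ViT classifiers with increasing token counts are tried in order, and inference stops as soon as the prediction confidence clears a threshold. The backbone list, threshold values, and function names are illustrative assumptions, and the paper's feature reuse and relationship reuse mechanisms are omitted.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def dynamic_vit_predict(image, backbones, thresholds):
    """Cascade of ViT classifiers with increasing token counts.

    image:      (C, H, W) tensor.
    backbones:  classifiers ordered by token count, e.g. 4x4, 7x7, 14x14 grids.
    thresholds: one confidence cutoff per backbone except the last,
                which always produces the final prediction.
    """
    x = image.unsqueeze(0)                       # add a batch dimension
    for backbone, tau in zip(backbones, list(thresholds) + [0.0]):
        probs = F.softmax(backbone(x), dim=-1)   # (1, num_classes)
        confidence, label = probs.max(dim=-1)
        if confidence.item() >= tau:             # early exit on "easy" images
            break
    return label.item(), confidence.item()
```

In the actual Dynamic Transformer, downstream models additionally reuse the features and attention relationships computed by upstream ones to avoid redundant computation, which this sketch leaves out.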
Related papers
- Make A Long Image Short: Adaptive Token Length for Vision Transformers [5.723085628967456]
We propose an innovative approach to accelerate the ViT model by shortening long images.
Specifically, we introduce a method that adaptively assigns a token length to each image at test time to accelerate inference.
arXiv Detail & Related papers (2023-07-05T08:10:17Z)
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
- Multi-Tailed Vision Transformer for Efficient Inference [44.43126137573205]
Vision Transformer (ViT) has achieved promising performance in image recognition.
We propose a Multi-Tailed Vision Transformer (MT-ViT) in the paper.
MT-ViT adopts multiple tails to produce visual sequences of different lengths for the following Transformer encoder.
arXiv Detail & Related papers (2022-03-03T09:30:55Z)
- XCiT: Cross-Covariance Image Transformers [73.33400159139708]
We propose a "transposed" version of self-attention that operates across feature channels rather than tokens.
The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens and allows efficient processing of high-resolution images (a minimal sketch of channel-wise attention appears after this list).
arXiv Detail & Related papers (2021-06-17T17:33:35Z)
- DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input.
By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-37% and improves the throughput by over 40%.
DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet (a rough sketch of input-dependent token pruning appears after this list).
arXiv Detail & Related papers (2021-06-03T17:57:41Z)
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder that drops the full softmax-weighted attention and keeps only the query-key similarity (a minimal sketch appears after this list).
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
- CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [17.709880544501758]
We propose a dual-branch transformer to combine image patches of different sizes to produce stronger image features.
Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity.
Our proposed cross-attention requires only linear computational and memory complexity, rather than the quadratic cost of full attention (a minimal sketch appears after this list).
arXiv Detail & Related papers (2021-03-27T13:03:17Z)
- Visual Transformers: Token-based Image Representation and Processing for Computer Vision [67.55770209540306]
The Visual Transformer (VT) operates in a semantic token space, judiciously attending to different image parts based on context.
Using an advanced training recipe, our VTs significantly outperform their convolutional counterparts.
For semantic segmentation on LIP and COCO-stuff, VT-based feature pyramid networks (FPN) achieve 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.5x.
arXiv Detail & Related papers (2020-06-05T20:49:49Z)
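
For the XCiT entry above, here is a rough sketch of the "transposed" attention idea, computed across feature channels rather than tokens, assuming (batch, tokens, channels) inputs; the per-head splitting and learnable temperature of the actual method are simplified to a scalar.

```python
import torch
import torch.nn.functional as F

def cross_covariance_attention(q, k, v, temperature=1.0):
    """q, k, v: (batch, tokens, channels). The attention map is
    channels x channels, so the cost grows linearly with the token count."""
    q = F.normalize(q, dim=1)                      # unit norm along the token axis
    k = F.normalize(k, dim=1)
    attn = (k.transpose(1, 2) @ q) / temperature   # (batch, C, C) cross-covariance
    attn = attn.softmax(dim=-1)
    return v @ attn                                # (batch, tokens, channels)
```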
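
For the DynamicViT entry, a rough sketch of input-dependent token pruning at inference time; the keep scores would come from a small prediction module (not shown), and the differentiable masking used during training in the paper is omitted.

```python
import torch

def prune_tokens(tokens, keep_scores, keep_ratio=0.34):
    """tokens: (batch, N, C); keep_scores: (batch, N), higher = more important.
    Keeps only the top-scoring fraction of tokens for the following layers."""
    num_keep = max(1, int(tokens.shape[1] * keep_ratio))
    idx = keep_scores.topk(num_keep, dim=1).indices               # (batch, num_keep)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])      # (batch, num_keep, C)
    return tokens.gather(dim=1, index=idx)
```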
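
For the image-matching entry, a minimal sketch of keeping only the raw query-key similarity between two images' token features, with no softmax weighting or value aggregation; the projection layers and how the similarity map is turned into a matching score are left out.

```python
import torch

def query_key_similarity(q_tokens, k_tokens):
    """q_tokens: (batch, N, C) from one image; k_tokens: (batch, M, C) from
    the other. Returns the raw pairwise similarity map used as the
    cross-image matching signal."""
    return q_tokens @ k_tokens.transpose(1, 2)     # (batch, N, M)
```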
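
For the CrossViT entry, a rough sketch of why the cross-attention is linear: only the CLS token of one branch queries the patch tokens of the other branch, so the attention map has a single row; the learned q/k/v projections and the projection between branch widths are omitted.

```python
import torch

def cls_cross_attention(cls_token, other_tokens):
    """cls_token: (batch, 1, C) from one branch; other_tokens: (batch, M, C)
    from the other branch. The attention map is (1, M), i.e. linear in M."""
    scale = cls_token.shape[-1] ** -0.5
    attn = (cls_token @ other_tokens.transpose(1, 2)) * scale   # (batch, 1, M)
    attn = attn.softmax(dim=-1)
    return attn @ other_tokens                                  # (batch, 1, C)
```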