DiT: Efficient Vision Transformers with Dynamic Token Routing
- URL: http://arxiv.org/abs/2308.03409v2
- Date: Fri, 11 Aug 2023 13:53:19 GMT
- Title: DiT: Efficient Vision Transformers with Dynamic Token Routing
- Authors: Yuchen Ma, Zhengcong Fei, Junshi Huang
- Abstract summary: We propose a data-dependent token routing strategy to elaborate the routing paths of image tokens for Dynamic Vision Transformer, dubbed DiT.
The proposed framework generates a data-dependent path per token, adapting to the object scales and visual discrimination of tokens.
In experiments, our DiT achieves superior performance and more favorable complexity/accuracy trade-offs than many SoTA methods on ImageNet classification, object detection, instance segmentation, and semantic segmentation.
- Score: 37.808078064528374
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In many recent dense networks, the tokens of an image share the same
static data flow. However, challenges arise from the variance among the objects
in images, such as large variations in spatial scale and the difficulty of
recognizing visual entities. In this paper, we propose a data-dependent
token routing strategy to elaborate the routing paths of image tokens for
Dynamic Vision Transformer, dubbed DiT. The proposed framework generates a
data-dependent path per token, adapting to the object scales and visual
discrimination of tokens. In feed-forward, the differentiable routing gates are
designed to select the scaling paths and feature transformation paths for image
tokens, leading to multi-path feature propagation. In this way, the impact of
object scales and visual discrimination of image representation can be
carefully tuned. Moreover, the computational cost can be further reduced by
giving budget constraints to the routing gate and early-stopping of feature
extraction. In experiments, our DiT achieves superior performance and more
favorable complexity/accuracy trade-offs than many SoTA methods on ImageNet
classification, object detection, instance segmentation, and semantic
segmentation. In particular, DiT-B5 obtains 84.8% top-1 accuracy on ImageNet
with 10.3 GFLOPs, which is 1.0% higher than the SoTA method with
similar computational complexity. These extensive results demonstrate that DiT
can serve as versatile backbones for various vision tasks.
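As a rough illustration of the routing idea described above, the sketch below mixes each token through several candidate paths using a softmax gate and adds a simple budget penalty on the expensive path. The gate design, path functions, and penalty form are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def route_tokens(tokens, path_fns, gate_w, budget=0.5):
    """Soft, differentiable-style token routing: each token mixes the
    outputs of several candidate paths according to a learned gate.

    tokens:   (N, D) array of token features
    path_fns: list of P callables, each mapping (N, D) -> (N, D)
    gate_w:   (D, P) gate projection (stand-in for a learned parameter)
    budget:   target average usage of the expensive path (index 1 here)
    """
    gates = softmax(tokens @ gate_w)                              # (N, P)
    path_outs = np.stack([f(tokens) for f in path_fns], axis=-1)  # (N, D, P)
    mixed = (path_outs * gates[:, None, :]).sum(axis=-1)          # (N, D)
    # Budget penalty: discourage the heavy path when usage exceeds budget
    cost = max(0.0, float(gates[:, 1].mean()) - budget)
    return mixed, gates, cost

# Toy usage: two paths (identity vs. a linear transform) over 4 tokens
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 8)) * 0.1
out, gates, cost = route_tokens(X, [lambda t: t, lambda t: t @ W],
                                rng.standard_normal((8, 2)))
```

In a trained model the gate would be optimized jointly with the paths, and the budget term would enter the loss; here it is only computed to show the mechanism.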
Related papers
- CAT: Content-Adaptive Image Tokenization [92.2116487267877]
We introduce Content-Adaptive Tokenizer (CAT), which adjusts representation capacity based on the image content and encodes simpler images into fewer tokens.
We design a caption-based evaluation system that leverages large language models (LLMs) to predict content complexity and determine the optimal compression ratio for a given image.
By optimizing token allocation, CAT improves the FID score over fixed-ratio baselines trained with the same FLOPs and boosts inference throughput by 18.5%.
arXiv Detail & Related papers (2025-01-06T16:28:47Z)
- Patch Is Not All You Need [57.290256181083016]
We propose a novel Pattern Transformer to adaptively convert images to pattern sequences for Transformer input.
We employ the Convolutional Neural Network to extract various patterns from the input image.
We have accomplished state-of-the-art performance on CIFAR-10 and CIFAR-100, and have achieved competitive results on ImageNet.
arXiv Detail & Related papers (2023-08-21T13:54:00Z)
- Unsupervised Domain Adaptation with Histogram-gated Image Translation for Delayered IC Image Analysis [2.720699926154399]
Histogram-gated Image Translation (HGIT) is an unsupervised domain adaptation framework which transforms images from a given source dataset to the domain of a target dataset.
Our method achieves the best performance compared to the reported domain adaptation techniques, and is also reasonably close to the fully supervised benchmark.
arXiv Detail & Related papers (2022-09-27T15:53:22Z)
- Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images [6.460167724233707]
We propose a bilateral awareness network (BANet) which contains a dependency path and a texture path.
BANet captures the long-range relationships and fine-grained details in VFR images.
Experiments conducted on three large-scale urban scene image segmentation datasets, i.e., the ISPRS Vaihingen dataset, ISPRS Potsdam dataset, and UAVid dataset, demonstrate the effectiveness of BANet.
arXiv Detail & Related papers (2021-06-23T13:57:36Z)
- XCiT: Cross-Covariance Image Transformers [73.33400159139708]
We propose a "transposed" version of self-attention that operates across feature channels rather than tokens.
The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images.
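The channel-wise ("transposed") attention described above can be sketched as follows. The l2 normalization of queries and keys along the token axis follows the general XCA idea, but details such as the temperature, multi-head structure, and softmax placement are simplified assumptions here, not XCiT's exact formulation.

```python
import numpy as np

def xca(X, Wq, Wk, Wv, tau=1.0):
    """Cross-covariance attention sketch (single head): the attention map
    is d x d over channels, so the cost is linear in the token count N."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                          # each (N, d)
    # l2-normalize Q and K along the token dimension
    Qh = Q / (np.linalg.norm(Q, axis=0, keepdims=True) + 1e-6)
    Kh = K / (np.linalg.norm(K, axis=0, keepdims=True) + 1e-6)
    A = Kh.T @ Qh / tau                                       # (d, d) scores
    A = np.exp(A - A.max(axis=0, keepdims=True))
    A = A / A.sum(axis=0, keepdims=True)                      # softmax over channels
    return V @ A                                              # (N, d)

# Toy usage: 6 tokens with 4 channels
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
Wq, Wk, Wv = (rng.standard_normal((4, 4)) for _ in range(3))
out = xca(X, Wq, Wk, Wv)
```

Because the d x d attention map never grows with N, doubling the number of tokens only doubles the cost of the matrix products, which is what makes high-resolution inputs tractable.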
arXiv Detail & Related papers (2021-06-17T17:33:35Z)
- AugNet: End-to-End Unsupervised Visual Representation Learning with Image Augmentation [3.6790362352712873]
We propose AugNet, a new deep learning training paradigm to learn image features from a collection of unlabeled pictures.
Our experiments demonstrate that the method is able to represent the image in low dimensional space.
Unlike many deep-learning-based image retrieval algorithms, our approach does not require access to external annotated datasets.
arXiv Detail & Related papers (2021-06-11T09:02:30Z)
- Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
It brings a great benefit by scaling dimensions of depth/width/resolution/patch size without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
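A minimal sketch of progressive token pooling, assuming simple max pooling along the sequence axis; HVT's actual pooling schedule, kernel, and dimension scaling differ from this toy version.

```python
import numpy as np

def pool_tokens(tokens, kernel=2):
    """Max-pool along the token (sequence) axis, shrinking the
    sequence length by `kernel` while keeping the channel dimension."""
    N, D = tokens.shape
    n = N // kernel * kernel                      # drop any remainder tokens
    return tokens[:n].reshape(n // kernel, kernel, D).max(axis=1)

# 16 tokens -> 8 -> 4 across two pooling stages
X = np.random.default_rng(0).standard_normal((16, 32))
s1 = pool_tokens(X)   # (8, 32)
s2 = pool_tokens(s1)  # (4, 32)
```

Each stage halves the sequence length, so self-attention cost (quadratic in sequence length) drops roughly 4x per stage, which is the benefit the abstract refers to.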
arXiv Detail & Related papers (2021-03-19T03:55:58Z)
- Learning Dynamic Routing for Semantic Segmentation [86.56049245100084]
This paper studies a conceptually new method to alleviate the scale variance in semantic representation, named dynamic routing.
The proposed framework generates data-dependent routes, adapting to the scale distribution of each image.
To this end, a differentiable gating function, called soft conditional gate, is proposed to select scale transform paths on the fly.
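One way a gate can select paths "on the fly" is to use ReLU-based weights that can reach exactly zero, so a scale path with zero weight can be skipped entirely at inference. The sketch below is a hedged illustration of that idea; the projection, inputs, and normalization are hypothetical, not the paper's soft conditional gate.

```python
import numpy as np

def soft_conditional_gate(feat, w):
    """Gate sketch: ReLU allows exact zeros, so zero-weight scale
    paths can be pruned; nonzero weights are normalized to sum to 1."""
    logits = feat @ w                    # (n_paths,) raw path scores
    g = np.maximum(logits, 0.0)          # ReLU: hard zeros are possible
    total = g.sum()
    return g / total if total > 0 else g

# Toy usage: 3 candidate scale paths (e.g. down / keep / up)
feat = np.array([0.5, -1.0, 2.0])
w = np.array([[1.0, -1.0, 0.0],
              [0.0,  1.0, 0.0],
              [1.0,  0.0, 1.0]])
g = soft_conditional_gate(feat, w)
active = [i for i, gi in enumerate(g) if gi > 0]  # paths actually executed
```

Here the second path's score is negative, so its gate is exactly zero and it would not be executed at all, which is where the computational savings come from.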
arXiv Detail & Related papers (2020-03-23T17:22:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.