Coarse-to-Fine Vision Transformer
- URL: http://arxiv.org/abs/2203.03821v1
- Date: Tue, 8 Mar 2022 02:57:49 GMT
- Title: Coarse-to-Fine Vision Transformer
- Authors: Mengzhao Chen, Mingbao Lin, Ke Li, Yunhang Shen, Yongjian Wu, Fei
Chao, Rongrong Ji
- Abstract summary: We propose a coarse-to-fine vision transformer (CF-ViT) to relieve computational burden while retaining performance.
Our proposed CF-ViT is motivated by two important observations in modern ViT models.
Our CF-ViT reduces the FLOPs of LV-ViT by 53% and improves throughput by 2.01x.
- Score: 83.45020063642235
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViT) have made many breakthroughs in computer vision
tasks. However, considerable redundancy arises in the spatial dimension of an
input image, leading to massive computational costs. In this paper, we therefore propose a
coarse-to-fine vision transformer (CF-ViT) to relieve the computational burden
while retaining performance. Our proposed CF-ViT is motivated by
two important observations in modern ViT models: (1) The coarse-grained patch
splitting can locate the informative regions of an input image. (2) Most images can
be well recognized by a ViT model with a short token sequence. Therefore,
our CF-ViT performs inference in a two-stage manner. At the coarse
stage, an input image is split into a short patch sequence for
a computationally economical classification. If the image is not well recognized, its
informative patches are identified and re-split at a finer
granularity. Extensive experiments demonstrate the efficacy of our CF-ViT. For
example, without any compromise on performance, CF-ViT reduces the FLOPs of
LV-ViT by 53% and improves its throughput by 2.01x.
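The two-stage procedure described above maps naturally to a short inference loop. Below is a minimal PyTorch-style sketch of that loop; the confidence threshold, the top-k patch selection, the helper functions, and the assumption that the ViT returns per-patch informativeness scores are illustrative choices, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def patchify(image, patch_size):
    """Split a (C, H, W) image into flattened non-overlapping patches."""
    c, _, _ = image.shape
    patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch_size * patch_size)

def resplit(image, patch_ids, coarse, fine):
    """Re-split only the selected coarse patches at the finer patch size."""
    _, _, w = image.shape
    per_row = w // coarse
    fine_patches = []
    for idx in patch_ids.tolist():
        row, col = divmod(idx, per_row)
        region = image[:, row * coarse:(row + 1) * coarse, col * coarse:(col + 1) * coarse]
        fine_patches.append(patchify(region, fine))
    return torch.cat(fine_patches, dim=0)

def cf_vit_inference(image, vit, coarse=32, fine=16, threshold=0.7, keep_ratio=0.5):
    # `vit` is a hypothetical callable returning (class_logits, per_patch_scores)
    # for a patch sequence of any length -- an assumption made for this sketch.
    coarse_tokens = patchify(image, coarse)
    logits, scores = vit(coarse_tokens)
    probs = F.softmax(logits, dim=-1)
    if probs.max() >= threshold:              # coarse prediction is confident enough
        return probs.argmax()
    k = max(1, int(keep_ratio * coarse_tokens.shape[0]))
    informative = scores.topk(k).indices      # most informative coarse patches
    logits_fine, _ = vit(resplit(image, informative, coarse, fine))
    return logits_fine.argmax()
```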
Related papers
- Denoising Vision Transformers [43.03068202384091]
We propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT).
In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis.
In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision.
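The second stage lends itself to a compact training loop: a small transformer is fit to map raw ViT features to the stage-one clean-feature estimates. The sketch below assumes 768-dim tokens, a single standard PyTorch encoder layer, and an MSE objective; none of these choices are taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One standard encoder layer stands in for the paper's "lightweight transformer
# block" (the real block size is not specified in the summary above).
denoiser = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
optimizer = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)

def stage2_step(raw_vit_feats, clean_targets):
    """One training step: raw ViT features -> predicted clean features.

    raw_vit_feats, clean_targets: (batch, num_tokens, 768); the targets are the
    per-image clean-feature estimates produced by stage one.
    """
    pred = denoiser(raw_vit_feats)
    loss = F.mse_loss(pred, clean_targets)     # assumed regression objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```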
arXiv Detail & Related papers (2024-01-05T18:59:52Z)
- LT-ViT: A Vision Transformer for multi-label Chest X-ray classification [2.3022732986382213]
Vision Transformers (ViTs) are widely adopted in medical imaging tasks, and some existing efforts have been directed towards vision-language training for Chest X-rays (CXRs).
We have developed LT-ViT, a transformer that utilizes combined attention between image tokens and randomly initialized auxiliary tokens that represent labels.
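A sketch of the combined-attention idea, assuming one learnable token per label appended to the image tokens and read out with a per-label classifier; the label count (14, a common CXR label set), dimensions, and depth are placeholders rather than LT-ViT's actual configuration.

```python
import torch
import torch.nn as nn

class LabelTokenViTHead(nn.Module):
    """Joint attention over image tokens and learnable label tokens (sketch)."""

    def __init__(self, num_labels=14, dim=768, depth=2, nhead=12):
        super().__init__()
        # Randomly initialized auxiliary tokens, one per label.
        self.label_tokens = nn.Parameter(torch.randn(num_labels, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.classifier = nn.Linear(dim, 1)

    def forward(self, image_tokens):                      # (B, N, dim)
        b = image_tokens.shape[0]
        labels = self.label_tokens.unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([labels, image_tokens], dim=1)      # label and image tokens attend jointly
        x = self.encoder(x)
        label_out = x[:, : self.label_tokens.shape[0]]    # read predictions off the label tokens
        return self.classifier(label_out).squeeze(-1)     # (B, num_labels) multi-label logits
```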
arXiv Detail & Related papers (2023-11-13T12:02:46Z)
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
- Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
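The locality bias can be illustrated with a distance-based re-weighting of attention over the 2D patch grid. The sketch below is deliberately simplified: it keeps ordinary quadratic attention and only shows the bias toward nearby patches, whereas the paper's Vicinity Attention is formulated to run in linear complexity; the Manhattan-distance penalty and the alpha weight are assumptions.

```python
import torch
import torch.nn.functional as F

def vicinity_biased_attention(q, k, v, grid_h, grid_w, alpha=1.0):
    """Locality-biased attention over (N, d) tokens laid out on a grid_h x grid_w grid."""
    ys, xs = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()   # (N, 2) patch coordinates
    dist = (pos[:, None, :] - pos[None, :, :]).abs().sum(-1)          # (N, N) Manhattan distances

    # Standard scaled dot-product scores, penalised by distance between patches.
    scores = q @ k.t() / q.shape[-1] ** 0.5 - alpha * dist
    weights = F.softmax(scores, dim=-1)
    return weights @ v
```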
arXiv Detail & Related papers (2022-06-21T17:33:53Z)
- Super Vision Transformer [131.4777773281238]
Experimental results on ImageNet demonstrate that our SuperViT can considerably reduce the computational costs of ViT models while even increasing performance.
Our SuperViT significantly outperforms existing studies on efficient vision transformers.
arXiv Detail & Related papers (2022-05-23T15:42:12Z)
- Deeper Insights into ViTs Robustness towards Common Corruptions [82.79764218627558]
We investigate how CNN-like architectural designs and CNN-based data augmentation strategies impact ViTs' robustness to common corruptions.
We demonstrate that overlapping patch embedding and a convolutional Feed-Forward Network (FFN) improve robustness to such corruptions.
We also introduce a novel conditional method enabling input-varied augmentations from two angles.
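As a concrete reference point for the first of those two designs, the snippet below sketches an overlapping patch embedding: a convolutional stem whose kernel is larger than its stride, so adjacent patches share pixels, unlike the kernel == stride == 16 stem of a vanilla ViT. The specific kernel/stride/padding values are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Overlapping patch embedding (sketch): stride < kernel size, so neighbouring
# patches overlap; a vanilla ViT stem uses kernel == stride (no overlap).
overlap_embed = nn.Conv2d(in_channels=3, out_channels=768,
                          kernel_size=16, stride=8, padding=4)

x = torch.randn(1, 3, 224, 224)
tokens = overlap_embed(x).flatten(2).transpose(1, 2)   # (1, 28 * 28, 768) token sequence
```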
arXiv Detail & Related papers (2022-04-26T08:22:34Z)
- ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator [21.351034332423374]
We propose a novel ViT based fine-grained object discriminator for Fine-Grained Visual Classification (FGVC) tasks.
Besides a ViT backbone, it introduces three novel components, i.e., Attention Patch Combination (APC), Critical Regions Filter (CRF), and Complementary Tokens Integration (CTI).
We conduct comprehensive experiments on widely used datasets and the results demonstrate that ViT-FOD is able to achieve state-of-the-art performance.
arXiv Detail & Related papers (2022-03-24T02:34:57Z)
- Vision Xformers: Efficient Attention for Image Classification [0.0]
We modify the ViT architecture to work on longer sequence data by replacing the quadratic attention with efficient transformers.
We show that ViX performs better than ViT in image classification while consuming fewer computing resources.
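The summary does not name one specific efficient-attention variant, so the sketch below shows a generic linear-attention formulation (positive kernel feature maps in place of the softmax) purely as an illustration of how quadratic attention gets replaced; it is not claimed to be the variant used in ViX.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    """Kernel-feature-map linear attention (illustrative, not ViX-specific).

    q, k, v: (batch, heads, seq_len, head_dim).
    Cost is O(N * d^2) instead of the O(N^2 * d) of softmax attention.
    """
    q = F.elu(q) + 1                                   # positive feature map phi(q)
    k = F.elu(k) + 1                                   # positive feature map phi(k)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)         # sum_n phi(k_n) v_n^T
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
```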
arXiv Detail & Related papers (2021-07-05T19:24:23Z)
- Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
This allows scaling the depth/width/resolution/patch-size dimensions without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
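A minimal sketch of the progressive-pooling idea, assuming a 1D max-pool that halves the token sequence between groups of transformer layers; the dimensions, depth, and pooling operator here are placeholders, not HVT's actual configuration.

```python
import torch
import torch.nn as nn

class PooledViTStage(nn.Module):
    """One stage of transformer layers followed by token pooling (sketch)."""

    def __init__(self, dim=384, depth=2, nhead=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, tokens):                                        # (B, N, dim)
        tokens = self.blocks(tokens)
        tokens = self.pool(tokens.transpose(1, 2)).transpose(1, 2)    # (B, ~N/2, dim)
        return tokens                                                 # later stages see fewer tokens
```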
arXiv Detail & Related papers (2021-03-19T03:55:58Z)