Rethinking Hierarchicies in Pre-trained Plain Vision Transformer
- URL: http://arxiv.org/abs/2211.01785v1
- Date: Thu, 3 Nov 2022 13:19:23 GMT
- Title: Rethinking Hierarchicies in Pre-trained Plain Vision Transformer
- Authors: Yufei Xu, Jing Zhang, Qiming Zhang and Dacheng Tao
- Abstract summary: Self-supervised pre-training of vision transformers (ViTs) via masked image modeling (MIM) has proven very effective.
However, hierarchical ViTs require carefully designed, customized algorithms, e.g., GreenMIM, instead of the vanilla and simple MAE used for the plain ViT.
This paper proposes a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training.
- Score: 76.35955924137986
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised pre-training of vision transformers (ViTs) via masked
image modeling (MIM) has proven very effective. However, customized algorithms,
e.g., GreenMIM, have to be carefully designed for hierarchical ViTs, instead of
the vanilla and simple MAE used for the plain ViT. More importantly, since
these hierarchical ViTs cannot reuse the off-the-shelf pre-trained weights of
the plain ViTs, they must be pre-trained from scratch, which leads to a massive
computational cost and incurs both algorithmic and computational complexity. In
this paper, we address this problem by proposing a novel idea of disentangling
the hierarchical architecture design from the self-supervised pre-training. We
transform the plain ViT into a hierarchical one with minimal changes.
Technically, we change the stride of the linear embedding layer from 16 to 4
and add convolution (or simple average) pooling layers between the transformer
blocks, progressively reducing the feature map size from 1/4 to 1/32 of the
input resolution. Despite its simplicity, the resulting hierarchical ViT
outperforms the plain ViT baseline on classification, detection, and
segmentation, evaluated on the ImageNet, MS COCO, Cityscapes, and ADE20K
benchmarks. We hope this preliminary study draws more attention from the
community to developing effective (hierarchical) ViTs that avoid the
pre-training cost by leveraging off-the-shelf checkpoints. The code and models
will be released at https://github.com/ViTAE-Transformer/HPViT.
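As a companion to the abstract above, the snippet below is a minimal sketch of the described transformation in PyTorch-style Python. It is an illustration under stated assumptions, not the authors' released implementation: the patch-embedding stride is reduced from 16 to 4, and simple average-pooling layers are inserted between groups of plain transformer blocks so the feature map shrinks from 1/4 to 1/32 of the input resolution. All names (`HierarchicalFromPlainViT`, `stage_splits`) and the particular stage split are hypothetical.

```python
# Minimal sketch (assumption): turning a plain ViT into a hierarchical one by
# (1) reducing the patch-embedding stride from 16 to 4 and
# (2) inserting average-pooling layers between groups of transformer blocks,
# so the feature map shrinks 1/4 -> 1/8 -> 1/16 -> 1/32, as in the abstract.
import torch
import torch.nn as nn


class HierarchicalFromPlainViT(nn.Module):
    def __init__(self, embed_dim=768, depth=12, num_heads=12,
                 stage_splits=(2, 2, 6, 2)):
        super().__init__()
        assert sum(stage_splits) == depth
        # Patch embedding: stride changed from 16 (plain ViT) to 4.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=4, stride=4)
        # Plain-ViT-style transformer blocks, grouped into stages.
        def block():
            return nn.TransformerEncoderLayer(
                d_model=embed_dim, nhead=num_heads,
                dim_feedforward=4 * embed_dim,
                batch_first=True, norm_first=True)
        self.stages = nn.ModuleList(
            nn.ModuleList(block() for _ in range(n)) for n in stage_splits)
        # Simple average pooling between stages halves the spatial size.
        self.pools = nn.ModuleList(
            nn.AvgPool2d(kernel_size=2, stride=2) for _ in stage_splits[:-1])

    def forward(self, x):
        x = self.patch_embed(x)                      # (B, C, H/4, W/4)
        for i, blocks in enumerate(self.stages):
            b, c, h, w = x.shape
            tokens = x.flatten(2).transpose(1, 2)    # (B, H*W, C)
            for blk in blocks:
                tokens = blk(tokens)
            x = tokens.transpose(1, 2).reshape(b, c, h, w)
            if i < len(self.pools):                  # downsample between stages
                x = self.pools[i](x)                 # ... H/8, H/16, H/32
        return x                                     # (B, C, H/32, W/32)


# Usage: a ViT-Tiny-sized configuration for a quick test; a 224x224 image
# yields a 7x7 feature map (1/32 of the input resolution).
model = HierarchicalFromPlainViT(embed_dim=192, num_heads=3)
feats = model(torch.randn(1, 3, 224, 224))
print(feats.shape)  # torch.Size([1, 192, 7, 7])
```

Note that the transformer blocks themselves stay plain-ViT blocks; only the embedding stride and the pooling between stages change, which is what allows off-the-shelf pre-trained plain-ViT weights to be reused.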
Related papers
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
The Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models in computer vision applications.
However, ViTs are ill-suited for private inference using secure multi-party protocols due to their large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
When running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9× speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z)
- Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT trims 52.0% of the FLOPs of DeiT-B while simultaneously gaining an impressive 0.6% in top-1 accuracy.
arXiv Detail & Related papers (2021-11-23T11:35:54Z)
- So-ViT: Mind Visual Tokens for Vision Transformer [27.243241133304785]
We propose a new classification paradigm, where second-order, cross-covariance pooling of visual tokens is combined with the class token for final classification.
We develop a light-weight, hierarchical module based on off-the-shelf convolutions for visual token embedding.
The results show our models, when trained from scratch, outperform the competing ViT variants, while being on par with or better than state-of-the-art CNN models.
arXiv Detail & Related papers (2021-04-22T09:05:09Z)
- DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps and increase their diversity (a sketch of this idea follows below).
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
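As a rough illustration of the Re-attention idea mentioned in the DeepViT entry above, the sketch below mixes the per-head attention maps with a small learnable matrix before applying them to the values. This is a simplification under assumptions, not DeepViT's released implementation: the class name, the identity initialization of `theta`, and the omission of any normalization of the regenerated maps are illustrative choices, and the paper's exact formulation may differ.

```python
# Minimal sketch (assumption): "re-attention" that regenerates attention maps by
# mixing them across heads with a learnable H x H matrix, to increase diversity.
# Names and details are illustrative; the DeepViT formulation may differ.
import torch
import torch.nn as nn


class ReAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learnable mixing of attention maps across heads (identity init).
        self.theta = nn.Parameter(torch.eye(num_heads))

    def forward(self, x):                        # x: (B, N, dim)
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)     # each: (B, H, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)              # (B, H, N, N)
        # Regenerate attention maps: linear combination across the head axis.
        attn = torch.einsum('hg,bgij->bhij', self.theta, attn)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)


# Usage: 197 tokens (e.g., 14x14 patches + a class token) with dim 384.
y = ReAttention(dim=384, num_heads=8)(torch.randn(2, 197, 384))
print(y.shape)  # torch.Size([2, 197, 384])
```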