Green Hierarchical Vision Transformer for Masked Image Modeling
- URL: http://arxiv.org/abs/2205.13515v1
- Date: Thu, 26 May 2022 17:34:42 GMT
- Title: Green Hierarchical Vision Transformer for Masked Image Modeling
- Authors: Lang Huang, Shan You, Mingkai Zheng, Fei Wang, Chen Qian, Toshihiko Yamasaki
- Abstract summary: We present an efficient approach for Masked Image Modeling with hierarchical Vision Transformers (ViTs).
We design a Group Window Attention scheme following the Divide-and-Conquer strategy.
We further improve the grouping strategy via the Dynamic Programming algorithm to minimize the overall cost of the attention on the grouped patches.
- Score: 54.14989750044489
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present an efficient approach for Masked Image Modeling (MIM) with
hierarchical Vision Transformers (ViTs), e.g., Swin Transformer, allowing the
hierarchical ViTs to discard masked patches and operate only on the visible
ones. Our approach consists of two key components. First, for the window
attention, we design a Group Window Attention scheme following the
Divide-and-Conquer strategy. To mitigate the quadratic complexity of
self-attention with respect to the number of patches, Group Window Attention
encourages a uniform partition in which the visible patches within local
windows of arbitrary size are packed into groups of equal size, and masked
self-attention is then performed within each group. Second, we further improve
the grouping strategy via a Dynamic Programming algorithm that minimizes the
overall computation cost of the attention on the grouped patches. As a result,
MIM can now run on hierarchical ViTs in a green and efficient way. For example,
we can train the hierarchical ViTs about 2.7$\times$ faster and reduce GPU
memory usage by 70%, while retaining competitive performance on ImageNet
classification and superior results on the downstream COCO object detection
benchmark. Code and
pre-trained models have been made publicly available at
https://github.com/LayneH/GreenMIM.
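The first component above, Group Window Attention, packs the visible patches of several local windows into equal-size groups and runs self-attention within each group under a mask that blocks pairs of patches from different windows. Below is a minimal PyTorch sketch of that mechanism, not the authors' implementation (the official code is at https://github.com/LayneH/GreenMIM); the function name masked_group_attention and the per-token window_ids/group_ids inputs are illustrative assumptions, and the Q/K/V projections of a real attention layer are omitted for brevity.

```python
import torch

def masked_group_attention(x, window_ids, group_ids):
    """x: (N, C) features of the visible tokens only; window_ids / group_ids: (N,)
    integer labels, with every window fully contained in a single group.
    Attention is computed per group, but a boolean mask restricts it to token
    pairs from the same window, so the result matches per-window attention while
    the computation is batched over equal-size groups. Q/K/V projections are
    omitted for brevity."""
    N, C = x.shape
    G = int(group_ids.max()) + 1
    S = int(torch.bincount(group_ids).max())       # pad every group to a common size

    xg = x.new_zeros(G, S, C)                      # packed tokens, zero-padded
    wg = window_ids.new_full((G, S), -1)           # -1 marks padding slots
    slot = torch.zeros(N, dtype=torch.long)        # slot of each token in its group
    fill = [0] * G
    for i in range(N):                             # simple packing loop for clarity
        g = int(group_ids[i])
        s = fill[g]
        fill[g] += 1
        xg[g, s], wg[g, s], slot[i] = x[i], window_ids[i], s

    # A pair of tokens may attend to each other only if they share a window id.
    block = wg[:, :, None] != wg[:, None, :]       # (G, S, S), True = masked out
    attn = (xg @ xg.transpose(1, 2)) / C ** 0.5
    attn = attn.masked_fill(block, float("-inf")).softmax(dim=-1)
    out = attn @ xg                                # (G, S, C)
    return out[group_ids, slot]                    # back to the original (N, C) order

# Toy usage: 8 visible tokens from 3 windows, packed into 2 groups of 4 tokens each.
x = torch.randn(8, 16)
window_ids = torch.tensor([0, 0, 0, 0, 1, 1, 2, 2])
group_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])   # windows 1 and 2 share a group
print(masked_group_attention(x, window_ids, group_ids).shape)  # torch.Size([8, 16])
```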
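The second component, choosing the grouping so that the overall attention cost is minimized, is only named in the abstract (a Dynamic Programming algorithm) without details. The sketch below is therefore a simplified stand-in rather than GreenMIM's actual DP: a greedy first-fit-decreasing packing of whole windows plus a brute-force search over candidate group sizes, under the assumed cost model (number of groups) x (group size)^2 for attention on the padded groups. The names pack_windows and choose_group_size, the candidate range, and the cost model are all assumptions.

```python
def pack_windows(visible_counts, group_size):
    """Greedy first-fit-decreasing packing: assign whole windows (given by their
    visible-token counts) to groups whose total count never exceeds group_size.
    Returns a list of groups, each a list of window indices."""
    order = sorted(range(len(visible_counts)),
                   key=lambda i: visible_counts[i], reverse=True)
    groups, loads = [], []
    for w in order:
        n = visible_counts[w]
        for g, load in enumerate(loads):
            if load + n <= group_size:             # first group with enough room
                groups[g].append(w)
                loads[g] += n
                break
        else:                                      # no room anywhere: open a new group
            groups.append([w])
            loads.append(n)
    return groups

def choose_group_size(visible_counts, candidates):
    """Search candidate group sizes and keep the one with the lowest estimated
    attention cost, taken here as (number of groups) * group_size ** 2 because
    every group is padded to group_size tokens before self-attention."""
    best = None
    for s in candidates:
        if s < max(visible_counts):                # a single window must fit in a group
            continue
        groups = pack_windows(visible_counts, s)
        cost = len(groups) * s * s
        if best is None or cost < best[0]:
            best = (cost, s, groups)
    return best

# Toy example: per-window visible-token counts left after random masking.
counts = [3, 7, 2, 5, 6, 1, 4]
cost, size, groups = choose_group_size(counts, candidates=range(7, 15))
print(size, cost, groups)   # chosen group size, its cost estimate, and the grouping
```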
Related papers
- Toward a Deeper Understanding: RetNet Viewed through Convolution [25.8904146140577]
Vision Transformers (ViTs) can learn global dependencies better than CNNs, yet the inherent locality of CNNs can substitute for the expensive training resources ViTs require.
This paper investigates the effectiveness of RetNet from a CNN perspective and presents a variant of RetNet tailored to the visual domain.
We propose a novel Gaussian mixture mask (GMM) in which each mask has only two learnable parameters and which can be conveniently used in any ViT variant whose attention mechanism allows the use of masks.
arXiv Detail & Related papers (2023-09-11T10:54:22Z)
- Image as Set of Points [60.30495338399321]
Context Clusters (CoCs) view an image as a set of unorganized points and extract features via a simplified clustering algorithm.
Our CoCs are convolution- and attention-free, relying only on the clustering algorithm for spatial interaction.
arXiv Detail & Related papers (2023-03-02T18:56:39Z)
- DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition [62.95223898214866]
We explore effective Vision Transformers to pursue a preferable trade-off between computational complexity and the size of the attended receptive field.
With a pyramid architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by stacking Multi-Scale Dilated Attention (MSDA) blocks at low-level stages and global multi-head self-attention blocks at high-level stages.
Our experiment results show that our DilateFormer achieves state-of-the-art performance on various vision tasks.
arXiv Detail & Related papers (2023-02-03T14:59:31Z)
- Rethinking Hierarchicies in Pre-trained Plain Vision Transformer [76.35955924137986]
Self-supervised pre-training of vision transformers (ViTs) via masked image modeling (MIM) has proven very effective.
However, customized algorithms such as GreenMIM have to be carefully designed for hierarchical ViTs, instead of reusing the vanilla and simple MAE designed for plain ViTs.
This paper proposes a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training.
arXiv Detail & Related papers (2022-11-03T13:19:23Z)
- Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality [28.245387355693545]
Masked AutoEncoder (MAE) has led the trend in visual self-supervision with an elegant asymmetric encoder-decoder design.
We propose Uniform Masking (UM) to enable MAE pre-training for Pyramid-based ViTs with locality.
arXiv Detail & Related papers (2022-05-20T10:16:30Z)
- Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT can trim 52.0% of the FLOPs of DeiT-B while simultaneously obtaining an impressive 0.6% gain in top-1 accuracy.
arXiv Detail & Related papers (2021-11-23T11:35:54Z)
- Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [128.96032932640364]
We propose a new Tokens-To-Token Vision Transformer (T2T-ViT) to solve vision tasks.
T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 2.5% improvement when trained from scratch on ImageNet.
For example, T2T-ViT with ResNet50 comparable size can achieve 80.7% top-1 accuracy on ImageNet.
arXiv Detail & Related papers (2021-01-28T13:25:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.