Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
- URL: http://arxiv.org/abs/2306.00989v1
- Date: Thu, 1 Jun 2023 17:59:58 GMT
- Title: Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
- Authors: Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan,
Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy
Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer
- Abstract summary: In this paper, we argue that this additional bulk is unnecessary.
By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer.
We create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models.
- Score: 65.54857068975068
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern hierarchical vision transformers have added several vision-specific
components in the pursuit of supervised classification performance. While these
components lead to effective accuracies and attractive FLOP counts, the added
complexity actually makes these transformers slower than their vanilla ViT
counterparts. In this paper, we argue that this additional bulk is unnecessary.
By pretraining with a strong visual pretext task (MAE), we can strip out all
the bells-and-whistles from a state-of-the-art multi-stage vision transformer
without losing accuracy. In the process, we create Hiera, an extremely simple
hierarchical vision transformer that is more accurate than previous models
while being significantly faster both at inference and during training. We
evaluate Hiera on a variety of tasks for image and video recognition. Our code
and models are available at https://github.com/facebookresearch/hiera.
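As a rough illustration of the MAE pretext task the abstract leans on, the following minimal, self-contained PyTorch sketch shows masked-autoencoder pretraining. It is not the authors' code (see the repository above): the encoder is a plain, non-hierarchical transformer rather than Hiera's mask-unit design, and all module names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Toy masked autoencoder: mask most patches, encode the rest, reconstruct pixels."""
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, mask_ratio=0.75):
        super().__init__()
        self.patch, self.mask_ratio = patch, mask_ratio
        self.num_patches = (img_size // patch) ** 2
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # layers are deep-copied
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)      # lightweight decoder
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, 3 * patch * patch)                  # per-patch pixel output

    def forward(self, imgs):
        B = imgs.shape[0]
        tokens = self.patchify(imgs).flatten(2).transpose(1, 2) + self.pos  # (B, N, dim)
        N, D = tokens.shape[1], tokens.shape[2]
        keep = int(N * (1 - self.mask_ratio))

        # Randomly keep a small subset of patch tokens; the rest are masked out.
        noise = torch.rand(B, N, device=imgs.device)
        ids_shuffle = noise.argsort(dim=1)
        ids_restore = ids_shuffle.argsort(dim=1)
        ids_keep = ids_shuffle[:, :keep]
        visible = torch.gather(tokens, 1, ids_keep[..., None].expand(-1, -1, D))

        # Encode only visible tokens, then append mask tokens and restore the original order.
        latent = self.encoder(visible)
        full = torch.cat([latent, self.mask_token.expand(B, N - keep, -1)], dim=1)
        full = torch.gather(full, 1, ids_restore[..., None].expand(-1, -1, D))
        pred = self.head(self.decoder(full + self.pos))                # (B, N, 3*p*p)

        # Reconstruction loss (MSE) on the masked patches only.
        target = imgs.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        target = target.permute(0, 2, 3, 1, 4, 5).reshape(B, N, -1)
        mask = torch.ones(B, N, device=imgs.device)
        mask.scatter_(1, ids_keep, 0.0)                                # 1 = masked
        return (((pred - target) ** 2).mean(-1) * mask).sum() / mask.sum()

# Usage: one pretraining step on random data.
model = TinyMAE()
loss = model(torch.randn(2, 3, 224, 224))
loss.backward()
```

After pretraining, the decoder is discarded and the encoder is fine-tuned for the downstream task, which is the recipe the abstract describes for removing vision-specific components without losing accuracy.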
Related papers
- Reversible Vision Transformers [74.3500977090597]
Reversible Vision Transformers are a memory efficient architecture for visual recognition.
We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants.
We find that the additional computational burden of recomputing activations is more than overcome for deeper models (a minimal reversible-block sketch follows this entry).
arXiv Detail & Related papers (2023-02-09T18:59:54Z)
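The entry above relies on reversible residual coupling, where block inputs can be recomputed from block outputs instead of being stored. The sketch below illustrates that mechanism with placeholder MLP sub-functions; the actual models use attention and MLP blocks, so treat this as an illustration of activation recomputation, not the paper's code.

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """y1 = x1 + F(x2); y2 = x2 + G(y1). Inputs are exactly recoverable from outputs,
    so intermediate activations need not be cached for the backward pass."""
    def __init__(self, dim):
        super().__init__()
        mlp = lambda: nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim),
                                    nn.GELU(), nn.Linear(dim, dim))
        self.F, self.G = mlp(), mlp()

    def forward(self, x1, x2):
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Recompute the inputs on the fly: trades extra FLOPs for activation memory.
        x2 = y2 - self.G(y1)
        x1 = y1 - self.F(x2)
        return x1, x2

# Round-trip check: the outputs invert back to the inputs.
blk = ReversibleBlock(64)
x1, x2 = torch.randn(2, 16, 64), torch.randn(2, 16, 64)
with torch.no_grad():
    r1, r2 = blk.inverse(*blk.forward(x1, x2))
assert torch.allclose(r1, x1, atol=1e-5) and torch.allclose(r2, x2, atol=1e-5)
```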
- Holistically Explainable Vision Transformers [136.27303006772294]
We propose B-cos transformers, which inherently provide holistic explanations for their decisions.
Specifically, we formulate each model component - such as the multi-layer perceptrons, attention layers, and the tokenisation module - to be dynamic linear.
We apply our proposed design to Vision Transformers (ViTs) and show that the resulting models, dubbed Bcos-ViTs, are highly interpretable and perform competitively with baseline ViTs (a toy dynamic-linear layer is sketched after this entry).
arXiv Detail & Related papers (2023-01-20T16:45:34Z)
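To make the "dynamic linear" idea above concrete, here is a generic toy layer whose output is W(x)·x with an input-dependent weight, so the computation on a given input is summarized exactly by one linear map. This is only an illustration of the concept, not the B-cos formulation from the paper; every name and size is ours.

```python
import torch
import torch.nn as nn

class DynamicLinear(nn.Module):
    """Toy dynamic-linear layer: y = W(x) @ x, where W(x) depends on the input."""
    def __init__(self, dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim) / dim ** 0.5)
        self.gate = nn.Linear(dim, dim, bias=False)  # produces input-dependent row scales

    def forward(self, x):                            # x: (B, dim)
        g = torch.sigmoid(self.gate(x))              # (B, dim)
        w_x = g.unsqueeze(-1) * self.weight          # (B, dim, dim): one linear map per input
        y = torch.bmm(w_x, x.unsqueeze(-1)).squeeze(-1)
        return y, w_x                                # w_x exactly summarizes how y was computed

layer = DynamicLinear(8)
y, w_x = layer(torch.randn(4, 8))
# Stacking such layers preserves the property: the product of the per-input matrices
# summarizes the whole model on that input, which is what enables holistic explanations.
```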
- Boosting vision transformers for image retrieval [11.441395750267052]
Vision transformers have achieved remarkable progress in vision tasks such as image classification and detection.
However, in instance-level image retrieval, transformers have not yet shown good performance compared to convolutional networks.
We propose a number of improvements that make transformers outperform the state of the art for the first time.
arXiv Detail & Related papers (2022-10-21T12:17:12Z)
- On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition [18.557920268145818]
Video vision transformers have been shown to be competitive with convolution-based methods (CNNs) broadly across multiple vision tasks.
Our work empirically explores the low data regime for video classification and discovers that, surprisingly, transformers perform extremely well in the low-labeled video setting.
We even show that, using just the labeled data, transformers significantly outperform complex semi-supervised CNN methods that additionally leverage large-scale unlabeled data.
arXiv Detail & Related papers (2022-09-15T17:12:30Z)
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
In running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9× speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- ATS: Adaptive Token Sampling For Efficient Vision Transformers [33.297806854292155]
We introduce a differentiable parameter-free Adaptive Token Sampling (ATS) module, which can be plugged into any existing vision transformer architecture.
ATS empowers vision transformers by scoring and adaptively sampling significant tokens.
Our evaluations show that the proposed module improves the state of the art by reducing the computational cost (GFLOPs) by 37% while preserving accuracy (a simplified token-selection sketch follows this entry).
arXiv Detail & Related papers (2021-11-30T18:56:57Z)
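The entry above reduces tokens by scoring them and sampling adaptively. Below is a deliberately simplified sketch that keeps the top-k patch tokens ranked by class-token attention; the actual ATS module is differentiable and parameter-free rather than a hard top-k, and every name here is ours.

```python
import torch

def prune_tokens(tokens: torch.Tensor, cls_attn: torch.Tensor, keep: int) -> torch.Tensor:
    """tokens:   (B, N, D) patch tokens (class token excluded).
    cls_attn: (B, N) attention of the class token over the patch tokens.
    Returns the `keep` highest-scoring tokens per image."""
    scores = cls_attn / cls_attn.sum(dim=1, keepdim=True)   # normalize scores per image
    idx = scores.topk(keep, dim=1).indices                  # indices of significant tokens
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    return torch.gather(tokens, 1, idx)

# Example: keep half of 196 patch tokens to cut downstream compute.
tok, attn = torch.randn(4, 196, 384), torch.rand(4, 196)
print(prune_tokens(tok, attn, keep=98).shape)               # torch.Size([4, 98, 384])
```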
- ConvNets vs. Transformers: Whose Visual Representations are More Transferable? [49.62201738334348]
We investigate the transfer learning ability of ConvNets and vision transformers in 15 single-task and multi-task performance evaluations.
We observe consistent advantages of Transformer-based backbones on 13 downstream tasks.
arXiv Detail & Related papers (2021-08-11T16:20:38Z)
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
- Glance-and-Gaze Vision Transformer [13.77016463781053]
We propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer).
It is motivated by the Glance and Gaze behavior of human beings when recognizing objects in natural scenes.
We empirically demonstrate our method achieves consistently superior performance over previous state-of-the-art Transformers.
arXiv Detail & Related papers (2021-06-04T06:13:47Z)
- Improve Vision Transformers Training by Suppressing Over-smoothing [28.171262066145612]
Introducing the transformer structure into computer vision tasks holds the promise of yielding a better speed-accuracy trade-off than traditional convolution networks.
However, directly training vanilla transformers on vision tasks has been shown to yield unstable and sub-optimal results.
Recent works propose to modify transformer structures by incorporating convolutional layers to improve the performance on vision tasks.
arXiv Detail & Related papers (2021-04-26T17:43:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.