AutoFormer: Searching Transformers for Visual Recognition
- URL: http://arxiv.org/abs/2107.00651v1
- Date: Thu, 1 Jul 2021 17:59:30 GMT
- Title: AutoFormer: Searching Transformers for Visual Recognition
- Authors: Minghao Chen, Houwen Peng, Jianlong Fu, Haibin Ling
- Abstract summary: We propose a new one-shot architecture search framework, namely AutoFormer, dedicated to vision transformer search.
AutoFormer entangles the weights of different blocks in the same layers during supernet training.
We show that AutoFormer-tiny/small/base achieve 74.7%/81.7%/82.4% top-1 accuracy on ImageNet with 5.7M/22.9M/53.7M parameters.
- Score: 97.60915598958968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, pure transformer-based models have shown great potential for
vision tasks such as image classification and detection. However, the design of
transformer networks is challenging. It has been observed that the depth,
embedding dimension, and number of heads can largely affect the performance of
vision transformers. Previous models configure these dimensions based upon
manual crafting. In this work, we propose a new one-shot architecture search
framework, namely AutoFormer, dedicated to vision transformer search.
AutoFormer entangles the weights of different blocks in the same layers during
supernet training. Benefiting from this strategy, the trained supernet allows
thousands of subnets to be very well trained: the performance of these subnets
with weights inherited from the supernet is comparable to that of the same
models retrained from scratch. In addition, the searched models, which we refer
to as AutoFormers, surpass recent state-of-the-art models such as ViT and DeiT. In
particular, AutoFormer-tiny/small/base achieve 74.7%/81.7%/82.4% top-1 accuracy
on ImageNet with 5.7M/22.9M/53.7M parameters, respectively. Lastly, we verify
the transferability of AutoFormer by providing the performance on downstream
benchmarks and distillation experiments. Code and models are available at
https://github.com/microsoft/AutoML.
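The core mechanism described in the abstract is weight entanglement: candidate blocks of different sizes within the same layer share slices of one common set of supernet weights, so any sampled subnet directly inherits its parameters. The following is a minimal PyTorch sketch of that sharing idea, assuming a toy EntangledLinear layer and a hypothetical search space of widths; it is an illustration, not the authors' implementation (see the repository above for that).

```python
# Minimal sketch of weight entanglement for one-shot supernet training.
# Hypothetical illustration only; see https://github.com/microsoft/AutoML
# for the authors' actual code.
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


class EntangledLinear(nn.Module):
    """Linear layer whose smaller candidate widths are slices of one shared weight."""

    def __init__(self, max_in: int, max_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out, max_in) * 0.02)
        self.bias = nn.Parameter(torch.zeros(max_out))

    def forward(self, x: torch.Tensor, out_dim: int) -> torch.Tensor:
        # A subnet with a smaller width reuses a slice of the shared weight,
        # so every sampled candidate updates the same underlying parameters.
        in_dim = x.shape[-1]
        return F.linear(x, self.weight[:out_dim, :in_dim], self.bias[:out_dim])


# One-shot training loop: sample a random candidate width per step and
# train the shared (entangled) weights through that subnet's forward pass.
layer = EntangledLinear(max_in=64, max_out=256)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)
x = torch.randn(8, 64)

for _ in range(3):
    out_dim = random.choice([128, 192, 256])  # hypothetical search space of widths
    loss = layer(x, out_dim).pow(2).mean()    # stand-in for the real task loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because every sampled width trains the same underlying tensor, a candidate found at search time can simply inherit its slice of the supernet weights, which is what allows the inherited subnets to match models retrained from scratch.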
Related papers
- Contrastive Learning for Multi-Object Tracking with Transformers [79.61791059432558]
We show how DETR can be turned into a MOT model by employing an instance-level contrastive loss.
Our training scheme learns object appearances while preserving detection capabilities and with little overhead.
Its performance surpasses the previous state-of-the-art by +2.6 mMOTA on the challenging BDD100K dataset.
arXiv Detail & Related papers (2023-11-14T10:07:52Z)
- Reversible Vision Transformers [74.3500977090597]
Reversible Vision Transformers are a memory-efficient architecture for visual recognition.
We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants.
We find that the additional computational burden of recomputing activations is more than offset for deeper models (a generic sketch of the recomputation idea follows this list).
arXiv Detail & Related papers (2023-02-09T18:59:54Z)
- A Study on Transformer Configuration and Training Objective [33.7272660870026]
We propose Bamboo, the idea of using deeper and narrower transformer configurations for masked-autoencoder training.
On ImageNet, with such a simple change in configuration, the re-designed model achieves 87.1% top-1 accuracy.
On language tasks, the re-designed model outperforms BERT with the default configuration by 1.1 points on average.
arXiv Detail & Related papers (2022-05-21T05:17:11Z)
- Efficient Visual Tracking with Exemplar Transformers [98.62550635320514]
We introduce the Exemplar Transformer, an efficient transformer for real-time visual object tracking.
E.T.Track, our visual tracker that incorporates Exemplar Transformer layers, runs at 47 fps on a CPU.
This is up to 8 times faster than other transformer-based models.
arXiv Detail & Related papers (2021-12-17T18:57:54Z)
- Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z)
- Self-Supervised Learning with Swin Transformers [24.956637957269926]
We present a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture.
The approach introduces essentially nothing new; it is a combination of MoCo v2 and BYOL.
Its performance is slightly better than recent works such as MoCo v3 and DINO, which adopt DeiT as the backbone, while relying on much lighter tricks.
arXiv Detail & Related papers (2021-05-10T17:59:45Z)
- Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [86.95679590801494]
We explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques.
We show that by slightly tuning the structure of vision transformers and introducing token labeling, our models achieve better results than their CNN counterparts.
arXiv Detail & Related papers (2021-04-22T04:43:06Z)
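As a side note on the Reversible Vision Transformers entry above, the following is a generic sketch of the reversible two-stream residual idea that makes activation recomputation possible; the toy sub-blocks and names are assumptions for illustration, not that paper's architecture or code.

```python
# Generic reversible (two-stream) block sketch, in the spirit of reversible
# networks: inputs can be reconstructed from outputs, so intermediate
# activations need not be cached during the forward pass.
import torch
import torch.nn as nn


class ReversibleBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Toy sub-blocks standing in for attention / MLP sub-layers.
        self.f = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
        self.g = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Exact reconstruction of the inputs from the outputs; this is what
        # allows activations to be recomputed instead of stored.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2


block = ReversibleBlock(dim=32)
x1, x2 = torch.randn(4, 32), torch.randn(4, 32)
with torch.no_grad():
    y1, y2 = block(x1, x2)
    r1, r2 = block.inverse(y1, y2)
    assert torch.allclose(r1, x1, atol=1e-5) and torch.allclose(r2, x2, atol=1e-5)
```

Because the inputs can be recovered (up to floating-point rounding), activations can be recomputed on the fly during backpropagation, which is the source of the memory savings that entry describes.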