Swin Transformer V2: Scaling Up Capacity and Resolution
- URL: http://arxiv.org/abs/2111.09883v1
- Date: Thu, 18 Nov 2021 18:59:33 GMT
- Title: Swin Transformer V2: Scaling Up Capacity and Resolution
- Authors: Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and
Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei
and Baining Guo
- Abstract summary: We present techniques for scaling Swin Transformer up to 3 billion parameters and making it capable of training with images of up to 1,536$\times$1,536 resolution.
By scaling up capacity and resolution, Swin Transformer sets new records on four representative vision benchmarks.
- Score: 45.462916348268664
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present techniques for scaling Swin Transformer up to 3 billion parameters
and making it capable of training with images of up to 1,536$\times$1,536
resolution. By scaling up capacity and resolution, Swin Transformer sets new
records on four representative vision benchmarks: 84.0% top-1 accuracy on
ImageNet-V2 image classification, 63.1/54.4 box/mask mAP on COCO object
detection, 59.9 mIoU on ADE20K semantic segmentation, and 86.8% top-1 accuracy
on Kinetics-400 video action classification. Our techniques are generally
applicable for scaling up vision models, which has not been as widely explored as
the scaling of NLP language models, partly due to the following difficulties in
training and applications: 1) vision models often face instability issues at
scale and 2) many downstream vision tasks require high resolution images or
windows and it is not clear how to effectively transfer models pre-trained at
low resolutions to higher resolution ones. The GPU memory consumption is also a
problem when the image resolution is high. To address these issues, we present
several techniques, which are illustrated by using Swin Transformer as a case
study: 1) a post normalization technique and a scaled cosine attention approach
to improve the stability of large vision models; 2) a log-spaced continuous
position bias technique to effectively transfer models pre-trained at
low-resolution images and windows to their higher-resolution counterparts. In
addition, we share our crucial implementation details that lead to significant
savings of GPU memory consumption and thus make it feasible to train large
vision models with regular GPUs. Using these techniques and self-supervised
pre-training, we successfully train a strong 3B Swin Transformer model and
effectively transfer it to various vision tasks involving high-resolution
images or windows, achieving the state-of-the-art accuracy on a variety of
benchmarks.
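The stability and transfer techniques named in the abstract are concrete enough to sketch in code. Below is a minimal, illustrative PyTorch reconstruction of (1) scaled cosine attention with residual post-normalization and (2) a log-spaced continuous position bias; the names and hyperparameters (`ScaledCosineAttention`, `log_spaced_relative_coords`, the 2-layer `cpb_mlp` with 64 hidden units, the initial logit scale of 10) are assumptions for this sketch, not the authors' released implementation.

```python
# Minimal sketch of two Swin V2 ideas from the abstract: scaled cosine attention
# with residual post-normalization, and a log-spaced continuous position bias.
# Names and hyperparameters are illustrative, not the official code.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def log_spaced_relative_coords(window_size: int) -> torch.Tensor:
    """Log-spaced (dy, dx) offsets for every pair of positions in a window."""
    coords = torch.arange(window_size)
    grid = torch.stack(torch.meshgrid(coords, coords, indexing="ij"), dim=-1).reshape(-1, 2)
    rel = (grid[:, None, :] - grid[None, :, :]).float()  # signed offsets, (W*W, W*W, 2)
    # Log-spacing compresses large offsets, so a model pre-trained on small
    # windows extrapolates more gracefully to larger ones.
    rel = torch.sign(rel) * torch.log2(1.0 + rel.abs())
    return rel / math.log2(1.0 + window_size)            # roughly normalize to [-1, 1]


class ScaledCosineAttention(nn.Module):
    """Window attention using cosine similarity with a learned, clamped scale."""

    def __init__(self, dim: int, num_heads: int, window_size: int):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=True)
        self.proj = nn.Linear(dim, dim)
        # Learned per-head temperature, clamped in forward() to avoid collapse.
        self.logit_scale = nn.Parameter(torch.log(10.0 * torch.ones(num_heads, 1, 1)))
        # Continuous position bias: a tiny MLP maps log-spaced offsets to per-head biases.
        self.cpb_mlp = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, num_heads))
        self.register_buffer("rel_coords", log_spaced_relative_coords(window_size))
        # Post-normalization: LayerNorm applied after the residual branch output.
        self.post_norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) for a single window; tokens == window_size ** 2.
        b, n, c = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, c // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)             # each (b, heads, n, head_dim)

        # Cosine similarity instead of dot product keeps attention logits in a
        # bounded range, which is what stabilizes very large models.
        attn = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
        scale = torch.clamp(self.logit_scale, max=math.log(100.0)).exp()
        attn = attn * scale

        # Add the continuous relative position bias predicted by the MLP.
        bias = self.cpb_mlp(self.rel_coords)             # (n, n, heads)
        attn = attn + bias.permute(2, 0, 1).unsqueeze(0)
        attn = attn.softmax(dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        # Residual connection followed by post-norm (rather than pre-norm).
        return self.post_norm(x + self.proj(out))
```

To transfer a model pre-trained with small windows to a larger window, one would recompute `rel_coords` for the new window size while reusing the same `cpb_mlp`, which is the transfer scenario the log-spaced continuous parameterization is designed for.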
Related papers
- ViTAR: Vision Transformer with Any Resolution [80.95324692984903]
Vision Transformers experience a performance decline when processing resolutions different from those seen during training.
We introduce fuzzy positional encoding in the Vision Transformer to provide consistent positional awareness across multiple resolutions.
Our resulting model, ViTAR, demonstrates impressive adaptability, achieving 83.3% top-1 accuracy at a 1120x1120 resolution and 80.4% accuracy at a 4032x4032 resolution.
arXiv Detail & Related papers (2024-03-27T08:53:13Z)
- xT: Nested Tokenization for Larger Context in Large Images [79.37673340393475]
xT is a framework for vision transformers which aggregates global context with local details.
We are able to increase accuracy by up to 8.6% on challenging classification tasks.
arXiv Detail & Related papers (2024-03-04T10:29:58Z)
- MULLER: Multilayer Laplacian Resizer for Vision [16.67232499096539]
We present an extremely lightweight multilayer Laplacian resizer with only a handful of trainable parameters, dubbed MULLER resizer.
We show that MULLER can be easily plugged into various training pipelines, and it effectively boosts the performance of the underlying vision task with little to no extra cost.
arXiv Detail & Related papers (2023-04-06T04:39:21Z)
- PatchDropout: Economizing Vision Transformers Using Patch Dropout [9.243684409949436]
We show that standard ViT models can be efficiently trained at high resolution by randomly dropping input image patches.
We observe a 5 times savings in computation and memory using PatchDropout, along with a boost in performance.
arXiv Detail & Related papers (2022-08-10T14:08:55Z)
- Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model [97.9548609175831]
We resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models customized for remote sensing tasks.
Specifically, to handle the large image size and objects of various orientations in RS images, we propose a new rotated varied-size window attention.
Experiments on detection tasks demonstrate the superiority of our model over all state-of-the-art models, achieving 81.16% mAP on the DOTA-V1.0 dataset.
arXiv Detail & Related papers (2022-08-08T09:08:40Z)
- Focal Self-attention for Local-Global Interactions in Vision Transformers [90.9169644436091]
We present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions.
With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers.
arXiv Detail & Related papers (2021-07-01T17:56:09Z)
- Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z)