MaxViT: Multi-Axis Vision Transformer
- URL: http://arxiv.org/abs/2204.01697v1
- Date: Mon, 4 Apr 2022 17:59:44 GMT
- Title: MaxViT: Multi-Axis Vision Transformer
- Authors: Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar,
Alan Bovik, Yinxiao Li
- Abstract summary: We introduce an efficient and scalable attention model we call multi-axis attention.
We present a new architectural element by effectively blending our proposed attention model with convolutions.
We demonstrate the effectiveness of our model on a broad spectrum of vision tasks.
- Score: 19.192826213493838
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Transformers have recently gained significant attention in the computer
vision community. However, the lack of scalability of self-attention mechanisms
with respect to image size has limited their wide adoption in state-of-the-art
vision backbones. In this paper we introduce an efficient and scalable
attention model we call multi-axis attention, which consists of two aspects:
blocked local and dilated global attention. These design choices allow
global-local spatial interactions on arbitrary input resolutions with only
linear complexity. We also present a new architectural element by effectively
blending our proposed attention model with convolutions, and accordingly
propose a simple hierarchical vision backbone, dubbed MaxViT, by simply
repeating the basic building block over multiple stages. Notably, MaxViT is
able to "see" globally throughout the entire network, even in earlier,
high-resolution stages. We demonstrate the effectiveness of our model on a
broad spectrum of vision tasks. On image classification, MaxViT achieves
state-of-the-art performance under various settings: without extra data, MaxViT
attains 86.5% ImageNet-1K top-1 accuracy; with ImageNet-21K pre-training, our
model achieves 88.7% top-1 accuracy. For downstream tasks, MaxViT as a
backbone delivers favorable performance on object detection as well as visual
aesthetic assessment. We also show that our proposed model expresses strong
generative modeling capability on ImageNet, demonstrating the superior
potential of MaxViT blocks as a universal vision module. We will make the code
and models publicly available.
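The two attention forms named in the abstract (blocked local attention within non-overlapping windows, and dilated global attention over a sparse grid) can be illustrated with a short sketch. The code below is a minimal, unofficial rendering of that idea in PyTorch: the partition helpers, the window/grid size of 8, and the plain nn.MultiheadAttention layers are illustrative assumptions, and the MBConv, relative position bias, MLP, and normalization pieces of the full MaxViT block are omitted.

```python
# Minimal, unofficial sketch of multi-axis attention: blocked local attention
# inside non-overlapping P x P windows, then dilated global attention over a
# sparse G x G grid. Window/grid size 8 and plain nn.MultiheadAttention are
# illustrative assumptions, not the released MaxViT implementation.
import torch
import torch.nn as nn


def block_partition(x, p):
    """(B, H, W, C) -> (B * H/p * W/p, p*p, C): tokens grouped per local window."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // p, p, W // p, p, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, p * p, C)


def block_reverse(windows, p, H, W, C):
    B = windows.shape[0] // ((H // p) * (W // p))
    x = windows.reshape(B, H // p, W // p, p, p, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(B, H, W, C)


def grid_partition(x, g):
    """(B, H, W, C) -> (B * H/g * W/g, g*g, C): each group takes one token from
    every (H/g x W/g) tile, i.e. a dilated, image-wide mixing pattern."""
    B, H, W, C = x.shape
    x = x.reshape(B, g, H // g, g, W // g, C).permute(0, 2, 4, 1, 3, 5)
    return x.reshape(-1, g * g, C)


def grid_reverse(grids, g, H, W, C):
    B = grids.shape[0] // ((H // g) * (W // g))
    x = grids.reshape(B, H // g, W // g, g, g, C).permute(0, 3, 1, 4, 2, 5)
    return x.reshape(B, H, W, C)


class MultiAxisAttention(nn.Module):
    """Block (local) attention followed by grid (global) attention.

    Both steps attend over fixed-size token groups, so cost is linear in H * W.
    """

    def __init__(self, dim, heads=4, window=8, grid=8):
        super().__init__()
        self.window, self.grid = window, grid
        self.block_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.grid_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, H, W, C), H and W divisible by window/grid
        B, H, W, C = x.shape
        w = block_partition(x, self.window)     # local windows
        w, _ = self.block_attn(w, w, w)
        x = block_reverse(w, self.window, H, W, C)
        g = grid_partition(x, self.grid)        # sparse global grid
        g, _ = self.grid_attn(g, g, g)
        return grid_reverse(g, self.grid, H, W, C)


if __name__ == "__main__":
    x = torch.randn(2, 32, 32, 64)
    print(MultiAxisAttention(dim=64)(x).shape)  # torch.Size([2, 32, 32, 64])
```

Because each attention call runs over fixed-size groups of tokens, the cost of both steps grows linearly with the number of pixels, which is the property the abstract highlights.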
Related papers
- Vision Mamba: Efficient Visual Representation Learning with
Bidirectional State Space Model [51.10876815815515]
We propose a new generic vision backbone with bidirectional Mamba blocks (Vim).
Vim marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models.
The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images.
arXiv Detail & Related papers (2024-01-17T18:56:18Z)
- MaxSR: Image Super-Resolution Using Improved MaxViT [34.53995225219387]
We present a single image super-resolution model based on the recent hybrid vision transformer MaxViT, named MaxSR.
Our proposed models for classical single image super-resolution (MaxSR) and lightweight single image super-resolution (MaxSR-light) efficiently establish new state-of-the-art performance.
arXiv Detail & Related papers (2023-07-14T09:26:47Z)
- MULLER: Multilayer Laplacian Resizer for Vision [16.67232499096539]
We present an extremely lightweight multilayer Laplacian resizer with only a handful of trainable parameters, dubbed MULLER resizer.
We show that MULLER can be easily plugged into various training pipelines, and it effectively boosts the performance of the underlying vision task with little to no extra cost; a rough sketch of the idea follows.
arXiv Detail & Related papers (2023-04-06T04:39:21Z)
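As a hedged illustration of how a resizer with "only a handful of trainable parameters" can work, the sketch below resizes bilinearly and then adds back Laplacian (band-pass) detail scaled by a few learnable scalars. The layer count, the fixed 3x3 Gaussian kernel, and the per-band (weight, bias) parameterization are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a multilayer Laplacian resizer: bilinear resize, then add
# back band-pass detail scaled by a few learnable scalars. The layer count,
# fixed Gaussian kernel, and per-band (weight, bias) pair are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def gaussian_blur(x):
    """Depthwise 3x3 Gaussian blur with 'same' padding."""
    k = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0
    k = k.view(1, 1, 3, 3).repeat(x.shape[1], 1, 1, 1).to(x)
    return F.conv2d(x, k, padding=1, groups=x.shape[1])


class LaplacianResizer(nn.Module):
    def __init__(self, num_layers=2):
        super().__init__()
        # One (scale, offset) pair per band: only a handful of parameters.
        self.weights = nn.Parameter(torch.zeros(num_layers))
        self.biases = nn.Parameter(torch.zeros(num_layers))

    def forward(self, x, size):
        y = F.interpolate(x, size=size, mode="bilinear", align_corners=False)
        base = y
        for w, b in zip(self.weights, self.biases):
            detail = base - gaussian_blur(base)   # band-pass (Laplacian) residual
            y = y + w * detail + b                # learnable detail boost
            base = gaussian_blur(base)            # move to the next, coarser band
        return y


if __name__ == "__main__":
    out = LaplacianResizer()(torch.randn(1, 3, 224, 224), size=(112, 112))
    print(out.shape)  # torch.Size([1, 3, 112, 112])
```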
- You Only Train Once: Multi-Identity Free-Viewpoint Neural Human
Rendering from Monocular Videos [10.795522875068073]
You Only Train Once (YOTO) is a dynamic human generation framework, which performs free-viewpoint rendering of different human identities with distinct motions.
In this paper, we propose a set of learnable identity codes to expand the capability of the framework for multi-identity free-viewpoint rendering.
YOTO shows state-of-the-art performance on all evaluation metrics while showing significant benefits in training and inference efficiency as well as rendering quality.
arXiv Detail & Related papers (2023-03-10T10:23:17Z)
- A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With less than 14M parameters, our FCViT-S12 outperforms the related ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K; a rough sketch of the query-irrelevant context idea follows.
arXiv Detail & Related papers (2022-12-23T19:13:43Z)
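A hedged sketch of the "query-irrelevant global context" idea: a single softmax-pooled context vector, shared by all positions, is added back to the feature map before a convolution. The specific pooling and fusion choices below are illustrative assumptions rather than the paper's exact design.

```python
# Hedged sketch of a query-irrelevant global context fused into a convolution:
# one softmax-pooled context vector, shared by all spatial positions, is added
# to the feature map before a depthwise convolution. Pooling and fusion choices
# here are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class GlobalContextConvBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.ctx_logits = nn.Conv2d(dim, 1, kernel_size=1)   # spatial pooling weights
        self.ctx_proj = nn.Conv2d(dim, dim, kernel_size=1)   # transform the context
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):                                    # x: (B, C, H, W)
        B, C, H, W = x.shape
        # Query-irrelevant pooling: one attention map over all positions,
        # shared by every output location (no per-query attention).
        attn = self.ctx_logits(x).flatten(2).softmax(dim=-1)        # (B, 1, H*W)
        ctx = torch.bmm(x.flatten(2), attn.transpose(1, 2))         # (B, C, 1)
        ctx = self.ctx_proj(ctx.view(B, C, 1, 1))                   # (B, C, 1, 1)
        return self.dwconv(x + ctx)                                 # broadcast add, then conv


if __name__ == "__main__":
    y = GlobalContextConvBlock(64)(torch.randn(2, 64, 14, 14))
    print(y.shape)  # torch.Size([2, 64, 14, 14])
```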
- Advancing Plain Vision Transformer Towards Remote Sensing Foundation
Model [97.9548609175831]
We resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models customized for remote sensing tasks.
Specifically, to handle the large image size and objects of various orientations in RS images, we propose a new rotated varied-size window attention.
Experiments on detection tasks demonstrate the superiority of our model over all state-of-the-art models, achieving 81.16% mAP on the DOTA-V1.0 dataset.
arXiv Detail & Related papers (2022-08-08T09:08:40Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for
Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring the intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Florence: A New Foundation Model for Computer Vision [97.26333007250142]
We introduce a new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object).
By incorporating universal visual-language representations from Web-scale image-text data, our Florence model can be easily adapted for various computer vision tasks.
Florence achieves new state-of-the-art results on the majority of 44 representative benchmarks.
arXiv Detail & Related papers (2021-11-22T18:59:55Z)
- Multi-Scale Vision Longformer: A New Vision Transformer for
High-Resolution Image Encoding [81.07894629034767]
This paper presents a new Vision Transformer (ViT) architecture Multi-Scale Vision Longformer.
It significantly enhances the ViT of Dosovitskiy et al. (2020) for encoding high-resolution images using two techniques.
arXiv Detail & Related papers (2021-03-29T06:23:20Z)