VOLO: Vision Outlooker for Visual Recognition
- URL: http://arxiv.org/abs/2106.13112v2
- Date: Mon, 28 Jun 2021 14:40:33 GMT
- Title: VOLO: Vision Outlooker for Visual Recognition
- Authors: Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, Shuicheng Yan
- Abstract summary: Vision transformers (ViTs) have shown great potential of self-attention based models in ImageNet classification.
We introduce a novel outlook attention and present a simple and general architecture, termed Vision Outlooker (VOLO)
Unlike self-attention that focuses on global dependency modeling at a coarse level, the outlook attention efficiently encodes finer-level features and contexts into tokens.
Experiments show that our VOLO achieves 87.1% top-1 accuracy on ImageNet-1K classification, which is the first model exceeding 87% accuracy on this competitive benchmark.
- Score: 148.12522298731807
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual recognition has been dominated by convolutional neural networks (CNNs)
for years. Though recently the prevailing vision transformers (ViTs) have shown
great potential of self-attention based models in ImageNet classification,
their performance is still inferior to that of the latest SOTA CNNs if no extra
data are provided. In this work, we try to close the performance gap and
demonstrate that attention-based models are indeed able to outperform CNNs. We
find a major factor limiting the performance of ViTs for ImageNet
classification is their low efficacy in encoding fine-level features into the
token representations. To resolve this, we introduce a novel outlook attention
and present a simple and general architecture, termed Vision Outlooker (VOLO).
Unlike self-attention that focuses on global dependency modeling at a coarse
level, the outlook attention efficiently encodes finer-level features and
contexts into tokens, which is shown to be critically beneficial to recognition
performance but largely ignored by the self-attention. Experiments show that
our VOLO achieves 87.1% top-1 accuracy on ImageNet-1K classification, which is
the first model exceeding 87% accuracy on this competitive benchmark, without
using any extra training data In addition, the pre-trained VOLO transfers well
to downstream tasks, such as semantic segmentation. We achieve 84.3% mIoU score
on the cityscapes validation set and 54.3% on the ADE20K validation set. Code
is available at \url{https://github.com/sail-sg/volo}.
Related papers
- Combined CNN and ViT features off-the-shelf: Another astounding baseline for recognition [49.14350399025926]
We apply pre-trained architectures, originally developed for the ImageNet Large Scale Visual Recognition Challenge, for periocular recognition.
Middle-layer features from CNNs and ViTs are a suitable way to recognize individuals based on periocular images.
arXiv Detail & Related papers (2024-07-28T11:52:36Z) - SeiT++: Masked Token Modeling Improves Storage-efficient Training [36.95646819348317]
Recent advancements in Deep Neural Network (DNN) models have significantly improved performance across computer vision tasks.
achieving highly generalizable and high-performing vision models requires expansive datasets, resulting in significant storage requirements.
Recent breakthrough by SeiT proposed the use of Vector-Quantized (VQ) feature vectors (i.e., tokens) as network inputs for vision classification.
In this paper, we extend SeiT by integrating Masked Token Modeling (MTM) for self-supervised pre-training.
arXiv Detail & Related papers (2023-12-15T04:11:34Z) - FasterViT: Fast Vision Transformers with Hierarchical Attention [63.50580266223651]
We design a new family of hybrid CNN-ViT neural networks, named FasterViT, with a focus on high image throughput for computer vision (CV) applications.
Our newly introduced Hierarchical Attention (HAT) approach decomposes global self-attention with quadratic complexity into a multi-level attention with reduced computational costs.
arXiv Detail & Related papers (2023-06-09T18:41:37Z) - ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders [104.05133094625137]
We propose a fully convolutional masked autoencoder framework and a new Global Response Normalization layer.
This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets.
arXiv Detail & Related papers (2023-01-02T18:59:31Z) - Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach which leverages two different and complimentary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z) - Understanding The Robustness in Vision Transformers [140.1090560977082]
Self-attention may promote robustness through improved mid-level representations.
We propose a family of fully attentional networks (FANs) that strengthen this capability.
Our model achieves a state of-the-art 87.1% accuracy and 35.8% mCE on ImageNet-1k and ImageNet-C with 76.8M parameters.
arXiv Detail & Related papers (2022-04-26T17:16:32Z) - Attend and Guide (AG-Net): A Keypoints-driven Attention-based Deep
Network for Image Recognition [13.230646408771868]
We propose an end-to-end CNN model, which learns meaningful features linking fine-grained changes using our novel attention mechanism.
It captures the spatial structures in images by identifying semantic regions (SRs) and their spatial distributions, and is proved to be the key to modelling subtle changes in images.
The framework is evaluated on six diverse benchmark datasets.
arXiv Detail & Related papers (2021-10-23T09:43:36Z) - So-ViT: Mind Visual Tokens for Vision Transformer [27.243241133304785]
We propose a new classification paradigm, where the second-order, cross-covariance pooling of visual tokens is combined with class token for final classification.
We develop a light-weight, hierarchical module based on off-the-shelf convolutions for visual token embedding.
The results show our models, when trained from scratch, outperform the competing ViT variants, while being on par with or better than state-of-the-art CNN models.
arXiv Detail & Related papers (2021-04-22T09:05:09Z) - Compounding the Performance Improvements of Assembled Techniques in a
Convolutional Neural Network [6.938261599173859]
We show how to improve the accuracy and robustness of basic CNN models.
Our proposed assembled ResNet-50 shows improvements in top-1 accuracy from 76.3% to 82.78%, mCE from 76.0% to 48.9% and mFR from 57.7% to 32.3%.
Our approach achieved 1st place in the iFood Competition Fine-Grained Visual Recognition at CVPR 2019.
arXiv Detail & Related papers (2020-01-17T12:42:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.