So-ViT: Mind Visual Tokens for Vision Transformer
- URL: http://arxiv.org/abs/2104.10935v1
- Date: Thu, 22 Apr 2021 09:05:09 GMT
- Title: So-ViT: Mind Visual Tokens for Vision Transformer
- Authors: Jiangtao Xie, Ruiren Zeng, Qilong Wang, Ziqi Zhou, Peihua Li
- Abstract summary: We propose a new classification paradigm, where the second-order, cross-covariance pooling of visual tokens is combined with the class token for final classification.
We develop a light-weight, hierarchical module based on off-the-shelf convolutions for visual token embedding.
The results show our models, when trained from scratch, outperform the competing ViT variants, while being on par with or better than state-of-the-art CNN models.
- Score: 27.243241133304785
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, the vision transformer (ViT) architecture, whose backbone
consists purely of self-attention, has achieved very promising performance
in visual classification. However, the high performance of the original ViT
heavily depends on pretraining using ultra large-scale datasets, and it
significantly underperforms on ImageNet-1K if trained from scratch. This paper
makes an effort toward addressing this problem by carefully considering the
role of visual tokens. First, for the classification head, the existing ViT
exploits only the class token while entirely neglecting the rich semantic
information inherent in high-level visual tokens. Therefore, we propose a new
classification paradigm, where the second-order, cross-covariance pooling of
visual tokens is combined with the class token for final classification. Meanwhile,
a fast singular value power normalization is proposed for improving the
second-order pooling. Second, the original ViT employs the naive embedding of
fixed-size image patches, lacking the ability to model translation equivariance
and locality. To alleviate this problem, we develop a light-weight,
hierarchical module based on off-the-shelf convolutions for visual token
embedding. The proposed architecture, which we call So-ViT, is thoroughly
evaluated on ImageNet-1K. The results show our models, when trained from
scratch, outperform the competing ViT variants, while being on par with or
better than state-of-the-art CNN models. Code is available at
https://github.com/jiangtaoxie/So-ViT
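For illustration, here is a minimal PyTorch-style sketch of the proposed classification paradigm: the visual tokens are projected into two low-dimensional subspaces, their cross-covariance matrix is pooled and power-normalized on its singular values, and the result is fused with the standard class-token prediction. The projection width, the additive fusion of the two logit vectors, and the use of an exact SVD in place of the paper's fast power normalization are assumptions, not So-ViT's exact design.

```python
import torch
import torch.nn as nn


class CrossCovarianceHead(nn.Module):
    """Sketch of a second-order, cross-covariance pooling head fused with the
    class token. Projection width, additive fusion, and the exact-SVD
    normalization are illustrative assumptions, not So-ViT's exact design."""

    def __init__(self, embed_dim=384, proj_dim=64, num_classes=1000, alpha=0.5):
        super().__init__()
        self.proj_a = nn.Linear(embed_dim, proj_dim)  # first subspace
        self.proj_b = nn.Linear(embed_dim, proj_dim)  # second subspace
        self.alpha = alpha                            # singular value power
        self.fc_tokens = nn.Linear(proj_dim * proj_dim, num_classes)
        self.fc_cls = nn.Linear(embed_dim, num_classes)

    def _sv_power_norm(self, cov):
        # Singular value power normalization: raise the singular values of the
        # (generally non-symmetric) cross-covariance matrix to the power alpha.
        # The paper proposes a fast approximation; exact SVD is used here for clarity.
        u, s, vh = torch.linalg.svd(cov, full_matrices=False)
        return u @ torch.diag_embed(s.clamp_min(1e-6).pow(self.alpha)) @ vh

    def forward(self, cls_token, visual_tokens):
        # cls_token: (B, D); visual_tokens: (B, N, D)
        a = self.proj_a(visual_tokens)            # (B, N, p)
        b = self.proj_b(visual_tokens)            # (B, N, p)
        a = a - a.mean(dim=1, keepdim=True)       # center before covariance
        b = b - b.mean(dim=1, keepdim=True)
        cov = a.transpose(1, 2) @ b / a.shape[1]  # (B, p, p) cross-covariance
        cov = self._sv_power_norm(cov)
        token_logits = self.fc_tokens(cov.flatten(1))  # prediction from visual tokens
        cls_logits = self.fc_cls(cls_token)            # standard class-token prediction
        return token_logits + cls_logits               # combine both for the final score


# Example with random ViT outputs: one class token plus 196 visual tokens.
head = CrossCovarianceHead(embed_dim=384, proj_dim=64, num_classes=1000)
logits = head(torch.randn(2, 384), torch.randn(2, 196, 384))  # -> (2, 1000)
```

The convolutional token-embedding module can likewise be pictured as a small stack of strided, off-the-shelf convolutions replacing the naive fixed-size patch projection; the kernel sizes, strides, and channel widths below are assumptions rather than So-ViT's exact configuration.

```python
import torch
import torch.nn as nn


class ConvTokenEmbedding(nn.Module):
    """Sketch of a lightweight, hierarchical convolutional stem for visual token
    embedding; layer widths and strides are illustrative assumptions."""

    def __init__(self, in_chans=3, embed_dim=384):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_chans, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=2, stride=2),  # total stride 16
        )

    def forward(self, x):
        # x: (B, 3, H, W) -> tokens: (B, N, embed_dim) with N = (H / 16) * (W / 16)
        feat = self.stem(x)                     # (B, embed_dim, H/16, W/16)
        return feat.flatten(2).transpose(1, 2)  # one token per spatial location


tokens = ConvTokenEmbedding(embed_dim=384)(torch.randn(2, 3, 224, 224))  # -> (2, 196, 384)
```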
Related papers
- A Comparative Survey of Vision Transformers for Feature Extraction in Texture Analysis [9.687982148528187]
Convolutional Neural Networks (CNNs) are currently among the best texture analysis approaches.
Vision Transformers (ViTs) have been surpassing the performance of CNNs on tasks such as object recognition.
This work explores various pre-trained ViT architectures when transferred to tasks that rely on textures.
arXiv Detail & Related papers (2024-06-10T09:48:13Z)
- Rethinking Hierarchicies in Pre-trained Plain Vision Transformer [76.35955924137986]
Self-supervised pre-training vision transformer (ViT) via masked image modeling (MIM) has been proven very effective.
It has been argued that customized algorithms, e.g., GreenMIM, should be carefully designed for hierarchical ViTs, instead of using the vanilla and simple MAE for the plain ViT.
This paper proposes a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training.
arXiv Detail & Related papers (2022-11-03T13:19:23Z)
- Patch-level Representation Learning for Self-supervised Vision Transformers [68.8862419248863]
Vision Transformers (ViTs) have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks.
Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations.
We demonstrate that SelfPatch can significantly improve the performance of existing SSL methods for various visual tasks.
arXiv Detail & Related papers (2022-06-16T08:01:19Z)
- ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator [21.351034332423374]
We propose a novel ViT based fine-grained object discriminator for Fine-Grained Visual Classification (FGVC) tasks.
Besides a ViT backbone, it introduces three novel components: Attention Patch Combination (APC), Critical Regions Filter (CRF), and Complementary Tokens Integration (CTI).
We conduct comprehensive experiments on widely used datasets and the results demonstrate that ViT-FOD is able to achieve state-of-the-art performance.
arXiv Detail & Related papers (2022-03-24T02:34:57Z)
- PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers [9.63371509052453]
This paper proposes to learn Patch-to-Cluster attention (PaCa) in Vision Transformers (ViT).
The proposed PaCa module is used in designing efficient and interpretable ViT backbones and semantic segmentation head networks.
It is significantly more efficient than PVT models on MS-COCO and MIT ADE20K due to its linear complexity.
arXiv Detail & Related papers (2022-03-22T18:28:02Z)
- VOLO: Vision Outlooker for Visual Recognition [148.12522298731807]
Vision transformers (ViTs) have shown great potential of self-attention based models in ImageNet classification.
We introduce a novel outlook attention and present a simple and general architecture, termed Vision Outlooker (VOLO).
Unlike self-attention that focuses on global dependency modeling at a coarse level, the outlook attention efficiently encodes finer-level features and contexts into tokens.
Experiments show that our VOLO achieves 87.1% top-1 accuracy on ImageNet-1K classification, which is the first model exceeding 87% accuracy on this competitive benchmark.
arXiv Detail & Related papers (2021-06-24T15:46:54Z)
- DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
- Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
This brings a great benefit: depth/width/resolution/patch size can be scaled up without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z)
- Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [128.96032932640364]
We propose a new Tokens-To-Token Vision Transformer (T2T-ViT) to solve vision tasks.
T2T-ViT reduces the parameter count and MACs of the vanilla ViT by half, while achieving more than 2.5% improvement when trained from scratch on ImageNet.
For example, T2T-ViT with ResNet50 comparable size can achieve 80.7% top-1 accuracy on ImageNet.
arXiv Detail & Related papers (2021-01-28T13:25:28Z)