Rethinking Spatial Dimensions of Vision Transformers
- URL: http://arxiv.org/abs/2103.16302v1
- Date: Tue, 30 Mar 2021 12:51:28 GMT
- Title: Rethinking Spatial Dimensions of Vision Transformers
- Authors: Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe,
Seong Joon Oh
- Abstract summary: Vision Transformer (ViT) extends the application range of transformers from language processing to computer vision tasks.
We investigate the role of spatial dimension conversion and its effectiveness in the transformer-based architecture.
We propose a novel Pooling-based Vision Transformer (PiT) built upon the original ViT model.
- Score: 34.13899937264952
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Vision Transformer (ViT) extends the application range of transformers from
language processing to computer vision tasks as an alternative architecture to the
existing convolutional neural networks (CNNs). Because the transformer-based
architecture is still new to computer vision modeling, design conventions for an
effective architecture have been little studied. Drawing on the successful design
principles of CNNs, we investigate the role of spatial dimension conversion and its
effectiveness in the transformer-based architecture. We particularly attend to the
dimension reduction principle of CNNs: as depth increases, a conventional CNN
increases the channel dimension and decreases the spatial dimensions. We empirically
show that such a spatial dimension reduction also benefits a transformer
architecture, and propose a novel Pooling-based Vision Transformer (PiT) built upon
the original ViT model. We show that PiT improves model capability and
generalization performance over ViT. Through extensive experiments, we further show
that PiT outperforms the ViT baseline on several tasks, including image
classification, object detection, and robustness evaluation. Source code and
ImageNet models are available at https://github.com/naver-ai/pit
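
The pooling stage described in the abstract can be pictured as a layer that shrinks the spatial token grid while widening the channel dimension, mirroring the CNN stage transition. Below is a minimal sketch, assuming PyTorch; the class name `TokenPooling`, its shapes, and its hyperparameters are illustrative and are not taken from the official naver-ai/pit code.

```python
# Minimal sketch of a PiT-style pooling stage (assumption: PyTorch; illustrative only).
import torch
import torch.nn as nn


class TokenPooling(nn.Module):
    """Reduce an HxW token grid by a stride and expand the channel dimension."""

    def __init__(self, dim_in: int, dim_out: int, stride: int = 2):
        super().__init__()
        # A strided, grouped convolution pools the spatial tokens and widens channels.
        self.pool = nn.Conv2d(dim_in, dim_out, kernel_size=stride + 1,
                              stride=stride, padding=stride // 2, groups=dim_in)
        # The class token has no spatial extent, so it is projected separately.
        self.cls_proj = nn.Linear(dim_in, dim_out)

    def forward(self, tokens: torch.Tensor, cls_token: torch.Tensor, hw: tuple):
        h, w = hw
        b, n, c = tokens.shape
        # (B, N, C) -> (B, C, H, W) so a 2D convolution can pool the token grid.
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = self.pool(x)
        new_hw = (x.shape[2], x.shape[3])
        # Back to the token-sequence layout expected by transformer blocks.
        tokens = x.flatten(2).transpose(1, 2)
        return tokens, self.cls_proj(cls_token), new_hw


# Usage: a 14x14 grid of 196 tokens with 64 channels -> 7x7 grid with 128 channels.
pool = TokenPooling(dim_in=64, dim_out=128)
tokens = torch.randn(1, 14 * 14, 64)
cls = torch.randn(1, 1, 64)
tokens, cls, hw = pool(tokens, cls, (14, 14))
print(tokens.shape, cls.shape, hw)  # torch.Size([1, 49, 128]) torch.Size([1, 1, 128]) (7, 7)
```

Stacking transformer blocks between such pooling stages yields the coarse-to-fine, narrow-to-wide progression that the abstract argues carries over from CNNs to transformers.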
Related papers
- Self-Supervised Pre-Training for Table Structure Recognition Transformer [25.04573593082671]
We propose a self-supervised pre-training (SSP) method for table structure recognition transformers.
We discover that the performance gap between the linear projection transformer and the hybrid CNN-transformer can be mitigated by SSP of the visual encoder in the TSR model.
arXiv Detail & Related papers (2024-02-23T19:34:06Z)
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
- Interpret Vision Transformers as ConvNets with Dynamic Convolutions [70.59235381143831]
We interpret vision Transformers as ConvNets with dynamic convolutions, which enables us to characterize existing Transformers and dynamic ConvNets in a unified framework.
Our interpretation can also guide the network design as researchers now can consider vision Transformers from the design space of ConvNets.
arXiv Detail & Related papers (2023-09-19T16:00:49Z)
- Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
arXiv Detail & Related papers (2022-03-20T02:59:51Z)
- Swin-Pose: Swin Transformer Based Human Pose Estimation [16.247836509380026]
Convolutional neural networks (CNNs) have been widely utilized in many computer vision tasks.
However, CNNs have a fixed receptive field and lack long-range perception, which is crucial for human pose estimation.
We propose a novel model based on transformer architecture, enhanced with a feature pyramid fusion structure.
arXiv Detail & Related papers (2022-01-19T02:15:26Z)
- Rethinking the Design Principles of Robust Vision Transformer [28.538786330184642]
Vision Transformers (ViT) have shown that self-attention-based networks can surpass traditional convolutional neural networks (CNNs) in most vision tasks.
In this paper, we rethink the design principles of ViTs from the perspective of robustness.
By combining robust design components, we propose the Robust Vision Transformer (RVT).
arXiv Detail & Related papers (2021-05-17T15:04:15Z)
- Multiscale Vision Transformers [79.76412415996892]
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models.
We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks.
arXiv Detail & Related papers (2021-04-22T17:59:45Z)
- CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
The Convolutional vision Transformer (CvT) improves on the Vision Transformer (ViT) in performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs.
arXiv Detail & Related papers (2021-03-29T17:58:22Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long-range dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
- A Survey on Visual Transformer [126.56860258176324]
Transformer is a type of deep neural network mainly based on the self-attention mechanism.
In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages.
arXiv Detail & Related papers (2020-12-23T09:37:54Z)