A Survey on Visual Transformer
- URL: http://arxiv.org/abs/2012.12556v6
- Date: Mon, 10 Jul 2023 13:54:04 GMT
- Title: A Survey on Visual Transformer
- Authors: Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua
Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, Zhaohui Yang, Yiman Zhang,
Dacheng Tao
- Abstract summary: Transformer is a type of deep neural network mainly based on the self-attention mechanism.
In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages.
- Score: 126.56860258176324
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer, first applied to the field of natural language processing, is a
type of deep neural network mainly based on the self-attention mechanism.
Thanks to its strong representation capabilities, researchers are looking at
ways to apply transformer to computer vision tasks. In a variety of visual
benchmarks, transformer-based models perform similar to or better than other
types of networks such as convolutional and recurrent neural networks. Given
its high performance and less need for vision-specific inductive bias,
transformer is receiving more and more attention from the computer vision
community. In this paper, we review these vision transformer models by
categorizing them in different tasks and analyzing their advantages and
disadvantages. The main categories we explore include the backbone network,
high/mid-level vision, low-level vision, and video processing. We also include
efficient transformer methods for pushing transformer into real device-based
applications. Furthermore, we also take a brief look at the self-attention
mechanism in computer vision, as it is the base component in transformer.
Toward the end of this paper, we discuss the challenges and provide several
further research directions for vision transformers.
Related papers
- ViTs are Everywhere: A Comprehensive Study Showcasing Vision
Transformers in Different Domain [0.0]
Vision Transformers (ViTs) are becoming more popular and dominant solutions for many vision problems.
ViTs can overcome several possible difficulties with convolutional neural networks (CNNs)
arXiv Detail & Related papers (2023-10-09T12:31:30Z) - Interpret Vision Transformers as ConvNets with Dynamic Convolutions [70.59235381143831]
We interpret vision Transformers as ConvNets with dynamic convolutions, which enables us to characterize existing Transformers and dynamic ConvNets in a unified framework.
Our interpretation can also guide the network design as researchers now can consider vision Transformers from the design space of ConvNets.
arXiv Detail & Related papers (2023-09-19T16:00:49Z) - A survey of the Vision Transformers and their CNN-Transformer based Variants [0.48163317476588563]
Vision transformers have become popular as a possible substitute to convolutional neural networks (CNNs) for a variety of computer vision applications.
These transformers, with their ability to focus on global relationships in images, offer large learning capacity.
Recently, in vision transformers hybridization of both the convolution operation and self-attention mechanism has emerged, to exploit both the local and global image representations.
arXiv Detail & Related papers (2023-05-17T01:27:27Z) - Advances in Medical Image Analysis with Vision Transformers: A
Comprehensive Review [6.953789750981636]
We provide an encyclopedic review of the applications of Transformers in medical imaging.
Specifically, we present a systematic and thorough review of relevant recent Transformer literature for different medical image analysis tasks.
arXiv Detail & Related papers (2023-01-09T16:56:23Z) - Vision Transformers for Action Recognition: A Survey [41.69370782177517]
Vision transformers are emerging as a powerful tool to solve computer vision problems.
Recent techniques have proven the efficacy of transformers beyond the image domain to solve numerous video-related tasks.
Human action recognition is receiving special attention from the research community due to its widespread applications.
arXiv Detail & Related papers (2022-09-13T02:57:05Z) - 3D Vision with Transformers: A Survey [114.86385193388439]
The success of the transformer architecture in natural language processing has triggered attention in the computer vision field.
We present a systematic and thorough review of more than 100 transformers methods for different 3D vision tasks.
We discuss transformer design in 3D vision, which allows it to process data with various 3D representations.
arXiv Detail & Related papers (2022-08-08T17:59:11Z) - Can Vision Transformers Perform Convolution? [78.42076260340869]
We prove that a single ViT layer with image patches as the input can perform any convolution operation constructively.
We provide a lower bound on the number of heads for Vision Transformers to express CNNs.
arXiv Detail & Related papers (2021-11-02T03:30:17Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper which applies transformers into pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.