Transformers in Vision: A Survey
- URL: http://arxiv.org/abs/2101.01169v2
- Date: Mon, 22 Feb 2021 11:40:11 GMT
- Title: Transformers in Vision: A Survey
- Authors: Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad
Shahbaz Khan, Mubarak Shah
- Abstract summary: Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
- Score: 101.07348618962111
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Astounding results from Transformer models on natural language tasks have
intrigued the vision community to study their application to computer vision
problems. Among their salient benefits, Transformers enable modeling long
dependencies between input sequence elements and support parallel processing of
sequence as compared to recurrent networks e.g., Long short-term memory (LSTM).
Different from convolutional networks, Transformers require minimal inductive
biases for their design and are naturally suited as set-functions. Furthermore,
the straightforward design of Transformers allows processing multiple
modalities (e.g., images, videos, text and speech) using similar processing
blocks and demonstrates excellent scalability to very large capacity networks
and huge datasets. These strengths have led to exciting progress on a number of
vision tasks using Transformer networks. This survey aims to provide a
comprehensive overview of the Transformer models in the computer vision
discipline. We start with an introduction to fundamental concepts behind the
success of Transformers i.e., self-attention, large-scale pre-training, and
bidirectional encoding. We then cover extensive applications of transformers in
vision including popular recognition tasks (e.g., image classification, object
detection, action recognition, and segmentation), generative modeling,
multi-modal tasks (e.g., visual-question answering, visual reasoning, and
visual grounding), video processing (e.g., activity recognition, video
forecasting), low-level vision (e.g., image super-resolution, image
enhancement, and colorization) and 3D analysis (e.g., point cloud
classification and segmentation). We compare the respective advantages and
limitations of popular techniques both in terms of architectural design and
their experimental value. Finally, we provide an analysis on open research
directions and possible future works.
Related papers
- ViTs are Everywhere: A Comprehensive Study Showcasing Vision
Transformers in Different Domain [0.0]
Vision Transformers (ViTs) are becoming more popular and dominant solutions for many vision problems.
ViTs can overcome several possible difficulties with convolutional neural networks (CNNs)
arXiv Detail & Related papers (2023-10-09T12:31:30Z) - Vision Transformer with Quadrangle Attention [76.35955924137986]
We propose a novel quadrangle attention (QA) method that extends the window-based attention to a general quadrangle formulation.
Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles.
We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which offers minor code modifications and negligible extra computational cost.
arXiv Detail & Related papers (2023-03-27T11:13:50Z) - Holistically Explainable Vision Transformers [136.27303006772294]
We propose B-cos transformers, which inherently provide holistic explanations for their decisions.
Specifically, we formulate each model component - such as the multi-layer perceptrons, attention layers, and the tokenisation module - to be dynamic linear.
We apply our proposed design to Vision Transformers (ViTs) and show that the resulting models, dubbed Bcos-ViTs, are highly interpretable and perform competitively to baseline ViTs.
arXiv Detail & Related papers (2023-01-20T16:45:34Z) - 3D Vision with Transformers: A Survey [114.86385193388439]
The success of the transformer architecture in natural language processing has triggered attention in the computer vision field.
We present a systematic and thorough review of more than 100 transformers methods for different 3D vision tasks.
We discuss transformer design in 3D vision, which allows it to process data with various 3D representations.
arXiv Detail & Related papers (2022-08-08T17:59:11Z) - Video Transformers: A Survey [42.314208650554264]
We study the contributions and trends for adapting Transformers to model video data.
Specifically, we delve into how videos are embedded and tokenized, finding a very widspread use of large CNN backbones.
Also, we analyse the self-supervised losses used to train Video Transformers, which to date are mostly constrained to contrastive approaches.
arXiv Detail & Related papers (2022-01-16T07:31:55Z) - A Survey of Visual Transformers [30.082304742571598]
Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing.
Some pioneering works have recently been done on adapting Transformer architectures to Computer Vision (CV) fields.
We have provided a comprehensive review of over one hundred different visual Transformers for three fundamental CV tasks.
arXiv Detail & Related papers (2021-11-11T07:56:04Z) - Multiscale Vision Transformers [79.76412415996892]
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models.
We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks.
arXiv Detail & Related papers (2021-04-22T17:59:45Z) - A Survey on Visual Transformer [126.56860258176324]
Transformer is a type of deep neural network mainly based on the self-attention mechanism.
In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages.
arXiv Detail & Related papers (2020-12-23T09:37:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.