Vision Transformers: State of the Art and Research Challenges
- URL: http://arxiv.org/abs/2207.03041v1
- Date: Thu, 7 Jul 2022 02:01:56 GMT
- Title: Vision Transformers: State of the Art and Research Challenges
- Authors: Bo-Kai Ruan, Hong-Han Shuai, Wen-Huang Cheng
- Abstract summary: This paper presents a comprehensive overview of the literature on different architecture designs and training tricks for vision transformers.
Our goal is to provide a systematic review together with open research opportunities.
- Score: 26.462994554165697
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transformers have achieved great success in natural language processing. Due
to the powerful capability of the self-attention mechanism in transformers,
researchers have developed vision transformers for a variety of computer vision
tasks, such as image recognition, object detection, image segmentation, pose
estimation, and 3D reconstruction. This paper presents a comprehensive overview
of the literature on different architecture designs and training tricks
(including self-supervised learning) for vision transformers. Our goal is to
provide a systematic review together with open research opportunities.
Related papers
- Adventures of Trustworthy Vision-Language Models: A Survey [54.76511683427566]
This paper conducts a thorough examination of vision-language transformers, employing three fundamental principles of responsible AI: Bias, Robustness, and Interpretability.
The primary objective of this paper is to delve into the intricacies and complexities associated with the practical use of transformers, with the overarching goal of advancing our comprehension of how to enhance their reliability and accountability.
arXiv Detail & Related papers (2023-12-07T11:31:20Z)
- Learning Explicit Object-Centric Representations with Vision Transformers [81.38804205212425]
We build on the self-supervision task of masked autoencoding and explore its effectiveness for learning object-centric representations with transformers.
We show that the model efficiently learns to decompose simple scenes as measured by segmentation metrics on several multi-object benchmarks.
arXiv Detail & Related papers (2022-10-25T16:39:49Z)
- 3D Vision with Transformers: A Survey [114.86385193388439]
The success of the transformer architecture in natural language processing has triggered attention in the computer vision field.
We present a systematic and thorough review of more than 100 transformer methods for different 3D vision tasks.
We discuss transformer design in 3D vision, which allows it to process data with various 3D representations.
arXiv Detail & Related papers (2022-08-08T17:59:11Z)
- A Survey of Visual Transformers [30.082304742571598]
Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing.
Some pioneering works have recently been done on adapting Transformer architectures to Computer Vision (CV) fields.
We have provided a comprehensive review of over one hundred different visual Transformers for three fundamental CV tasks.
arXiv Detail & Related papers (2021-11-11T07:56:04Z)
- Self-Supervised Multi-View Learning via Auto-Encoding 3D Transformations [61.870882736758624]
We propose a novel self-supervised paradigm to learn Multi-View Transformation Equivariant Representations (MV-TER).
Specifically, we perform a 3D transformation on a 3D object, and obtain multiple views before and after the transformation via projection.
Then, we self-train a representation to capture the intrinsic 3D object representation by decoding 3D transformation parameters from the fused feature representations of multiple views before and after the transformation.
arXiv Detail & Related papers (2021-03-01T06:24:17Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long-range dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
- A Survey on Visual Transformer [126.56860258176324]
Transformer is a type of deep neural network mainly based on the self-attention mechanism.
In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages.
arXiv Detail & Related papers (2020-12-23T09:37:54Z)