A survey of the Vision Transformers and its CNN-Transformer based
Variants
- URL: http://arxiv.org/abs/2305.09880v3
- Date: Tue, 8 Aug 2023 07:02:16 GMT
- Authors: Asifullah Khan, Zunaira Rauf, Anabia Sohail, Abdul Rehman, Hifsa Asif,
Aqsa Asif, and Umair Farooq
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers have become popular as a possible substitute for
convolutional neural networks (CNNs) in a variety of computer vision
applications. With their ability to capture global relationships in images,
these transformers offer large learning capacity. However, they may suffer from
limited generalization, as they do not tend to model local correlations in
images. Recently, hybrid vision transformers that combine the convolution
operation with the self-attention mechanism have emerged to exploit both local
and global image representations. These hybrid vision transformers, also
referred to as CNN-Transformer architectures, have demonstrated remarkable
results in vision applications. Given the rapidly growing number of hybrid
vision transformers, it has become necessary to provide a taxonomy and
explanation of these hybrid architectures. This survey presents a taxonomy of
recent vision transformer architectures and, more specifically, of hybrid
vision transformers. Additionally, key features of these architectures, such as
attention mechanisms, positional embeddings, multi-scale processing, and
convolution, are discussed. In contrast to previous surveys, which focus
primarily on individual vision transformer architectures or CNNs, this survey
uniquely emphasizes the emerging trend of hybrid vision transformers. By
showcasing their potential to deliver exceptional performance across a range of
computer vision tasks, it sheds light on the future directions of these rapidly
evolving architectures.
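The local-plus-global hybridization the abstract describes can be sketched in simplified NumPy form. This is an illustrative single-head block with random weights, not the design of any specific architecture from the survey: a depthwise 3x3 averaging step stands in for the convolutional (local) branch, followed by scaled dot-product self-attention over all spatial positions for the global branch, each with a residual connection.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Global mixing: scaled dot-product self-attention over all tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    d = q.shape[-1]
    scores = softmax(q @ k.T / np.sqrt(d), axis=-1)
    return scores @ v

def depthwise_conv3x3(feat):
    """Local mixing: 3x3 depthwise average over an (H, W, C) feature map."""
    h, w, _ = feat.shape
    padded = np.pad(feat, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(feat)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + 3, j:j + 3].mean(axis=(0, 1))
    return out

def hybrid_block(feat, w_q, w_k, w_v):
    """Convolve locally, then attend globally, with residual connections."""
    local = feat + depthwise_conv3x3(feat)      # local correlations
    h, w, c = local.shape
    tokens = local.reshape(h * w, c)            # flatten spatial map to tokens
    mixed = tokens + self_attention(tokens, w_q, w_k, w_v)  # global relations
    return mixed.reshape(h, w, c)

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 4, 8))           # 4x4 feature map, 8 channels
w = [rng.standard_normal((8, 8)) / np.sqrt(8) for _ in range(3)]
out = hybrid_block(feat, *w)
print(out.shape)  # (4, 4, 8)
```

Real CNN-Transformer variants replace the averaging with learned convolutions, use multiple heads and normalization layers, and often interleave several such stages at decreasing spatial resolutions.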
Related papers
- ViT-LCA: A Neuromorphic Approach for Vision Transformers
This paper introduces a novel model that combines vision transformers with the Locally Competitive Algorithm (LCA) to facilitate efficient neuromorphic deployment.
Our experiments show that ViT-LCA achieves higher accuracy on the ImageNet-1K dataset while consuming significantly less energy than other spiking vision transformer counterparts.
arXiv Detail & Related papers (2024-10-31T18:41:30Z)
- Multi-Dimensional Hyena for Spatial Inductive Bias
We present a data-efficient vision transformer that does not rely on self-attention.
Instead, it employs a novel generalization to multiple axes of the very recent Hyena layer.
We show that a hybrid approach that is based on Hyena N-D for the first layers in ViT, followed by layers that incorporate conventional attention, consistently boosts the performance of various vision transformer architectures.
arXiv Detail & Related papers (2023-09-24T10:22:35Z)
- Interpret Vision Transformers as ConvNets with Dynamic Convolutions
We interpret vision Transformers as ConvNets with dynamic convolutions, which enables us to characterize existing Transformers and dynamic ConvNets in a unified framework.
Our interpretation can also guide the network design as researchers now can consider vision Transformers from the design space of ConvNets.
arXiv Detail & Related papers (2023-09-19T16:00:49Z)
- What Makes for Good Tokenizers in Vision Transformer?
Transformers are capable of extracting pairwise relationships among tokens using self-attention.
What makes for a good tokenizer has not been well understood in computer vision.
Modulation across Tokens (MoTo) incorporates inter-token modeling capability through normalization.
Regularization objective TokenProp is embraced in the standard training regime.
arXiv Detail & Related papers (2022-12-21T15:51:43Z)
- Multi-manifold Attention for Vision Transformers
Vision Transformers are very popular nowadays due to their state-of-the-art performance in several computer vision tasks.
A novel attention mechanism, called multi-manifold multihead attention, is proposed in this work to substitute the vanilla self-attention of a Transformer.
arXiv Detail & Related papers (2022-07-18T12:53:53Z)
- Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions
This work builds on the Vision Transformer, combined with a pyramid architecture, using a split-transform-merge strategy to propose a group encoder; the resulting network is named the Aggregated Pyramid Vision Transformer (APVT).
We perform image classification tasks on the CIFAR-10 dataset and object detection tasks on the COCO 2017 dataset.
arXiv Detail & Related papers (2022-03-02T09:14:28Z)
- Can Vision Transformers Perform Convolution?
We prove that a single ViT layer with image patches as the input can perform any convolution operation constructively.
We provide a lower bound on the number of heads for Vision Transformers to express CNNs.
arXiv Detail & Related papers (2021-11-02T03:30:17Z)
- Transformers in Vision: A Survey
Transformers enable modeling of long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
- A Survey on Visual Transformer
The Transformer is a type of deep neural network based mainly on the self-attention mechanism.
In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages.
arXiv Detail & Related papers (2020-12-23T09:37:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.