A Comprehensive Survey of Transformers for Computer Vision
- URL: http://arxiv.org/abs/2211.06004v1
- Date: Fri, 11 Nov 2022 05:11:03 GMT
- Title: A Comprehensive Survey of Transformers for Computer Vision
- Authors: Sonain Jamil, Md. Jalil Piran, and Oh-Jin Kwon
- Abstract summary: Vision Transformers (ViTs) are used in various computer vision (CV) applications.
To the best of our knowledge, this survey is the first of its kind on ViTs for CV.
CV applications include image classification, object detection, image segmentation, image compression, image super-resolution, image denoising, and anomaly detection.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As a special type of transformer, Vision Transformers (ViTs) are used in
various computer vision (CV) applications, such as image recognition. ViTs can solve
several potential problems with convolutional neural networks (CNNs). For image coding
tasks such as compression, super-resolution, segmentation, and denoising, different
variants of ViTs are used. The purpose of this survey is to present the applications of
ViTs in CV; to the best of our knowledge, it is the first survey of its kind on ViTs for
CV. In the first step, we classify the CV applications where ViTs are applicable: image
classification, object detection, image segmentation, image compression, image
super-resolution, image denoising, and anomaly detection. Our next step is to review the
state of the art in each category and list the available models. Following that, we
present a detailed analysis and comparison of each model and list its pros and cons.
After that, we present our insights and lessons learned for each category. Finally, we
discuss several open research challenges and future research directions.
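For context, every ViT variant covered here shares the same front end: the image becomes a sequence of linearly embedded patches with positional embeddings. A minimal NumPy sketch, with illustrative shapes and random matrices standing in for learned weights:

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # group the two grid axes
    return patches.reshape(-1, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
tokens = patchify(image)                           # (196, 768) patch sequence
embed = tokens @ rng.standard_normal((768, 384))   # stand-in patch embedding
pos = rng.standard_normal(embed.shape)             # stand-in positional embedding
x = embed + pos                                    # input to the transformer encoder
print(x.shape)  # (196, 384)
```

Real implementations also prepend a class token and use trained projections; the sketch only shows the patch-to-token step.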
Related papers
- ViTs are Everywhere: A Comprehensive Study Showcasing Vision Transformers in Different Domain
Vision Transformers (ViTs) are becoming more popular and dominant solutions for many vision problems.
ViTs can overcome several possible difficulties with convolutional neural networks (CNNs).
arXiv Detail & Related papers (2023-10-09T12:31:30Z)
- PriViT: Vision Transformers for Fast Private Inference
Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm that selectively "Taylorizes" nonlinearities in ViTs while maintaining their prediction accuracy (a rough sketch follows this entry).
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
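A rough sketch of the "Taylorize" idea: swap a non-polynomial activation for a low-degree polynomial, which secure multi-party protocols can evaluate with cheap additions and multiplications. The expansion point and degree here are illustrative; PriViT selects which nonlinearities to replace and fine-tunes to preserve accuracy.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU, the usual ViT nonlinearity.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def taylorized_gelu(x):
    # Second-order Taylor expansion of GELU = x * Phi(x) around 0,
    # using Phi(x) ~ 0.5 + x / sqrt(2*pi). Purely polynomial, so it is
    # MPC-friendly, unlike the transcendental tanh/erf.
    return 0.5 * x + x**2 / np.sqrt(2.0 * np.pi)

x = np.linspace(-1.0, 1.0, 5)
print(np.round(gelu(x), 4))             # exact nonlinearity
print(np.round(taylorized_gelu(x), 4))  # polynomial surrogate, close near 0
```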
- Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields
Vision transformers (ViTs) that model an image as a sequence of partitioned patches have shown notable performance in diverse vision tasks.
We propose explicitly adding a Gaussian attention bias that guides the positional embedding to have the corresponding pattern from the beginning of training.
The results show that the proposed method not only helps ViTs understand images but also boosts their performance on various datasets (a minimal bias construction is sketched below).
arXiv Detail & Related papers (2023-05-08T14:12:25Z)
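A minimal sketch of a Gaussian bias over patch positions. The grid size and sigma are hypothetical, and for simplicity the bias is added directly to the attention logits; the paper derives the target pattern from effective receptive fields and applies it to the positional embeddings at initialization.

```python
import numpy as np

def gaussian_attention_bias(grid=14, sigma=3.0):
    """bias[i, j] favors attention between spatially close patches."""
    coords = np.stack(np.meshgrid(np.arange(grid), np.arange(grid),
                                  indexing="ij"), axis=-1).reshape(-1, 2)
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    return -d2 / (2 * sigma**2)  # log of an unnormalized Gaussian

def attention(q, k, v, bias):
    logits = q @ k.T / np.sqrt(q.shape[-1]) + bias
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
n, d = 14 * 14, 64
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
print(attention(q, k, v, gaussian_attention_bias()).shape)  # (196, 64)
```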
- RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving
Vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks.
ViTs are notoriously hard to train and require a lot of training data to learn powerful representations.
We show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and SemanticKITTI (the projection step is sketched after this entry).
arXiv Detail & Related papers (2023-01-24T18:50:48Z)
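RangeViT operates on projected LiDAR data. A hedged sketch of the standard spherical projection that turns a point cloud into a 2D range image a ViT can patchify; the field-of-view bounds and resolution are illustrative, and the real pipeline adds extra channels and a convolutional stem.

```python
import numpy as np

def to_range_image(points, h=32, w=512, fov_up=10.0, fov_down=-30.0):
    """Project (N, 3) LiDAR points into an (h, w) range image."""
    x, y, z = points.T
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                      # azimuth in [-pi, pi]
    pitch = np.arcsin(z / np.maximum(r, 1e-6))  # elevation
    fu, fd = np.radians(fov_up), np.radians(fov_down)
    rows = (1 - (pitch - fd) / (fu - fd)) * (h - 1)
    cols = (yaw / np.pi + 1) / 2 * (w - 1)
    img = np.zeros((h, w))
    img[np.clip(rows.astype(int), 0, h - 1),
        np.clip(cols.astype(int), 0, w - 1)] = r
    return img

pts = np.random.default_rng(0).uniform(-20, 20, size=(10000, 3))
print(to_range_image(pts).shape)  # (32, 512)
```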
- What do Vision Transformers Learn? A Visual Exploration
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin (a toy example in this spirit follows the entry).
arXiv Detail & Related papers (2022-12-13T16:55:12Z)
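A toy sketch in the spirit of activation-maximization visualization: ascend the gradient of a feature's response with respect to the input. The "feature" here is a synthetic tanh unit, not an actual ViT activation, and the paper's exact procedure differs.

```python
import numpy as np

def visualize_feature(w, steps=200, lr=0.1):
    """Gradient ascent on a synthetic input to maximize a feature's response."""
    rng = np.random.default_rng(0)
    x = rng.standard_normal((16, w.shape[0])) * 0.01   # 16 token vectors
    for _ in range(steps):
        grad = (1.0 - np.tanh(x @ w) ** 2)[:, None] * w  # d/dx sum(tanh(x @ w))
        x += lr * grad
    return x

w = np.random.default_rng(1).standard_normal(64)
x_star = visualize_feature(w)
print(np.tanh(x_star @ w).sum())  # response approaches the maximum of 16.0
```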
- Vision Transformers: From Semantic Segmentation to Dense Prediction
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture (a simplified block is sketched below).
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
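A hedged sketch of the local/global split: attention runs inside each window, then among per-window summary tokens. The window size, mean-pooled summaries, and additive fusion are assumptions for illustration; the actual HLG blocks are more elaborate and arranged in a pyramid.

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def attend(q, k, v):
    return softmax(q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])) @ v

def local_global_block(x, window=49):
    """x: (num_tokens, dim). Local attention inside each window,
    then global attention among per-window summary tokens."""
    n, d = x.shape
    wins = x.reshape(n // window, window, d)
    local = attend(wins, wins, wins)                 # attention within windows
    summaries = local.mean(axis=1)                   # one token per window
    glob = attend(summaries, summaries, summaries)   # attention across windows
    return (local + glob[:, None, :]).reshape(n, d)  # fuse and flatten back

x = np.random.default_rng(0).standard_normal((196, 64))
print(local_global_block(x).shape)  # (196, 64)
```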
- ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator
We propose a novel ViT-based fine-grained object discriminator for Fine-Grained Visual Classification (FGVC) tasks.
Besides a ViT backbone, it introduces three novel components, i.e., Attention Patch Combination (APC), Critical Regions Filter (CRF), and Complementary Tokens Integration (CTI).
We conduct comprehensive experiments on widely used datasets and the results demonstrate that ViT-FOD is able to achieve state-of-the-art performance.
arXiv Detail & Related papers (2022-03-24T02:34:57Z)
- Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work
Vision Transformers (ViTs) are becoming a more popular and dominant technique for various vision tasks, compared to convolutional neural networks (CNNs).
As an in-demand technique in computer vision, ViTs have successfully solved various vision problems while focusing on long-range relationships.
We thoroughly compare the performance of various ViT algorithms and most representative CNN methods on popular benchmark datasets.
arXiv Detail & Related papers (2022-03-03T06:17:03Z)
- ViR: the Vision Reservoir
Vision Reservoir computing (ViR) is proposed for image classification, as a parallel to the Vision Transformer (ViT).
By splitting each image into a sequence of tokens with fixed length, the ViR constructs a pure reservoir with a nearly fully connected topology to replace the Transformer module in ViT.
The ViR has about 15%, or even as few as 5%, of the ViT's parameters, and its memory footprint is about 20% to 40% of the ViT's (a reservoir sketch follows this entry).
arXiv Detail & Related papers (2021-12-27T07:07:50Z)
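The reservoir idea replaces the trained transformer encoder with a fixed, nearly fully connected recurrent state driven by the token sequence, so only a readout needs training. A hedged echo-state-style sketch; the sizes, spectral scaling, and final-state readout are conventions assumed here, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_res = 768, 512

# Fixed (untrained) reservoir weights, scaled for a stable echo state.
W_in = rng.standard_normal((d_res, d_in)) * 0.1
W = rng.standard_normal((d_res, d_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1

def reservoir(tokens):
    """Drive the reservoir with a (T, d_in) token sequence."""
    h = np.zeros(d_res)
    for t in tokens:
        h = np.tanh(W_in @ t + W @ h)
    return h  # final state; only a linear readout on h is trained

tokens = rng.standard_normal((196, d_in))  # patch tokens of one image
print(reservoir(tokens).shape)  # (512,)
```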
- Vision Transformer with Progressive Sampling
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT achieves 3.8% higher top-1 accuracy than the vanilla ViT (the sampling loop is sketched below).
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
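A hedged sketch of the progressive sampling loop: locations start on a regular grid and are repeatedly offset based on the tokens sampled there. The nearest-neighbor sampling and the random linear offset predictor are illustrative simplifications; PS-ViT learns the offsets and uses interpolation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(feat_map, pts):
    """Nearest-neighbor sampling of (N, 2) points from an (H, W, C) map."""
    h, w, _ = feat_map.shape
    r = np.clip(np.round(pts[:, 0]).astype(int), 0, h - 1)
    c = np.clip(np.round(pts[:, 1]).astype(int), 0, w - 1)
    return feat_map[r, c]

feat = rng.standard_normal((14, 14, 64))
ys, xs = np.meshgrid(np.arange(2, 14, 3), np.arange(2, 14, 3), indexing="ij")
pts = np.stack([ys, xs], -1).reshape(-1, 2).astype(float)  # regular grid

W_off = rng.standard_normal((64, 2)) * 0.01  # stand-in offset predictor
for _ in range(4):                            # progressive iterations
    tokens = sample(feat, pts)                # tokens at current locations
    pts = pts + tokens @ W_off                # shift toward salient regions

print(tokens.shape, pts.shape)  # (16, 64) (16, 2)
```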
- Factors of Influence for Transfer Learning across Diverse Appearance Domains and Task Types
A simple form of transfer learning is common in current state-of-the-art computer vision models.
Previous systematic studies of transfer learning have been limited and the circumstances in which it is expected to work are not fully understood.
In this paper, we carry out an extensive experimental exploration of transfer learning across vastly different image domains (a minimal fine-tuning sketch follows this entry).
arXiv Detail & Related papers (2021-03-24T16:24:20Z)
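One common instantiation of that "simple form" of transfer learning is linear probing: reuse a pretrained backbone and train only a new task head. A hedged PyTorch sketch; the model choice, class count, and data are placeholders, not the paper's setup.

```python
import torch
import torchvision

# ImageNet-pretrained backbone with a fresh head for a 10-class target task.
model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
model.fc = torch.nn.Linear(model.fc.in_features, 10)

# Freeze the backbone; train only the new head.
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True

opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
x = torch.randn(8, 3, 224, 224)               # stand-in image batch
y = torch.randint(0, 10, (8,))                # stand-in labels
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()
```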