Recent Advances in Vision Transformer: A Survey and Outlook of Recent
Work
- URL: http://arxiv.org/abs/2203.01536v5
- Date: Tue, 17 Oct 2023 06:05:41 GMT
- Title: Recent Advances in Vision Transformer: A Survey and Outlook of Recent
Work
- Authors: Khawar Islam
- Abstract summary: Vision Transformers (ViTs) are becoming a more popular and dominant technique for various vision tasks compared to Convolutional Neural Networks (CNNs).
As an in-demand technique in computer vision, ViTs have successfully solved various vision problems while focusing on long-range relationships.
We thoroughly compare the performance of various ViT algorithms and most representative CNN methods on popular benchmark datasets.
- Score: 1.6317061277457001
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) are becoming a more popular and dominant
technique for various vision tasks compared to Convolutional Neural Networks
(CNNs). As an in-demand technique in computer vision, ViTs have successfully
solved various vision problems while focusing on long-range relationships. In this
paper, we begin by introducing the fundamental concepts and background of the
self-attention mechanism. Next, we provide a comprehensive overview of recent
top-performing ViT methods, describing them in terms of strengths and weaknesses,
computational cost, and training and testing datasets. We thoroughly compare the
performance of various ViT algorithms and the most representative CNN methods on
popular benchmark datasets. Finally, we explore some limitations with insightful
observations and provide directions for further research. The project page, along
with the collection of papers, is available at
https://github.com/khawar512/ViT-Survey
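As background for the self-attention mechanism the survey introduces, the following is a minimal sketch of single-head scaled dot-product self-attention over a sequence of patch embeddings, written in NumPy; the shapes, names, and toy inputs are illustrative assumptions, not code from the survey or any of the listed papers.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, Wq, Wk, Wv):
        # X: (num_patches, d_model) patch embeddings; Wq/Wk/Wv: (d_model, d_head)
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])  # pairwise patch affinities, scaled
        attn = softmax(scores, axis=-1)          # each patch attends to every patch
        return attn @ V                          # (num_patches, d_head)

    # Toy usage: 16 patches (a 4x4 grid), 64-dim embeddings, 32-dim head
    rng = np.random.default_rng(0)
    X = rng.standard_normal((16, 64))
    Wq, Wk, Wv = (rng.standard_normal((64, 32)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)  # (16, 32)

Because every patch attends to every other patch within a single layer, this operation captures the long-range relationships the abstract refers to, in contrast to the local receptive fields of a convolution.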
Related papers
- A Comparative Survey of Vision Transformers for Feature Extraction in Texture Analysis [9.687982148528187]
Convolutional Neural Networks (CNNs) are currently among the best texture analysis approaches.
Vision Transformers (ViTs) have been surpassing the performance of CNNs on tasks such as object recognition.
This work explores various pre-trained ViT architectures when transferred to tasks that rely on textures.
arXiv Detail & Related papers (2024-06-10T09:48:13Z)
- ViTs are Everywhere: A Comprehensive Study Showcasing Vision Transformers in Different Domain [0.0]
Vision Transformers (ViTs) are becoming more popular and dominant solutions for many vision problems.
ViTs can overcome several of the difficulties associated with convolutional neural networks (CNNs).
arXiv Detail & Related papers (2023-10-09T12:31:30Z)
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
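To make the "Taylorize" idea concrete, below is a hedged sketch of what replacing a transformer nonlinearity with a low-degree polynomial can look like: the GELU activation is approximated by a fitted quadratic, since low-degree polynomials are cheap under secure multi-party computation while tanh/erf/softmax are not. The range, degree, and fitting procedure here are illustrative assumptions, not PriViT's actual selection or training scheme.

    import numpy as np

    def gelu(x):
        # tanh approximation of GELU, the nonlinearity to be replaced
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

    # Fit a degree-2 polynomial surrogate on a plausible activation range.
    xs = np.linspace(-4.0, 4.0, 2001)
    poly_gelu = np.poly1d(np.polyfit(xs, gelu(xs), deg=2))

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(np.round(gelu(x), 3))       # reference nonlinearity
    print(np.round(poly_gelu(x), 3))  # rough MPC-friendly polynomial stand-in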
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
- What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z)
- Vision Transformer Visualization: What Neurons Tell and How Neurons Behave? [33.87454837848252]
We propose an effective visualization technique to assist us in exposing the information carried in neurons and feature embeddings across vision transformers (ViTs).
Our approach departs from the computational process of ViTs with a focus on visualizing the local and global information in input images and the latent feature embeddings at multiple levels.
Next, we develop a rigorous framework to perform effective visualizations across layers, exposing the effects of ViT filters and their grouping/clustering behaviors on object patches.
arXiv Detail & Related papers (2022-10-14T08:56:24Z)
- Self-Distilled Vision Transformer for Domain Generalization [58.76055100157651]
Vision transformers (ViTs) are challenging the supremacy of CNNs on standard benchmarks.
We propose a simple DG approach for ViTs, coined as self-distillation for ViTs.
We empirically demonstrate notable performance gains with different DG baselines and various ViT backbones on five challenging datasets.
arXiv Detail & Related papers (2022-07-25T17:57:05Z)
- Can Vision Transformers Perform Convolution? [78.42076260340869]
We prove that a single ViT layer with image patches as the input can perform any convolution operation constructively.
We provide a lower bound on the number of heads for Vision Transformers to express CNNs.
arXiv Detail & Related papers (2021-11-02T03:30:17Z)
- Do Vision Transformers See Like Convolutional Neural Networks? [45.69780772718875]
Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
arXiv Detail & Related papers (2021-08-19T17:27:03Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.