Recent Advances in Vision Transformer: A Survey and Outlook of Recent
Work
- URL: http://arxiv.org/abs/2203.01536v5
- Date: Tue, 17 Oct 2023 06:05:41 GMT
- Title: Recent Advances in Vision Transformer: A Survey and Outlook of Recent
Work
- Authors: Khawar Islam
- Abstract summary: Vision Transformers (ViTs) are becoming a more popular and dominant technique for various vision tasks compared to Convolutional Neural Networks (CNNs).
As an in-demand technique in computer vision, ViTs have successfully solved various vision problems while focusing on long-range relationships.
We thoroughly compare the performance of various ViT algorithms and most representative CNN methods on popular benchmark datasets.
- Score: 1.6317061277457001
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) are becoming a more popular and dominant
technique for various vision tasks compared to Convolutional Neural Networks
(CNNs). As an in-demand technique in computer vision, ViTs have successfully
solved various vision problems while focusing on long-range relationships. In this
paper, we begin by introducing the fundamental concepts and background of the
self-attention mechanism. Next, we provide a comprehensive overview of recent
top-performing ViT methods, describing them in terms of strengths and weaknesses,
computational cost, and training and testing datasets. We thoroughly compare the
performance of various ViT algorithms and the most representative CNN methods on
popular benchmark datasets. Finally, we explore some limitations with insightful
observations and provide directions for further research. The project page, along
with the collection of papers, is available at
https://github.com/khawar512/ViT-Survey
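As background for the self-attention mechanism the survey introduces, the following is a minimal sketch of single-head scaled dot-product self-attention over a sequence of patch embeddings, written in NumPy; the shapes, names, and toy inputs are illustrative assumptions, not code from the survey or any of the listed papers.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, Wq, Wk, Wv):
        # X: (num_patches, d_model) patch embeddings; Wq/Wk/Wv: (d_model, d_head)
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])  # pairwise patch affinities, scaled
        attn = softmax(scores, axis=-1)          # each patch attends to every patch
        return attn @ V                          # (num_patches, d_head)

    # Toy usage: 16 patches (a 4x4 grid), 64-dim embeddings, 32-dim head
    rng = np.random.default_rng(0)
    X = rng.standard_normal((16, 64))
    Wq, Wk, Wv = (rng.standard_normal((64, 32)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)  # (16, 32)

Because every patch attends to every other patch within a single layer, this operation captures the long-range relationships the abstract refers to, in contrast to the local receptive fields of a convolution.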
Related papers
- A Comparative Survey of Vision Transformers for Feature Extraction in Texture Analysis [9.687982148528187]
Convolutional Neural Networks (CNNs) are currently among the best texture analysis approaches.
Vision Transformers (ViTs) have been surpassing the performance of CNNs on tasks such as object recognition.
This work explores various pre-trained ViT architectures when transferred to tasks that rely on textures.
arXiv Detail & Related papers (2024-06-10T09:48:13Z)
- ViTs are Everywhere: A Comprehensive Study Showcasing Vision Transformers in Different Domain [0.0]
Vision Transformers (ViTs) are becoming more popular and dominant solutions for many vision problems.
ViTs can overcome several of the difficulties associated with convolutional neural networks (CNNs).
arXiv Detail & Related papers (2023-10-09T12:31:30Z)
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
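To make the "Taylorize" idea concrete, below is a hedged sketch of what replacing a transformer nonlinearity with a low-degree polynomial can look like: the GELU activation is approximated by a fitted quadratic, since low-degree polynomials are cheap under secure multi-party computation while tanh/erf/softmax are not. The range, degree, and fitting procedure here are illustrative assumptions, not PriViT's actual selection or training scheme.

    import numpy as np

    def gelu(x):
        # tanh approximation of GELU, the nonlinearity to be replaced
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

    # Fit a degree-2 polynomial surrogate on a plausible activation range.
    xs = np.linspace(-4.0, 4.0, 2001)
    poly_gelu = np.poly1d(np.polyfit(xs, gelu(xs), deg=2))

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(np.round(gelu(x), 3))       # reference nonlinearity
    print(np.round(poly_gelu(x), 3))  # rough MPC-friendly polynomial stand-in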
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
- What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z)
- Vision Transformer Visualization: What Neurons Tell and How Neurons Behave? [33.87454837848252]
We propose an effective visualization technique to assist us in exposing the information carried in neurons and feature embeddings across vision transformers (ViTs).
Our approach departs from the computational process of ViTs with a focus on visualizing the local and global information in input images and the latent feature embeddings at multiple levels.
Next, we develop a rigorous framework to perform effective visualizations across layers, exposing the effects of ViT filters and their grouping/clustering behaviors on object patches.
arXiv Detail & Related papers (2022-10-14T08:56:24Z)
- Self-Distilled Vision Transformer for Domain Generalization [58.76055100157651]
Vision transformers (ViTs) are challenging the supremacy of CNNs on standard benchmarks.
We propose a simple DG approach for ViTs, coined as self-distillation for ViTs.
We empirically demonstrate notable performance gains with different DG baselines and various ViT backbones on five challenging datasets.
arXiv Detail & Related papers (2022-07-25T17:57:05Z)
- Can Vision Transformers Perform Convolution? [78.42076260340869]
We prove that a single ViT layer with image patches as the input can perform any convolution operation constructively.
We provide a lower bound on the number of heads for Vision Transformers to express CNNs.
arXiv Detail & Related papers (2021-11-02T03:30:17Z)
- Do Vision Transformers See Like Convolutional Neural Networks? [45.69780772718875]
Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
arXiv Detail & Related papers (2021-08-19T17:27:03Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.