A Comprehensive Study of Vision Transformers in Image Classification Tasks
- URL: http://arxiv.org/abs/2312.01232v2
- Date: Tue, 5 Dec 2023 03:46:27 GMT
- Title: A Comprehensive Study of Vision Transformers in Image Classification Tasks
- Authors: Mahmoud Khalil, Ahmad Khalil and Alioune Ngom
- Abstract summary: We conduct a comprehensive survey of existing papers on Vision Transformers for image classification.
We first introduce the popular image classification datasets that influenced the design of models.
We present Vision Transformer models in chronological order, starting with early attempts at adapting the attention mechanism to vision tasks.
- Score: 0.46040036610482665
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image Classification is a fundamental task in the field of computer vision
that frequently serves as a benchmark for gauging advancements in Computer
Vision. Over the past few years, significant progress has been made in image
classification due to the emergence of deep learning. However, challenges still
exist, such as modeling fine-grained visual information, high computation
costs, model parallelism, and inconsistent evaluation protocols
across datasets. In this paper, we conduct a comprehensive survey of existing
papers on Vision Transformers for image classification. We first introduce the
popular image classification datasets that influenced the design of models.
Then, we present Vision Transformer models in chronological order, starting
with early attempts at adapting the attention mechanism to vision tasks, followed by
the adoption of vision transformers, as they have demonstrated success in
capturing intricate patterns and long-range dependencies within images.
Finally, we discuss open problems and shed light on opportunities for image
classification to facilitate new research ideas.
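As a concrete illustration of the idea the survey traces (adapting attention to vision by treating an image as a sequence of patch tokens so self-attention can capture long-range dependencies), here is a minimal sketch in PyTorch; the class and parameter names are ours, not from the paper.

```python
# Minimal sketch: embed non-overlapping image patches as tokens and apply
# multi-head self-attention so every patch can attend to every other patch.
# Illustrative only; assumes PyTorch, and all names here are ours.
import torch
import torch.nn as nn

class PatchAttention(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=192, heads=3):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Strided convolution = non-overlapping patch embedding, as in ViT.
        self.to_tokens = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, images):                      # images: (B, 3, H, W)
        tokens = self.to_tokens(images)             # (B, dim, H/16, W/16)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        x = torch.cat([cls, tokens], dim=1) + self.pos_embed
        out, _ = self.attn(x, x, x)                 # global, long-range interactions
        return out[:, 0]                            # class token summarizes the image

print(PatchAttention()(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 192])
```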
Related papers
- A Review of Transformer-Based Models for Computer Vision Tasks: Capturing Global Context and Spatial Relationships [0.5639904484784127]
Transformer-based models have transformed the landscape of natural language processing (NLP).
These models are renowned for their ability to capture long-range dependencies and contextual information.
We discuss potential research directions and applications of transformer-based models in computer vision.
arXiv Detail & Related papers (2024-08-27T16:22:18Z)
- Analyzing Vision Transformers for Image Classification in Class Embedding Space [5.210197476419621]
This work introduces a method to reverse-engineer Vision Transformers trained to solve image classification tasks.
Inspired by previous research in NLP, we demonstrate how the inner representations at any level of the hierarchy can be projected onto the learned class space (see the projection sketch after this list).
We use our framework to show how image tokens develop class-specific representations that depend on attention mechanisms and contextual information.
arXiv Detail & Related papers (2023-10-29T10:25:23Z)
- Visual Prompt Tuning for Generative Transfer Learning [26.895321693202284]
We present a recipe for learning vision transformers by generative knowledge transfer.
We base our framework on state-of-the-art generative vision transformers that represent an image as a sequence of visual tokens fed to autoregressive or non-autoregressive transformers.
To adapt to a new domain, we employ prompt tuning, which prepends learnable tokens, called prompts, to the image token sequence (see the sketch after this list).
arXiv Detail & Related papers (2022-10-03T14:56:05Z)
- Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model [97.9548609175831]
We resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models customized for remote sensing tasks.
Specifically, to handle the large image size and objects of various orientations in RS images, we propose a new rotated varied-size window attention.
Experiments on detection tasks demonstrate the superiority of our model over all state-of-the-art models, achieving 81.16% mAP on the DOTA-V1.0 dataset.
arXiv Detail & Related papers (2022-08-08T09:08:40Z)
- Exploring Vision Transformers for Fine-grained Classification [0.0]
We propose a multi-stage ViT framework for fine-grained image classification tasks, which localizes the informative image regions without requiring architectural changes.
We demonstrate the value of our approach by experimenting with four popular fine-grained benchmarks: CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC7 Plant Pathology.
arXiv Detail & Related papers (2021-06-19T23:57:31Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models that outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, abbreviated from "Vision-friendly Transformer".
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long-range dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
- CrossTransformers: spatially-aware few-shot transfer [92.33252608837947]
Given new tasks with very little data, modern vision systems degrade remarkably quickly.
We show how the neural network representations which underpin modern vision systems are subject to supervision collapse.
We propose self-supervised learning to encourage general-purpose features that transfer better.
arXiv Detail & Related papers (2020-07-22T15:37:08Z)
- Fine-grained Image-to-Image Transformation towards Visual Recognition [102.51124181873101]
We aim to transform an image of a fine-grained category into synthesized new images that preserve the identity of the input image.
We adopt a model based on generative adversarial networks to disentangle the identity related and unrelated factors of an image.
Experiments on the CompCars and Multi-PIE datasets demonstrate that our model preserves the identity of the generated images much better than the state-of-the-art image-to-image transformation models.
arXiv Detail & Related papers (2020-01-12T05:26:47Z)
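For the class-embedding-space analysis above ("Analyzing Vision Transformers for Image Classification in Class Embedding Space"), here is a rough, logit-lens-style illustration: take the token representations at an intermediate block and project them through the model's final classification head. This is a schematic sketch under our own assumptions (the `timm` library and its ViT attribute layout), not the authors' exact procedure.

```python
# Schematic sketch of projecting intermediate ViT representations onto the
# learned class space (a logit-lens-style probe). Not the authors' exact
# method; assumes the `timm` library and its ViT attribute layout.
import torch
import timm

# pretrained=True downloads weights; the probe is only meaningful with them.
model = timm.create_model("vit_tiny_patch16_224", pretrained=True).eval()
hidden = {}

def save(idx):
    def hook(module, inputs, output):
        hidden[idx] = output            # (B, N, dim) tokens after block idx
    return hook

for i, block in enumerate(model.blocks):
    block.register_forward_hook(save(i))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))
    early_cls = model.norm(hidden[2])[:, 0]  # class token after block 2
    logits = model.head(early_cls)           # project onto the class space
print(logits.topk(3).indices)                # classes inferred from an early layer
```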
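And for the prompt-tuning recipe above ("Visual Prompt Tuning for Generative Transfer Learning"), a minimal sketch of the mechanism it describes: freeze a pre-trained transformer and prepend a few learnable prompt tokens to the image token sequence, so that only those tokens are trained. The encoder, dimensions, and names below are our assumptions.

```python
# Minimal sketch of prompt tuning: prepend learnable "prompt" tokens to the
# image token sequence of a frozen transformer and train only the prompts.
# Illustrative only; encoder, shapes, and names are assumptions.
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    def __init__(self, encoder, dim=192, num_prompts=5):
        super().__init__()
        self.encoder = encoder                 # frozen, pre-trained transformer
        for p in self.encoder.parameters():
            p.requires_grad = False
        # The only new trainable parameters: the prompt tokens.
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)

    def forward(self, image_tokens):           # image_tokens: (B, N, dim)
        prompts = self.prompts.expand(image_tokens.size(0), -1, -1)
        return self.encoder(torch.cat([prompts, image_tokens], dim=1))

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=192, nhead=3, batch_first=True),
    num_layers=2)
tokens = torch.randn(2, 196, 192)              # e.g. 14x14 patch tokens
print(PromptedEncoder(encoder)(tokens).shape)  # torch.Size([2, 201, 192])
```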