Exploring Vision Transformers for Fine-grained Classification
- URL: http://arxiv.org/abs/2106.10587v1
- Date: Sat, 19 Jun 2021 23:57:31 GMT
- Title: Exploring Vision Transformers for Fine-grained Classification
- Authors: Marcos V. Conde and Kerem Turgutlu
- Abstract summary: We propose a multi-stage ViT framework for fine-grained image classification tasks, which localizes the informative image regions without requiring architectural changes.
We demonstrate the value of our approach by experimenting with four popular fine-grained benchmarks: CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC7 Plant Pathology.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing computer vision research in categorization struggles with
fine-grained attribute recognition due to the inherently high intra-class
variance and low inter-class variance. SOTA methods tackle this challenge by
locating the most informative image regions and relying on them to classify the
complete image. The recent Vision Transformer (ViT) has shown
strong performance in both traditional and fine-grained classification tasks.
In this work, we propose a multi-stage ViT framework for fine-grained image
classification tasks, which localizes the informative image regions through the
inherent multi-head self-attention mechanism, without requiring any
architectural changes. We also introduce attention-guided augmentations to improve the
model's capabilities. We demonstrate the value of our approach by experimenting
with four popular fine-grained benchmarks: CUB-200-2011, Stanford Cars,
Stanford Dogs, and FGVC7 Plant Pathology. We also show our model's
interpretability via qualitative results.
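The abstract's core idea, reading informative regions directly off a ViT's own multi-head self-attention, can be illustrated with a short sketch. The snippet below is not the authors' code: it assumes the timm library, uses attention rollout as the localization heuristic, and crops around a mean-based threshold, all illustrative choices rather than details taken from the paper.

```python
import torch
import timm

# Backbone ViT; pretrained weights give meaningful attention maps.
model = timm.create_model("vit_base_patch16_224", pretrained=True)
model.eval()

attn_maps = []

def save_attn(module, inputs, output):
    # timm's Attention block does not expose its weights, so recompute them
    # from the qkv projection (mirrors timm's own forward pass).
    x = inputs[0]
    B, N, C = x.shape
    qkv = module.qkv(x).reshape(B, N, 3, module.num_heads, C // module.num_heads)
    q, k, _ = qkv.permute(2, 0, 3, 1, 4)
    attn_maps.append(((q @ k.transpose(-2, -1)) * module.scale).softmax(dim=-1))

hooks = [blk.attn.register_forward_hook(save_attn) for blk in model.blocks]

img = torch.randn(1, 3, 224, 224)            # stand-in for a preprocessed image
with torch.no_grad():
    logits = model(img)                      # stage-1 prediction

# Attention rollout: propagate (attention + identity) through the layers and
# read how strongly the CLS token attends to each image patch.
n_tokens = attn_maps[0].shape[-1]
rollout = torch.eye(n_tokens)
for attn in attn_maps:
    a = attn.mean(dim=1)[0] + torch.eye(n_tokens)   # average heads, add identity
    a = a / a.sum(dim=-1, keepdim=True)
    rollout = a @ rollout
heat = rollout[0, 1:].reshape(14, 14)        # CLS -> patches, 14x14 grid for 224/16

# Keep patches above a (hypothetical) mean threshold and crop their bounding box.
ys, xs = (heat > heat.mean()).nonzero(as_tuple=True)
y0, y1 = int(ys.min()) * 16, (int(ys.max()) + 1) * 16
x0, x1 = int(xs.min()) * 16, (int(xs.max()) + 1) * 16
crop = img[:, :, y0:y1, x0:x1]

for h in hooks:
    h.remove()
```

In a multi-stage setup of this kind, the crop would typically be resized back to the input resolution, classified again, and the stage predictions combined.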
Related papers
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy to bolster image classification performance is through augmenting the training set with synthetic images generated by T2I models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
arXiv Detail & Related papers (2024-03-28T17:23:45Z)
- A Comprehensive Study of Vision Transformers in Image Classification Tasks [0.46040036610482665]
We conduct a comprehensive survey of existing papers on Vision Transformers for image classification.
We first introduce the popular image classification datasets that influenced the design of models.
We present Vision Transformer models in chronological order, starting with early attempts at adapting the attention mechanism to vision tasks.
arXiv Detail & Related papers (2023-12-02T21:38:16Z)
- Locality-Aware Hyperspectral Classification [8.737375836744933]
We introduce the Hyperspectral Locality-aware Image TransformEr (HyLITE), a vision transformer that models both local and spectral information.
Our proposed approach outperforms competing baselines by a significant margin, achieving up to 10% gains in accuracy.
arXiv Detail & Related papers (2023-09-04T12:29:32Z)
- Fine-grained Recognition with Learnable Semantic Data Augmentation [68.48892326854494]
Fine-grained image recognition is a longstanding computer vision challenge.
We propose diversifying the training data at the feature-level to alleviate the discriminative region loss problem.
Our method significantly improves the generalization performance on several popular classification networks.
arXiv Detail & Related papers (2023-09-01T11:15:50Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring the intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Towards Fine-grained Image Classification with Generative Adversarial Networks and Facial Landmark Detection [0.0]
We use GAN-based data augmentation to generate extra dataset instances.
We validate our work by evaluating fine-grained image classification accuracy with the recent Vision Transformer (ViT) model.
arXiv Detail & Related papers (2021-06-17T19:57:33Z)
- Efficient Self-supervised Vision Transformers for Representation Learning [86.57557009109411]
We show that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity.
We propose a new pre-training task of region matching which allows the model to capture fine-grained region dependencies.
Our results show that, by combining the two techniques, EsViT achieves 81.3% top-1 accuracy on the ImageNet linear probe evaluation.
arXiv Detail & Related papers (2021-06-17T19:57:33Z)
- TransFG: A Transformer Architecture for Fine-grained Recognition [27.76159820385425]
Recently, the vision transformer (ViT) has shown strong performance on traditional classification tasks.
We propose a novel transformer-based framework, TransFG, where we integrate all raw attention weights of the transformer into an attention map.
A contrastive loss is applied to further enlarge the distance between feature representations of similar sub-classes (a minimal sketch of such a loss follows below).
arXiv Detail & Related papers (2021-03-14T17:03:53Z)
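The contrastive objective mentioned in the TransFG entry can be illustrated with a minimal sketch: pull together embeddings of the same sub-class and penalize high similarity between different sub-classes. The function below is an assumption-laden illustration, not TransFG's released loss; the margin value and the cosine-similarity formulation are illustrative choices.

```python
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(features: torch.Tensor,
                              labels: torch.Tensor,
                              margin: float = 0.4) -> torch.Tensor:
    """features: (B, D) embeddings, e.g. ViT CLS tokens; labels: (B,) sub-class ids."""
    feats = F.normalize(features, dim=-1)
    sim = feats @ feats.t()                               # pairwise cosine similarity
    same = labels.unsqueeze(0) == labels.unsqueeze(1)     # (B, B) same-class mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)

    pos = (1.0 - sim)[same & ~eye]      # same sub-class: push similarity toward 1
    neg = F.relu(sim - margin)[~same]   # different sub-class: penalize similarity above the margin
    parts = [t for t in (pos, neg) if t.numel() > 0]
    return torch.cat(parts).mean() if parts else sim.new_zeros(())

# Usage: add this term to the cross-entropy loss on a batch of embeddings.
feats = torch.randn(8, 768, requires_grad=True)
labels = torch.randint(0, 5, (8,))
loss = pairwise_contrastive_loss(feats, labels)
loss.backward()
```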
This list is automatically generated from the titles and abstracts of the papers on this site.