Salient Mask-Guided Vision Transformer for Fine-Grained Classification
- URL: http://arxiv.org/abs/2305.07102v1
- Date: Thu, 11 May 2023 19:24:33 GMT
- Title: Salient Mask-Guided Vision Transformer for Fine-Grained Classification
- Authors: Dmitry Demidov, Muhammad Hamza Sharif, Aliakbar Abdurahimov, Hisham
Cholakkal, Fahad Shahbaz Khan
- Abstract summary: Fine-grained visual classification (FGVC) is a challenging computer vision problem.
One of its main difficulties is capturing the most discriminative inter-class variances.
We introduce a simple yet effective Salient Mask-Guided Vision Transformer (SM-ViT).
- Score: 48.1425692047256
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Fine-grained visual classification (FGVC) is a challenging computer vision
problem, where the task is to automatically recognise objects from subordinate
categories. One of its main difficulties is capturing the most discriminative
inter-class variances among visually similar classes. Recently, methods with
Vision Transformer (ViT) have demonstrated noticeable achievements in FGVC,
generally by employing the self-attention mechanism with additional
resource-consuming techniques to distinguish potentially discriminative regions
while disregarding the rest. However, such approaches may struggle to focus
effectively on truly discriminative regions, since they rely solely on the
inherent self-attention mechanism; as a result, the classification token is
likely to aggregate global information from less-important background patches.
Moreover, due to the scarcity of data points, classifiers may fail to find the
most helpful inter-class distinguishing features, since other unrelated but
distinctive background regions may be falsely recognised as valuable. To this
end, we introduce a simple yet effective Salient Mask-Guided Vision Transformer
(SM-ViT), where the discriminability of the standard ViT's attention maps is
boosted through salient masking of potentially discriminative foreground
regions. Extensive experiments demonstrate that, with the standard training
procedure, our SM-ViT achieves state-of-the-art performance on popular
FGVC benchmarks among existing ViT-based approaches while requiring fewer
resources and lower input image resolution.
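The abstract does not specify how the salient mask enters the attention computation. The sketch below is one plausible reading, assuming the saliency mask is applied as an additive bias on the classification token's attention logits over patch tokens; the function name, the `boost` parameter, and the binary-mask format are illustrative assumptions, not the paper's actual API.

```python
import numpy as np

def masked_attention(scores, saliency_mask, boost=1.0):
    """Hypothetical sketch of salient mask-guided attention.

    scores        -- raw (pre-softmax) attention logits of the classification
                     token over N patch tokens, shape (N,)
    saliency_mask -- binary vector of length N; 1 marks a patch that a
                     saliency detector labelled as foreground
    boost         -- additive bias applied to the logits of salient patches
    """
    biased = scores + boost * saliency_mask   # push attention toward foreground
    exp = np.exp(biased - biased.max())       # numerically stable softmax
    return exp / exp.sum()
```

Because the mask only biases the logits before the softmax re-normalisation, the result is still a valid attention distribution: salient patches gain weight relative to background patches, but the weights sum to one.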
Related papers
- Rethinking the Domain Gap in Near-infrared Face Recognition [65.7871950460781]
Heterogeneous face recognition (HFR) involves the intricate task of matching face images across the visible (VIS) and near-infrared (NIR) visual domains.
Much of the existing literature on HFR identifies the domain gap as a primary challenge and directs efforts towards bridging it at either the input or feature level.
We observe that large neural networks, unlike their smaller counterparts, when pre-trained on large scale homogeneous VIS data, demonstrate exceptional zero-shot performance in HFR.
arXiv Detail & Related papers (2023-12-01T14:43:28Z)
- Fine-grained Recognition with Learnable Semantic Data Augmentation [68.48892326854494]
Fine-grained image recognition is a longstanding computer vision challenge.
We propose diversifying the training data at the feature-level to alleviate the discriminative region loss problem.
Our method significantly improves the generalization performance on several popular classification networks.
arXiv Detail & Related papers (2023-09-01T11:15:50Z)
- Learning Common Rationale to Improve Self-Supervised Representation for Fine-Grained Visual Recognition Problems [61.11799513362704]
We propose learning an additional screening mechanism to identify discriminative clues commonly seen across instances and classes.
We show that a common rationale detector can be learned by simply exploiting the GradCAM induced from the SSL objective.
arXiv Detail & Related papers (2023-03-03T02:07:40Z)
- Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER).
Our method exploits self-supervised pretraining to learn good feature representations from the target data.
We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
arXiv Detail & Related papers (2022-10-11T08:24:50Z)
- R2-Trans: Fine-Grained Visual Categorization with Redundancy Reduction [21.11038841356125]
Fine-grained visual categorization (FGVC) aims to discriminate similar subcategories, whose main challenge is the large intra-class diversity and subtle inter-class differences.
We present a novel approach for FGVC, which can simultaneously make use of partial yet sufficient discriminative information in environmental cues and also compress the redundant information in class-token with respect to the target.
arXiv Detail & Related papers (2022-04-21T13:35:38Z)
- Mask-Guided Feature Extraction and Augmentation for Ultra-Fine-Grained Visual Categorization [15.627971638835948]
The Ultra-fine-grained visual categorization (Ultra-FGVC) problem has been understudied.
FGVC aims at classifying objects from the same species, while Ultra-FGVC targets the more challenging problem of classifying images at an ultra-fine granularity.
The challenges of Ultra-FGVC mainly come from two aspects: one is that Ultra-FGVC often suffers from overfitting due to the lack of training samples.
A mask-guided feature extraction and feature augmentation method is proposed in this paper to extract discriminative and informative regions of images.
arXiv Detail & Related papers (2021-09-16T06:57:05Z)
- Feature Fusion Vision Transformer for Fine-Grained Visual Categorization [22.91753200323264]
We propose a novel pure transformer-based framework, Feature Fusion Vision Transformer (FFVT).
We aggregate the important tokens from each transformer layer to compensate for local, low-level and middle-level information.
We design a novel token selection module called mutual attention weight selection (MAWS) to guide the network effectively and efficiently towards selecting discriminative tokens.
arXiv Detail & Related papers (2021-07-06T01:48:43Z)
- Exploring Vision Transformers for Fine-grained Classification [0.0]
We propose a multi-stage ViT framework for fine-grained image classification tasks, which localizes the informative image regions without requiring architectural changes.
We demonstrate the value of our approach by experimenting with four popular fine-grained benchmarks: CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC7 Plant Pathology.
arXiv Detail & Related papers (2021-06-19T23:57:31Z)
- Interpretable Attention Guided Network for Fine-grained Visual Classification [36.657203916383594]
Fine-grained visual classification (FGVC) is challenging but more critical than traditional classification tasks.
We propose an Interpretable Attention Guided Network (IAGN) for fine-grained visual classification.
arXiv Detail & Related papers (2021-03-08T12:27:51Z)
- Fine-Grained Visual Classification via Progressive Multi-Granularity Training of Jigsaw Patches [67.51747235117]
Fine-grained visual classification (FGVC) is much more challenging than traditional classification tasks.
Recent works mainly tackle this problem by focusing on how to locate the most discriminative parts.
We propose a novel framework for fine-grained visual classification to tackle these problems.
arXiv Detail & Related papers (2020-03-08T19:27:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences.