TransFG: A Transformer Architecture for Fine-grained Recognition
- URL: http://arxiv.org/abs/2103.07976v3
- Date: Wed, 17 Mar 2021 04:03:30 GMT
- Title: TransFG: A Transformer Architecture for Fine-grained Recognition
- Authors: Ju He, Jieneng Chen, Shuai Liu, Adam Kortylewski, Cheng Yang, Yutong
Bai, Changhu Wang, Alan Yuille
- Abstract summary: Recently, the vision transformer (ViT) has shown strong performance in the traditional classification task.
We propose a novel transformer-based framework TransFG where we integrate all raw attention weights of the transformer into an attention map.
A contrastive loss is applied to further enlarge the distance between feature representations of similar sub-classes.
- Score: 27.76159820385425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-grained visual classification (FGVC) which aims at recognizing objects
from subcategories is a very challenging task due to the inherently subtle
inter-class differences. Recent works mainly tackle this problem by focusing on
how to locate the most discriminative image regions and rely on them to improve
the capability of networks to capture subtle variances. Most of these works
achieve this by re-using the backbone network to extract features of selected
regions. However, this strategy inevitably complicates the pipeline and pushes
the proposed regions to contain most parts of the objects. Recently, the
vision transformer (ViT) has shown strong performance in the traditional
classification task. The self-attention mechanism of the transformer links
every patch token to the classification token. The strength of the attention
link can be intuitively considered as an indicator of the importance of tokens.
In this work, we propose a novel transformer-based framework TransFG where we
integrate all raw attention weights of the transformer into an attention map
for guiding the network to effectively and accurately select discriminative
image patches and compute their relations. A contrastive loss is applied to
further enlarge the distance between feature representations of similar
sub-classes. We demonstrate the value of TransFG by conducting experiments on
five popular fine-grained benchmarks: CUB-200-2011, Stanford Cars, Stanford
Dogs, NABirds and iNat2017, where we achieve state-of-the-art performance.
Qualitative results are presented for a better understanding of our model.
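As a rough illustration of the two mechanisms the abstract describes, the sketch below fuses the per-layer raw attention weights into a single attention map (an attention-rollout-style product over layers), keeps the patch tokens most strongly linked to the classification token, and applies a margin-based contrastive loss on the classification features of similar sub-classes. This is a minimal sketch under assumed tensor shapes, not the authors' released implementation; the token budget k, the margin value, and the use of plain top-k selection are illustrative assumptions.

```python
# Hypothetical sketch of attention-map-guided patch selection and a
# margin-based contrastive loss; shapes and hyperparameters are assumptions.
import torch
import torch.nn.functional as F


def fused_attention_map(attn_per_layer):
    """attn_per_layer: list of L tensors, each (B, heads, N, N) of raw softmax
    attention weights. Averages heads, then matrix-multiplies the per-layer
    maps (attention-rollout style) into one (B, N, N) map."""
    fused = None
    for attn in attn_per_layer:
        layer_map = attn.mean(dim=1)                     # (B, N, N), heads averaged
        fused = layer_map if fused is None else torch.bmm(layer_map, fused)
    return fused


def select_discriminative_tokens(tokens, attn_per_layer, k=12):
    """tokens: (B, N, D) with token 0 = CLS. Keeps CLS plus the k patch
    tokens that the fused map links most strongly to CLS."""
    fused = fused_attention_map(attn_per_layer)          # (B, N, N)
    cls_to_patch = fused[:, 0, 1:]                       # CLS -> patch strengths, (B, N-1)
    top_idx = cls_to_patch.topk(k, dim=-1).indices + 1   # +1 skips the CLS slot
    gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    selected = torch.gather(tokens, 1, gather_idx)       # (B, k, D)
    return torch.cat([tokens[:, :1], selected], dim=1)   # (B, k + 1, D)


def contrastive_loss(cls_feat, labels, margin=0.4):
    """Pulls same-class CLS features together and pushes different-class
    features apart once their cosine similarity exceeds the margin."""
    z = F.normalize(cls_feat, dim=-1)                    # (B, D), unit norm
    sim = z @ z.t()                                      # pairwise cosine similarity
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    attract = (1.0 - sim) * same                         # same sub-class pairs
    repel = F.relu(sim - margin) * (1.0 - same)          # different sub-class pairs
    n = labels.size(0)
    return (attract.sum() + repel.sum()) / (n * n)
```

In the paper's description, the selected patches would then be passed on so the network can "compute their relations" (presumably a further transformer layer over the selected tokens); that step is omitted here for brevity.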
Related papers
- Hierarchical Graph Interaction Transformer with Dynamic Token Clustering for Camouflaged Object Detection [57.883265488038134]
We propose a hierarchical graph interaction network termed HGINet for camouflaged object detection.
The network is capable of discovering imperceptible objects via effective graph interaction among the hierarchical tokenized features.
Our experiments demonstrate the superior performance of HGINet compared to existing state-of-the-art methods.
arXiv Detail & Related papers (2024-08-27T12:53:25Z)
- Global-Local Similarity for Efficient Fine-Grained Image Recognition with Vision Transformers [5.825612611197359]
Fine-grained recognition involves classifying images into the subordinate categories of a macro-category.
We propose a novel and computationally inexpensive metric to identify discriminative regions in an image.
Our method achieves these results at a much lower computational cost compared to the alternatives.
arXiv Detail & Related papers (2024-07-17T10:04:54Z)
- Fine-grained Recognition with Learnable Semantic Data Augmentation [68.48892326854494]
Fine-grained image recognition is a longstanding computer vision challenge.
We propose diversifying the training data at the feature-level to alleviate the discriminative region loss problem.
Our method significantly improves the generalization performance on several popular classification networks.
arXiv Detail & Related papers (2023-09-01T11:15:50Z)
- Accurate Image Restoration with Attention Retractable Transformer [50.05204240159985]
We propose Attention Retractable Transformer (ART) for image restoration.
ART presents both dense and sparse attention modules in the network.
We conduct extensive experiments on image super-resolution, denoising, and JPEG compression artifact reduction tasks.
arXiv Detail & Related papers (2022-10-04T07:35:01Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition [26.090419694326823]
Localization and amplification of region attention is an important factor, which has been explored extensively by convolutional neural network (CNN) based approaches.
We propose the recurrent attention multi-scale transformer (RAMS-Trans) which uses the transformer's self-attention to learn discriminative region attention.
arXiv Detail & Related papers (2021-07-17T06:22:20Z)
- HAT: Hierarchical Aggregation Transformers for Person Re-identification [87.02828084991062]
We take advantage of both CNNs and Transformers for image-based person Re-ID with high performance.
This work is the first to take advantage of both CNNs and Transformers for image-based person Re-ID.
arXiv Detail & Related papers (2021-07-13T09:34:54Z)
- Feature Fusion Vision Transformer for Fine-Grained Visual Categorization [22.91753200323264]
We propose a novel pure transformer-based framework, Feature Fusion Vision Transformer (FFVT).
We aggregate the important tokens from each transformer layer to compensate for the local, low-level and middle-level information (a minimal sketch of this kind of cross-layer token aggregation is given after this list).
We design a novel token selection module called mutual attention weight selection (MAWS) to guide the network effectively and efficiently towards selecting discriminative tokens.
arXiv Detail & Related papers (2021-07-06T01:48:43Z)
- Exploring Vision Transformers for Fine-grained Classification [0.0]
We propose a multi-stage ViT framework for fine-grained image classification tasks, which localizes the informative image regions without requiring architectural changes.
We demonstrate the value of our approach by experimenting with four popular fine-grained benchmarks: CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC7 Plant Pathology.
arXiv Detail & Related papers (2021-06-19T23:57:31Z)
- Context-aware Attentional Pooling (CAP) for Fine-grained Visual Classification [2.963101656293054]
Deep convolutional neural networks (CNNs) have shown a strong ability in mining discriminative object pose and parts information for image recognition.
We propose a novel context-aware attentional pooling (CAP) that effectively captures subtle changes via sub-pixel gradients.
We evaluate our approach using six state-of-the-art (SotA) backbone networks and eight benchmark datasets.
arXiv Detail & Related papers (2021-01-17T10:15:02Z)
- Transformer Interpretability Beyond Attention Visualization [87.96102461221415]
Self-attention techniques, and specifically Transformers, are dominating the field of text processing.
In this work, we propose a novel way to compute relevancy for Transformer networks.
arXiv Detail & Related papers (2020-12-17T18:56:33Z)
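As referenced in the Feature Fusion Vision Transformer (FFVT) entry above, the sketch below illustrates one plausible form of cross-layer token aggregation: at every transformer layer, the patch tokens most attended to by the classification token are kept, and the collected tokens are concatenated for a final fusion step. The per-layer budget k and the use of plain CLS attention in place of that paper's MAWS module are simplifying assumptions, not the published method.

```python
# Hypothetical sketch of cross-layer token aggregation for a ViT backbone.
import torch


def aggregate_layer_tokens(tokens_per_layer, attn_per_layer, k=4):
    """tokens_per_layer: list of (B, N, D) hidden states, token 0 = CLS.
    attn_per_layer: list of (B, heads, N, N) attention weights.
    Returns (B, L*k, D): the k most CLS-attended patch tokens from each layer."""
    picked = []
    for tokens, attn in zip(tokens_per_layer, attn_per_layer):
        cls_attn = attn.mean(dim=1)[:, 0, 1:]           # (B, N-1) CLS -> patch weights
        idx = cls_attn.topk(k, dim=-1).indices + 1      # +1 skips the CLS slot
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        picked.append(torch.gather(tokens, 1, idx))     # (B, k, D) per layer
    return torch.cat(picked, dim=1)                     # concatenate across layers
```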
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.