Global-Local Similarity for Efficient Fine-Grained Image Recognition with Vision Transformers
- URL: http://arxiv.org/abs/2407.12891v1
- Date: Wed, 17 Jul 2024 10:04:54 GMT
- Title: Global-Local Similarity for Efficient Fine-Grained Image Recognition with Vision Transformers
- Authors: Edwin Arkel Rios, Min-Chun Hu, Bo-Cheng Lai
- Abstract summary: Fine-grained recognition involves the classification of images from subordinate macro-categories.
We propose a novel and computationally inexpensive metric to identify discriminative regions in an image.
Our method achieves these results at a much lower computational cost compared to the alternatives.
- Score: 5.825612611197359
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-grained recognition involves the classification of images from subordinate macro-categories, and it is challenging due to small inter-class differences. To overcome this, most methods perform discriminative feature selection enabled by a feature extraction backbone followed by a high-level feature refinement step. Recently, many studies have shown the potential behind vision transformers as a backbone for fine-grained recognition, but their usage of its attention mechanism to select discriminative tokens can be computationally expensive. In this work, we propose a novel and computationally inexpensive metric to identify discriminative regions in an image. We compare the similarity between the global representation of an image given by the CLS token, a learnable token used by transformers for classification, and the local representation of individual patches. We select the regions with the highest similarity to obtain crops, which are forwarded through the same transformer encoder. Finally, high-level features of the original and cropped representations are further refined together in order to make more robust predictions. Through extensive experimental evaluation we demonstrate the effectiveness of our proposed method, obtaining favorable results in terms of accuracy across a variety of datasets. Furthermore, our method achieves these results at a much lower computational cost compared to the alternatives. Code and checkpoints are available at: \url{https://github.com/arkel23/GLSim}.
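The core selection step described in the abstract, comparing the global CLS representation against each local patch representation and keeping the most similar regions, can be sketched in a minimal example. This is a hypothetical illustration, not the authors' implementation (see the linked GLSim repository for that); the use of cosine similarity and the `top_k` parameter are assumptions.

```python
import numpy as np

def select_discriminative_patches(cls_token, patch_tokens, top_k=8):
    """Rank patches by cosine similarity between the global CLS
    representation and each local patch representation, and return
    the indices of the top_k most similar patches plus all scores.

    cls_token:    (d,)   global image representation (CLS token)
    patch_tokens: (n, d) local patch representations
    """
    cls_norm = cls_token / np.linalg.norm(cls_token)
    patch_norm = patch_tokens / np.linalg.norm(patch_tokens, axis=1, keepdims=True)
    sims = patch_norm @ cls_norm          # (n,) cosine similarities
    order = np.argsort(-sims)             # most similar first
    return order[:top_k], sims

# A crop can then be taken as the bounding box covering the selected
# patch coordinates and forwarded through the same encoder.
```

Because this metric reuses representations the transformer already computes, it avoids the extra cost of aggregating attention weights across layers and heads.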
Related papers
- PlaceFormer: Transformer-based Visual Place Recognition using Multi-Scale Patch Selection and Fusion [2.3020018305241337]
PlaceFormer is a transformer-based approach for visual place recognition.
PlaceFormer employs patch tokens from the transformer to create global image descriptors.
It selects patches that correspond to task-relevant areas in an image.
arXiv Detail & Related papers (2024-01-23T20:28:06Z)
- Convolutional autoencoder-based multimodal one-class classification [80.52334952912808]
One-class classification refers to approaches of learning using data from a single class only.
We propose a deep learning one-class classification method suitable for multimodal data.
arXiv Detail & Related papers (2023-09-25T12:31:18Z)
- Fine-grained Recognition with Learnable Semantic Data Augmentation [68.48892326854494]
Fine-grained image recognition is a longstanding computer vision challenge.
We propose diversifying the training data at the feature-level to alleviate the discriminative region loss problem.
Our method significantly improves the generalization performance on several popular classification networks.
arXiv Detail & Related papers (2023-09-01T11:15:50Z)
- Learning A Sparse Transformer Network for Effective Image Deraining [42.01684644627124]
We propose an effective deraining network, the Sparse Transformer (DRSformer).
We develop a learnable top-k selection operator to adaptively retain the most crucial attention scores from the keys for each query for better feature aggregation.
We equip our model with a mixture-of-experts feature compensator to enable a cooperative refinement deraining scheme.
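The learnable top-k selection over attention scores mentioned above can be illustrated with a simplified sketch: keep only the k largest scores per query, mask the rest, and renormalize. This is an assumption-laden illustration of the general technique, not the DRSformer operator itself (which learns k adaptively).

```python
import numpy as np

def topk_attention(scores, k):
    """Keep only the k largest raw attention scores in each query row,
    mask the remainder to -inf, and renormalize with a softmax, so each
    query aggregates features only from its most relevant keys."""
    masked = np.full_like(scores, -np.inf)
    idx = np.argsort(-scores, axis=-1)[:, :k]   # top-k columns per row
    rows = np.arange(scores.shape[0])[:, None]
    masked[rows, idx] = scores[rows, idx]
    # numerically stable softmax; exp(-inf) evaluates to exactly 0
    exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)
```

The masked entries contribute zero weight, which sparsifies the aggregation while keeping each row a valid probability distribution.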
arXiv Detail & Related papers (2023-03-21T15:41:57Z)
- Hierarchical Forgery Classifier On Multi-modality Face Forgery Clues [61.37306431455152]
We propose a novel Hierarchical Forgery Classifier for Multi-modality Face Forgery Detection (HFC-MFFD).
The HFC-MFFD learns a robust patch-based hybrid representation to enhance forgery authentication in multiple-modality scenarios.
A specific hierarchical face forgery classifier is proposed to alleviate the class imbalance problem and further boost detection performance.
arXiv Detail & Related papers (2022-12-30T10:54:29Z)
- LEAD: Self-Supervised Landmark Estimation by Aligning Distributions of Feature Similarity [49.84167231111667]
Existing works in self-supervised landmark detection are based on learning dense (pixel-level) feature representations from an image.
We introduce an approach to enhance the learning of dense equivariant representations in a self-supervised fashion.
We show that having such a prior in the feature extractor helps in landmark detection, even with a drastically limited number of annotations.
arXiv Detail & Related papers (2022-04-06T17:48:18Z)
- Efficient Video Transformers with Spatial-Temporal Token Selection [68.27784654734396]
We present STTS, a token selection framework that dynamically selects a few informative tokens in both temporal and spatial dimensions conditioned on input video samples.
Our framework achieves similar results while requiring 20% less computation.
arXiv Detail & Related papers (2021-11-23T00:35:58Z)
- TransFG: A Transformer Architecture for Fine-grained Recognition [27.76159820385425]
Recently, the vision transformer (ViT) has shown strong performance in traditional classification tasks.
We propose a novel transformer-based framework TransFG where we integrate all raw attention weights of the transformer into an attention map.
A contrastive loss is applied to further enlarge the distance between feature representations of similar sub-classes.
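Integrating raw attention weights across layers, as TransFG's summary describes, is commonly done by multiplying the per-layer attention matrices together (attention-rollout style), so the CLS row of the result scores each patch's contribution to the final representation. A minimal sketch of that general idea, not the TransFG code:

```python
import numpy as np

def integrate_attention(layer_attns):
    """Combine per-layer attention maps into a single map by successive
    matrix multiplication (attention-rollout style).

    layer_attns: list of (n, n) row-stochastic attention matrices,
                 ordered from the first layer to the last.
    The CLS row of the result indicates how strongly each token
    contributes to the final global representation."""
    joint = layer_attns[0]
    for attn in layer_attns[1:]:
        joint = attn @ joint   # propagate attention through the next layer
    return joint
```

Because each factor is row-stochastic, the integrated map remains a valid distribution over input tokens for every query, which is what makes it usable for token selection. Note that this per-layer aggregation is exactly the cost that the CLS-patch similarity metric above is designed to avoid.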
arXiv Detail & Related papers (2021-03-14T17:03:53Z)
- Re-rank Coarse Classification with Local Region Enhanced Features for Fine-Grained Image Recognition [22.83821575990778]
We re-rank the top-N classification results using local-region-enhanced embedding features to improve top-1 accuracy.
To learn more effective semantic global features, we design a multi-level loss over an automatically constructed hierarchical category structure.
Our method achieves state-of-the-art performance on three benchmarks: CUB-200-2011, Stanford Cars, and FGVC Aircraft.
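The re-ranking step described above can be illustrated schematically: fuse each candidate class's global classification score with the similarity between the query's local-region feature and a per-class reference feature, then sort. The score fusion, the cosine metric, and the `alpha` weight are all assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def rerank_topn(topn_classes, global_scores, local_feat, class_local_feats, alpha=0.5):
    """Re-rank a coarse top-N prediction list.

    topn_classes:      candidate class ids from the coarse classifier
    global_scores:     mapping class id -> global classification score
    local_feat:        (d,) local-region feature of the query image
    class_local_feats: mapping class id -> (d,) reference local feature
    alpha:             weight between global score and local similarity
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    fused = [alpha * global_scores[c] + (1 - alpha) * cos(local_feat, class_local_feats[c])
             for c in topn_classes]
    order = np.argsort(-np.array(fused))   # best fused score first
    return [topn_classes[i] for i in order]
```

In this scheme a class ranked second globally can overtake the coarse top-1 when its local-region evidence is much stronger, which is the mechanism the summary credits for the top-1 accuracy gain.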
arXiv Detail & Related papers (2021-02-19T11:30:25Z)
- Image Fine-grained Inpainting [89.17316318927621]
We present a one-stage model that utilizes dense combinations of dilated convolutions to obtain larger and more effective receptive fields.
To better train this efficient generator, in addition to the frequently used VGG feature-matching loss, we design a novel self-guided regression loss.
We also employ a discriminator with local and global branches to ensure local-global contents consistency.
arXiv Detail & Related papers (2020-02-07T03:45:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.