Siamese Transformer Networks for Few-shot Image Classification
- URL: http://arxiv.org/abs/2408.01427v1
- Date: Tue, 16 Jul 2024 14:27:23 GMT
- Title: Siamese Transformer Networks for Few-shot Image Classification
- Authors: Weihao Jiang, Shuoxi Zhang, Kun He
- Abstract summary: Humans exhibit remarkable proficiency in visual classification tasks, accurately recognizing and classifying new images with minimal examples.
Existing few-shot image classification methods often emphasize either global features or local features, with few studies considering the integration of both.
We propose a novel approach based on the Siamese Transformer Network (STN).
Our strategy effectively harnesses the potential of global and local features in few-shot image classification, circumventing the need for complex feature adaptation modules.
- Score: 9.55588609556447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans exhibit remarkable proficiency in visual classification tasks, accurately recognizing and classifying new images with minimal examples. This ability is attributed to their capacity to focus on details and identify common features between previously seen and new images. In contrast, existing few-shot image classification methods often emphasize either global features or local features, with few studies considering the integration of both. To address this limitation, we propose a novel approach based on the Siamese Transformer Network (STN). Our method employs two parallel branch networks utilizing the pre-trained Vision Transformer (ViT) architecture to extract global and local features, respectively. Specifically, we implement the ViT-Small network architecture and initialize the branch networks with pre-trained model parameters obtained through self-supervised learning. We apply the Euclidean distance measure to the global features and the Kullback-Leibler (KL) divergence measure to the local features. To integrate the two metrics, we first employ L2 normalization and then weight the normalized results to obtain the final similarity score. This strategy leverages the advantages of both global and local features while ensuring their complementary benefits. During the training phase, we adopt a meta-learning approach to fine-tune the entire network. Our strategy effectively harnesses the potential of global and local features in few-shot image classification, circumventing the need for complex feature adaptation modules and enhancing the model's generalization ability. Extensive experiments demonstrate that our framework is simple yet effective, achieving superior performance compared to state-of-the-art baselines on four popular few-shot classification benchmarks in both 5-shot and 1-shot scenarios.
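The metric fusion described in the abstract (Euclidean distance on global features, KL divergence on local features, L2 normalization, then a weighted sum) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the array shapes, the softmax used to turn patch features into distributions, the averaging over patches, the normalization across classes, and the weight `alpha` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def euclidean_scores(query_global, proto_global):
    # query_global: (Q, D), proto_global: (N, D) -> (Q, N).
    # Negative distance, so a larger score means more similar.
    diff = query_global[:, None, :] - proto_global[None, :, :]
    return -np.linalg.norm(diff, axis=-1)

def kl_scores(query_local, proto_local, eps=1e-8):
    # query_local: (Q, P, D), proto_local: (N, P, D) -> (Q, N).
    # Assumption: each patch feature becomes a distribution via softmax,
    # and KL(query || prototype) is averaged over the P patches.
    p = softmax(query_local)[:, None]   # (Q, 1, P, D)
    q = softmax(proto_local)[None]      # (1, N, P, D)
    kl = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1).mean(axis=-1)
    return -kl

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale each row to (at most) unit L2 norm.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def combined_scores(qg, pg, ql, pl, alpha=0.5):
    # Assumption: each metric is L2-normalized across classes per query,
    # then mixed with a hypothetical weight alpha in [0, 1].
    s_global = l2_normalize(euclidean_scores(qg, pg))
    s_local = l2_normalize(kl_scores(ql, pl))
    return alpha * s_global + (1.0 - alpha) * s_local

# Toy 5-way episode: 5 class prototypes, 4 patches, 16-dim features.
rng = np.random.default_rng(0)
proto_g = rng.normal(size=(5, 16))
proto_l = rng.normal(size=(5, 4, 16))
# Two queries that coincide with prototypes 2 and 0.
scores = combined_scores(proto_g[[2, 0]], proto_g, proto_l[[2, 0]], proto_l)
predicted = scores.argmax(axis=1)
```

Normalizing before weighting puts the two metrics on a comparable scale, so neither the Euclidean term nor the KL term dominates the sum simply because of its raw magnitude.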
Related papers
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
- Affine-Consistent Transformer for Multi-Class Cell Nuclei Detection [76.11864242047074]
We propose a novel Affine-Consistent Transformer (AC-Former), which directly yields a sequence of nucleus positions.
We introduce an Adaptive Affine Transformer (AAT) module, which can automatically learn the key spatial transformations to warp original images for local network training.
Experimental results demonstrate that the proposed method significantly outperforms existing state-of-the-art algorithms on various benchmarks.
arXiv Detail & Related papers (2023-10-22T02:27:02Z)
- Keypoint-Augmented Self-Supervised Learning for Medical Image Segmentation with Limited Annotation [21.203307064937142]
We present a keypoint-augmented fusion layer that extracts representations preserving both short- and long-range self-attention.
In particular, we augment the CNN feature map at multiple scales by incorporating an additional input that learns long-range spatial self-attention.
Our method further outperforms existing SSL methods by producing more robust self-attention.
arXiv Detail & Related papers (2023-10-02T22:31:30Z)
- TOPIQ: A Top-down Approach from Semantics to Distortions for Image Quality Assessment [53.72721476803585]
Image Quality Assessment (IQA) is a fundamental task in computer vision that has witnessed remarkable progress with deep neural networks.
We propose a top-down approach that uses high-level semantics to guide the IQA network to focus on semantically important local distortion regions.
A key component of our approach is the proposed cross-scale attention mechanism, which calculates attention maps for lower level features.
arXiv Detail & Related papers (2023-08-06T09:08:37Z)
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
- VICRegL: Self-Supervised Learning of Local Visual Features [34.92750644059916]
This paper explores the fundamental trade-off between learning local and global features.
A new method called VICRegL is proposed that learns good global and local features simultaneously.
We demonstrate strong performance on linear classification and segmentation transfer tasks.
arXiv Detail & Related papers (2022-10-04T12:54:25Z)
- Cross-modal Local Shortest Path and Global Enhancement for Visible-Thermal Person Re-Identification [2.294635424666456]
We propose the Cross-modal Local Shortest Path and Global Enhancement (CM-LSP-GE) modules, a two-stream network based on joint learning of local and global features.
The experimental results on two typical datasets show that our model is clearly superior to most state-of-the-art methods.
arXiv Detail & Related papers (2022-06-09T10:27:22Z)
- Re-rank Coarse Classification with Local Region Enhanced Features for Fine-Grained Image Recognition [22.83821575990778]
We re-rank the TopN classification results by using the local region enhanced embedding features to improve the Top1 accuracy.
To learn more effective semantic global features, we design a multi-level loss over an automatically constructed hierarchical category structure.
Our method achieves state-of-the-art performance on three benchmarks: CUB-200-2011, Stanford Cars, and FGVC Aircraft.
arXiv Detail & Related papers (2021-02-19T11:30:25Z)
- Region Comparison Network for Interpretable Few-shot Image Classification [97.97902360117368]
Few-shot image classification has been proposed to effectively use only a limited number of labeled examples to train models for new classes.
We propose a metric learning based method named Region Comparison Network (RCN), which is able to reveal how few-shot learning works.
We also present a new way to generalize the interpretability from the level of tasks to categories.
arXiv Detail & Related papers (2020-09-08T07:29:05Z)
- Inter-Image Communication for Weakly Supervised Localization [77.2171924626778]
Weakly supervised localization aims at finding target object regions using only image-level supervision.
We propose to leverage pixel-level similarities across different objects for learning more accurate object locations.
Our method achieves the Top-1 localization error rate of 45.17% on the ILSVRC validation set.
arXiv Detail & Related papers (2020-08-12T04:14:11Z)
- ReMarNet: Conjoint Relation and Margin Learning for Small-Sample Image Classification [49.87503122462432]
We introduce a novel neural network termed Relation-and-Margin learning Network (ReMarNet).
Our method assembles two networks of different backbones so as to learn the features that can perform excellently in both of the aforementioned two classification mechanisms.
Experiments on four image datasets demonstrate that our approach is effective in learning discriminative features from a small set of labeled samples.
arXiv Detail & Related papers (2020-06-27T13:50:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.