Semantic Cross Attention for Few-shot Learning
- URL: http://arxiv.org/abs/2210.06311v1
- Date: Wed, 12 Oct 2022 15:24:59 GMT
- Title: Semantic Cross Attention for Few-shot Learning
- Authors: Bin Xiao, Chien-Liang Liu, Wen-Hoar Hsaio
- Abstract summary: We propose a multi-task learning approach that treats the semantic features of label text as an auxiliary task.
Our proposed model uses word-embedding representations as semantic features to help train the embedding network, and a semantic cross-attention module to bridge the semantic features into the typical visual modality.
- Score: 9.529264466445236
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Few-shot learning (FSL) has attracted considerable attention recently. Among
existing approaches, the metric-based method aims to train an embedding network
that pulls similar samples close together while pushing dissimilar samples as far
apart as possible, and it achieves promising results. FSL is characterized by using only a few images
to train a model that can generalize to novel classes in image classification
problems, but this setting makes it difficult to learn visual features that
capture variations in the images' appearance. The model training is likely to
move in the wrong direction, as images in the same semantic class may
have dissimilar appearances, whereas images in different semantic classes
may share a similar appearance. We argue that FSL can benefit from additional
semantic features to learn discriminative feature representations. Thus, this
study proposes a multi-task learning approach that treats the semantic features of
label text as an auxiliary task to help boost the performance of the FSL task.
Our proposed model uses word-embedding representations as semantic features to
help train the embedding network and a semantic cross-attention module to
bridge the semantic features into the typical visual modality. The proposed
approach is simple but produces excellent results. We apply it to two previous
metric-based FSL methods, and it substantially improves the performance of both.
The source code for our model is available on GitHub.
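The abstract does not spell out the module's internals, but a minimal sketch of what a semantic cross-attention block bridging word-embedding features into the visual branch might look like is given below. All layer names, dimensions, and the residual fusion are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class SemanticCrossAttention(nn.Module):
    """Illustrative sketch: visual feature maps attend to label-text
    word embeddings, so semantic features guide the visual branch.
    Shapes and projections are assumptions, not the paper's code."""
    def __init__(self, visual_dim=640, semantic_dim=300, attn_dim=128):
        super().__init__()
        self.q = nn.Linear(visual_dim, attn_dim)      # queries from visual features
        self.k = nn.Linear(semantic_dim, attn_dim)    # keys from word embeddings
        self.v = nn.Linear(semantic_dim, visual_dim)  # values projected back to visual space
        self.scale = attn_dim ** -0.5

    def forward(self, visual, semantic):
        # visual:   (B, N, visual_dim)   flattened spatial features
        # semantic: (B, T, semantic_dim) word-embedding tokens of the label text
        attn = torch.softmax(
            self.q(visual) @ self.k(semantic).transpose(1, 2) * self.scale, dim=-1)
        return visual + attn @ self.v(semantic)  # residual fusion of semantic context
```

In the multi-task setup described above, the fused features would feed the metric-based classifier while an auxiliary loss supervises the semantic branch.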
Related papers
- Text Descriptions are Compressive and Invariant Representations for
Visual Learning [63.3464863723631]
We show that an alternative approach, in line with humans' understanding of multiple visual features per class, can provide compelling performance in the robust few-shot learning setting.
In particular, we introduce a novel method, SLR-AVD (Sparse Logistic Regression using Augmented Visual Descriptors).
This method first automatically generates multiple visual descriptions of each class via a large language model (LLM), then uses a VLM to translate these descriptions into a set of visual feature embeddings for each image, and finally uses sparse logistic regression to select a relevant subset of these features to classify each image.
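A hedged sketch of the final selection step: assuming per-image similarity features against the LLM-generated descriptions have already been computed with a VLM (e.g., CLIP-style embeddings), an L1-penalized logistic regression keeps only a sparse, relevant subset of descriptors. The data here is random placeholder input for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X: (n_images, n_descriptions) similarities between each image embedding and
# each LLM-generated description embedding (assumed precomputed with a VLM).
# y: (n_images,) class labels for the few labeled shots.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 120))   # placeholder features for illustration
y = rng.integers(0, 5, size=40)  # placeholder labels

# The L1 penalty zeroes out irrelevant descriptors, keeping a sparse subset.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)
selected = np.flatnonzero(np.any(clf.coef_ != 0, axis=0))
print(f"{selected.size} of {X.shape[1]} visual descriptors kept")
```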
arXiv Detail & Related papers (2023-07-10T03:06:45Z) - ESPT: A Self-Supervised Episodic Spatial Pretext Task for Improving
Few-Shot Learning [16.859375666701]
We propose to augment the few-shot learning objective with a novel self-supervised Episodic Spatial Pretext Task (ESPT).
Our ESPT objective is defined as maximizing the local spatial relationship consistency between the original episode and the transformed one.
Our ESPT method achieves new state-of-the-art performance for few-shot image classification on three mainstay benchmark datasets.
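The summary gives the objective only in words; a hypothetical rendering of a local spatial relationship consistency loss is sketched below (the cosine-relation choice and MSE penalty are assumptions, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def spatial_relation_consistency(feat_orig, feat_aug):
    """Hypothetical ESPT-style objective: encourage the pairwise relations
    among local spatial features to match between an original episode and
    its transformed version.
    feat_*: (B, C, H, W) backbone feature maps of matching size."""
    f1 = F.normalize(feat_orig.flatten(2), dim=1)  # (B, C, HW), unit channel vectors
    f2 = F.normalize(feat_aug.flatten(2), dim=1)
    rel1 = f1.transpose(1, 2) @ f1  # (B, HW, HW) cosine relations between locations
    rel2 = f2.transpose(1, 2) @ f2
    return F.mse_loss(rel1, rel2)   # minimizing this maximizes relation consistency
```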
arXiv Detail & Related papers (2023-04-26T04:52:08Z) - Visual-Semantic Contrastive Alignment for Few-Shot Image Classification [1.109560166867076]
Few-shot learning aims to train a model that can adapt to unseen visual classes with only a few labeled examples.
We introduce a contrastive alignment mechanism for visual and semantic feature vectors to learn much more generalized visual concepts.
Our method simply adds an auxiliary contrastive learning objective which captures the contextual knowledge of a visual category.
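As a concrete stand-in for such an auxiliary objective, the sketch below uses an InfoNCE-style loss to align visual features with their class's semantic embeddings; the paper's exact loss may differ:

```python
import torch
import torch.nn.functional as F

def visual_semantic_contrastive_loss(v, s, temperature=0.1):
    """Sketch of an auxiliary contrastive alignment objective: pull each
    visual feature toward the semantic embedding of its own class and away
    from the other classes' embeddings.
    v: (B, D) visual features; s: (B, D) matching semantic embeddings."""
    v = F.normalize(v, dim=-1)
    s = F.normalize(s, dim=-1)
    logits = v @ s.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, targets)   # diagonal pairs are positives
```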
arXiv Detail & Related papers (2022-10-20T03:59:40Z) - HIRL: A General Framework for Hierarchical Image Representation Learning [54.12773508883117]
We propose a general framework for Hierarchical Image Representation Learning (HIRL).
This framework aims to learn multiple semantic representations for each image, and these representations are structured to encode image semantics from fine-grained to coarse-grained.
Based on a probabilistic factorization, HIRL learns the most fine-grained semantics by an off-the-shelf image SSL approach and learns multiple coarse-grained semantics by a novel semantic path discrimination scheme.
arXiv Detail & Related papers (2022-05-26T05:13:26Z) - Wave-SAN: Wavelet based Style Augmentation Network for Cross-Domain
Few-Shot Learning [95.78635058475439]
Cross-domain few-shot learning aims at transferring knowledge from general nature images to novel domain-specific target categories.
This paper studies the problem of CD-FSL by spanning the style distributions of the source dataset.
To make our model robust to visual styles, the source images are augmented by swapping the styles of their low-frequency components with each other.
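A rough sketch of this style-swap idea, using a simple blur-based low/high frequency split and AdaIN-style statistics exchange in place of the paper's wavelet decomposition (both substitutions are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def swap_low_freq_style(x1, x2, eps=1e-5):
    """Hedged sketch: split each image into a low-frequency (blurred)
    component and a high-frequency residual, exchange the channel-wise
    mean/std "style" of the low-frequency parts, then recombine.
    x1, x2: (B, C, H, W) image batches with even H and W."""
    def split(x):
        low = F.avg_pool2d(x, 2)                           # crude low-pass
        low = F.interpolate(low, scale_factor=2, mode="nearest")
        return low, x - low                                # low + high residual

    def stylize(content, style):
        mu_c, sd_c = content.mean((2, 3), keepdim=True), content.std((2, 3), keepdim=True)
        mu_s, sd_s = style.mean((2, 3), keepdim=True), style.std((2, 3), keepdim=True)
        return (content - mu_c) / (sd_c + eps) * sd_s + mu_s

    low1, high1 = split(x1)
    low2, high2 = split(x2)
    return stylize(low1, low2) + high1, stylize(low2, low1) + high2
```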
arXiv Detail & Related papers (2022-03-15T05:36:41Z) - Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
arXiv Detail & Related papers (2021-09-22T18:34:14Z) - Aligning Visual Prototypes with BERT Embeddings for Few-Shot Learning [48.583388368897126]
Few-shot learning is the task of learning to recognize previously unseen categories of images.
We propose a method that takes into account the names of the image classes.
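A minimal sketch of one way such name information could be used, aligning mean support-set prototypes with projected class-name embeddings (the cosine loss and pre-projected embeddings are assumptions, not the paper's method):

```python
import torch
import torch.nn.functional as F

def prototype_name_alignment(support_feats, support_labels, name_embeds, n_way):
    """Sketch: pull each visual class prototype toward the embedding of its
    class name (e.g., from BERT, already projected to the visual dimension).
    support_feats: (N, D); support_labels: (N,) in [0, n_way);
    name_embeds: (n_way, D)."""
    protos = torch.stack([support_feats[support_labels == c].mean(0)
                          for c in range(n_way)])            # (n_way, D) prototypes
    sim = F.cosine_similarity(protos, name_embeds, dim=-1)   # per-class alignment
    return (1 - sim).mean()  # minimize to align prototypes with their names
```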
arXiv Detail & Related papers (2021-05-21T08:08:28Z) - Seed the Views: Hierarchical Semantic Alignment for Contrastive
Representation Learning [116.91819311885166]
We propose a hierarchical semantic alignment strategy via expanding the views generated by a single image to Cross-samples and Multi-level representation.
Our method, termed CsMl, can integrate multi-level visual representations across samples in a robust way.
arXiv Detail & Related papers (2020-12-04T17:26:24Z) - Match Them Up: Visually Explainable Few-shot Image Classification [27.867833878756553]
Few-shot learning is usually based on an assumption that the pre-trained knowledge can be obtained from base (seen) categories and can be well transferred to novel (unseen) categories.
In this paper, we reveal a new way to perform FSL for image classification, using visual representations from the backbone model and weights generated by a newly-emerged explainable classifier.
Experimental results prove that the proposed method can achieve both good accuracy and satisfactory explainability on three mainstream datasets.
arXiv Detail & Related papers (2020-11-25T05:47:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.