Rich Semantics Improve Few-shot Learning
- URL: http://arxiv.org/abs/2104.12709v1
- Date: Mon, 26 Apr 2021 16:48:27 GMT
- Title: Rich Semantics Improve Few-shot Learning
- Authors: Mohamed Afham, Salman Khan, Muhammad Haris Khan, Muzammal Naseer,
Fahad Shahbaz Khan
- Abstract summary: We show that by using 'class-level' language descriptions, which can be acquired with minimal annotation cost, we can improve few-shot learning performance.
We develop a Transformer-based forward and backward encoding mechanism to relate visual and semantic tokens.
- Score: 49.11659525563236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human learning benefits from multi-modal inputs that often appear as rich
semantics (e.g., description of an object's attributes while learning about
it). This enables us to learn generalizable concepts from very limited visual
examples. However, current few-shot learning (FSL) methods use numerical class
labels to denote object classes, which do not provide rich semantic meaning
about the learned concepts. In this work, we show that by using 'class-level'
language descriptions, which can be acquired with minimal annotation cost, we
can improve FSL performance. Given a support set and queries, our main idea
is to create a bottleneck visual feature (hybrid prototype) which is then used
to generate language descriptions of the classes as an auxiliary task during
training. We develop a Transformer-based forward and backward encoding
mechanism to relate visual and semantic tokens that can encode intricate
relationships between the two modalities. Forcing the prototypes to retain
semantic information about the class description acts as a regularizer on the
visual features, improving their generalization to novel classes at inference.
Furthermore, this strategy imposes a human prior on the learned
representations, ensuring that the model is faithfully relating visual and
semantic concepts, thereby improving model interpretability. Our experiments on
four datasets and ablation studies show the benefit of effectively modeling
rich semantics for FSL.
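A minimal PyTorch sketch of this idea (not the authors' released code) is shown below: class prototypes averaged from the support features are used both to classify queries and to condition a small Transformer decoder that must reproduce the class-level description, so the captioning loss regularizes the visual features. The feature size, vocabulary size, and single decoder layer are illustrative assumptions, not the paper's exact forward and backward encoding mechanism.

```python
# A hedged sketch of prototype-based few-shot classification with an auxiliary
# class-description generation loss; module names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticallyRegularizedProtoNet(nn.Module):
    def __init__(self, feat_dim=512, vocab_size=10000, num_heads=8):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, feat_dim)
        # Single Transformer decoder layer that attends to the class prototype.
        self.decoder = nn.TransformerDecoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.lm_head = nn.Linear(feat_dim, vocab_size)

    def forward(self, support_feats, support_labels, query_feats, query_labels,
                desc_tokens):
        # support_feats: (n_way * k_shot, feat_dim) from a visual backbone
        # desc_tokens:   (n_way, desc_len) token ids of each class description
        n_way = desc_tokens.size(0)

        # Prototypes: mean of the support features of each class.
        protos = torch.stack(
            [support_feats[support_labels == c].mean(dim=0) for c in range(n_way)])

        # Standard few-shot classification loss on the queries.
        logits = -torch.cdist(query_feats, protos) ** 2
        cls_loss = F.cross_entropy(logits, query_labels)

        # Auxiliary task: generate the class description from its prototype.
        tgt = self.token_embed(desc_tokens)            # (n_way, desc_len, feat_dim)
        causal = torch.triu(                           # prevent peeking ahead
            torch.full((tgt.size(1), tgt.size(1)), float('-inf'),
                       device=tgt.device), diagonal=1)
        dec_out = self.decoder(tgt, protos.unsqueeze(1), tgt_mask=causal)
        word_logits = self.lm_head(dec_out[:, :-1])    # predict the next token
        desc_loss = F.cross_entropy(
            word_logits.reshape(-1, word_logits.size(-1)),
            desc_tokens[:, 1:].reshape(-1))

        # The description loss acts as a regularizer on the prototypes.
        return cls_loss + desc_loss
```

Because the gradient of the description loss flows back into the support features, the visual backbone is pushed to encode attribute-level information that generalizes to novel classes, which is the regularization effect described in the abstract.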
Related papers
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for
Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z) - Bidirectional Representations for Low Resource Spoken Language
Understanding [39.208462511430554]
We propose a representation model to encode speech in bidirectional rich encodings.
The approach uses a masked language modelling objective to learn the representations.
We show that the performance of the resulting encodings is better than comparable models on multiple datasets.
arXiv Detail & Related papers (2022-11-24T17:05:16Z) - Visual-Semantic Contrastive Alignment for Few-Shot Image Classification [1.109560166867076]
Few-shot learning aims to train a model that can adapt to unseen visual classes with only a few labeled examples.
We introduce a contrastive alignment mechanism for visual and semantic feature vectors to learn much more generalized visual concepts.
Our method simply adds an auxiliary contrastive learning objective that captures the contextual knowledge of a visual category (a generic sketch of this kind of visual-semantic alignment loss appears after this list).
arXiv Detail & Related papers (2022-10-20T03:59:40Z) - Semantic Cross Attention for Few-shot Learning [9.529264466445236]
We propose a multi-task learning approach that treats the semantic features of label text as an auxiliary task.
Our proposed model uses word-embedding representations as semantic features to help train the embedding network, and a semantic cross-attention module to bridge the semantic features into the visual modality.
arXiv Detail & Related papers (2022-10-12T15:24:59Z) - Self-Supervised Visual Representation Learning with Semantic Grouping [50.14703605659837]
We tackle the problem of learning visual representations from unlabeled scene-centric data.
We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
arXiv Detail & Related papers (2022-05-30T17:50:59Z) - Multi-Modal Few-Shot Object Detection with Meta-Learning-Based
Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z) - VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning [113.50220968583353]
We propose to discover semantic embeddings containing discriminative visual properties for zero-shot learning.
Our model visually divides a set of images from seen classes into clusters of local image regions according to their visual similarity.
We demonstrate that our visually-grounded semantic embeddings further improve performance over word embeddings across various ZSL models by a large margin.
arXiv Detail & Related papers (2022-03-20T03:49:02Z) - Semantic Disentangling Generalized Zero-Shot Learning [50.259058462272435]
Generalized Zero-Shot Learning (GZSL) aims to recognize images from both seen and unseen categories.
In this paper, we propose a novel feature disentangling approach based on an encoder-decoder architecture.
The proposed model aims to distill high-quality, semantically consistent representations that capture the intrinsic features of seen images.
arXiv Detail & Related papers (2021-01-20T05:46:21Z) - Webly Supervised Semantic Embeddings for Large Scale Zero-Shot Learning [8.472636806304273]
Zero-shot learning (ZSL) makes object recognition in images possible in the absence of visual training data for some of the classes in a dataset.
We focus on the problem of semantic class prototype design for large scale ZSL.
We investigate the use of noisy textual metadata associated with photos as text collections.
arXiv Detail & Related papers (2020-08-06T21:33:44Z)
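Several of the related methods above add an auxiliary objective that contrastively aligns visual features with class-level text embeddings. The sketch below shows a generic symmetric InfoNCE-style form of such a loss; the temperature value and the one-text-embedding-per-class pairing are illustrative assumptions, not any single paper's exact formulation.

```python
# A hedged, generic visual-semantic contrastive alignment loss (InfoNCE-style).
import torch
import torch.nn.functional as F


def visual_semantic_alignment_loss(visual_feats, text_feats, temperature=0.07):
    """visual_feats, text_feats: (n_classes, dim); row i of each describes class i."""
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.t() / temperature            # scaled cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy: match each visual feature to its own class text
    # and vice versa; off-diagonal pairs act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```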
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.