DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning
- URL: http://arxiv.org/abs/2207.01328v1
- Date: Mon, 4 Jul 2022 11:12:12 GMT
- Title: DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning
- Authors: Zhuo Chen, Yufeng Huang, Jiaoyan Chen, Yuxia Geng, Wen Zhang, Yin
Fang, Jeff Z. Pan, Wenting Song, Huajun Chen
- Abstract summary: We present a transformer-based end-to-end ZSL method named DUET.
We develop a cross-modal semantic grounding network to investigate the model's capability of disentangling semantic attributes from the images.
We find that DUET often achieves state-of-the-art performance, that its components are effective, and that its predictions are interpretable.
- Score: 37.48292304239107
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-shot learning (ZSL) aims to predict unseen classes whose samples have
never appeared during training, often utilizing additional semantic information
(a.k.a. side information) to bridge the training (seen) classes and the unseen
classes. One of the most effective and widely used kinds of semantic information
for zero-shot image classification is attributes, i.e., annotations of
class-level visual characteristics. However, due to the shortage of
fine-grained annotations and to attribute imbalance and co-occurrence, current
methods often fail to discriminate the subtle visual distinctions between
images, which limits their performance. In this paper, we present a
transformer-based end-to-end ZSL method named DUET, which integrates latent
semantic knowledge from the pretrained language models (PLMs) via a
self-supervised multi-modal learning paradigm. Specifically, we (1) develop a
cross-modal semantic grounding network to investigate the model's capability of
disentangling semantic attributes from images, (2) apply an attribute-level
contrastive learning strategy to further enhance the model's discrimination of
fine-grained visual characteristics against attribute co-occurrence and
imbalance, and (3) propose a multi-task learning policy for balancing the
multi-modal objectives. With extensive experiments on three standard ZSL
benchmarks and a knowledge-graph-equipped ZSL benchmark, we find that DUET
often achieves state-of-the-art performance, that its components are
effective, and that its predictions are interpretable.
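The attribute-level contrastive strategy in point (2) can be read as an InfoNCE-style objective computed per attribute rather than per image. The PyTorch snippet below is a minimal sketch of that reading, not the authors' released code: the function name, the tensor shapes, and the assumption that each attribute has already been grounded to an image-level embedding (e.g. by the cross-modal grounding network) are illustrative.

```python
# Minimal sketch (not the DUET implementation): an attribute-level InfoNCE loss.
# Assumes each attribute has already been grounded to an embedding per image;
# the positive shares the anchor's attribute value, the negatives differ in it.
import torch
import torch.nn.functional as F

def attribute_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """anchor: (d,), positive: (d,), negatives: (k, d) attribute embeddings."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_sim = (anchor * positive).sum(-1, keepdim=True)   # (1,)
    neg_sim = negatives @ anchor                          # (k,)
    logits = torch.cat([pos_sim, neg_sim]) / temperature  # (1 + k,)

    # The positive always sits at index 0, so the target label is 0.
    return F.cross_entropy(logits.unsqueeze(0),
                           torch.zeros(1, dtype=torch.long))
```

Under this sketch, negatives would be drawn from images that share most of the anchor's other attributes but differ in the target one, which is one plausible way to push back against the attribute co-occurrence and imbalance discussed above.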
Related papers
- Envisioning Class Entity Reasoning by Large Language Models for Few-shot Learning [13.68867780184022]
Few-shot learning aims to recognize new concepts using a limited number of visual samples.
Our framework incorporates both the abstract class semantics and the concrete class entities extracted from Large Language Models (LLMs).
For the challenging one-shot setting, our approach, utilizing the ResNet-12 backbone, achieves an average improvement of 1.95% over the second-best competitor.
arXiv Detail & Related papers (2024-08-22T15:10:20Z)
- CREST: Cross-modal Resonance through Evidential Deep Learning for Enhanced Zero-Shot Learning [48.46511584490582]
Zero-shot learning (ZSL) enables the recognition of novel classes by leveraging semantic knowledge transfer from known to unknown categories.
Real-world challenges such as distribution imbalances and attribute co-occurrence hinder the discernment of local variances in images.
We propose a bidirectional cross-modal ZSL approach CREST to overcome these challenges.
arXiv Detail & Related papers (2024-04-15T10:19:39Z)
- Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning [45.25602203155762]
Self-Supervised Contrastive Learning has proven effective in deriving high-quality representations from unlabeled data.
A major challenge that hinders both unimodal and multimodal contrastive learning is feature suppression.
We propose a novel model-agnostic Multistage Contrastive Learning framework.
arXiv Detail & Related papers (2024-02-19T04:13:33Z)
- A Probabilistic Model Behind Self-Supervised Learning [53.64989127914936]
In self-supervised learning (SSL), representations are learned via an auxiliary task without annotated labels.
We present a generative latent variable model for self-supervised learning.
We show that several families of discriminative SSL, including contrastive methods, induce a comparable distribution over representations.
arXiv Detail & Related papers (2024-02-02T13:31:17Z)
- Zero-Shot Learning by Harnessing Adversarial Samples [52.09717785644816]
We propose a novel Zero-Shot Learning (ZSL) approach by Harnessing Adversarial Samples (HAS).
HAS advances ZSL through adversarial training which takes into account three crucial aspects.
We demonstrate the effectiveness of our adversarial samples approach in both ZSL and Generalized Zero-Shot Learning (GZSL) scenarios.
arXiv Detail & Related papers (2023-08-01T06:19:13Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- Multi-Head Self-Attention via Vision Transformer for Zero-Shot Learning [11.66422653137002]
We propose an attention-based model in the problem settings of Zero-Shot Learning to learn attributes useful for unseen class recognition.
Our method uses an attention mechanism adapted from Vision Transformer to capture and learn discriminative attributes by splitting images into small patches; a minimal sketch of this patch-attention idea follows the list.
arXiv Detail & Related papers (2021-07-30T19:08:44Z)
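The last entry above describes splitting images into small patches and applying multi-head self-attention to learn discriminative attributes. The sketch below illustrates that general recipe only; the module name, layer sizes, and the 312-dimensional attribute output (CUB-style) are assumptions rather than the paper's actual architecture.

```python
# Illustrative sketch of ViT-style patch attention for attribute prediction;
# not taken from any of the papers listed above.
import torch
import torch.nn as nn

class PatchAttributeEncoder(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=256,
                 num_heads=4, num_attributes=312):  # 312 = CUB attribute count
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Non-overlapping patch embedding, as in ViT.
        self.to_patches = nn.Conv2d(3, dim, kernel_size=patch_size,
                                    stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_attributes = nn.Linear(dim, num_attributes)

    def forward(self, images):            # images: (B, 3, H, W)
        x = self.to_patches(images)       # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)  # (B, N, dim) patch tokens
        x = x + self.pos_embed
        x, _ = self.attn(x, x, x)         # multi-head self-attention
        x = x.mean(dim=1)                 # pool over patch tokens
        return self.to_attributes(x)      # per-image attribute scores
```

Zero-shot classification would then score each unseen class by comparing the predicted attribute vector against that class's attribute annotations, e.g. via cosine similarity.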
This list is automatically generated from the titles and abstracts of the papers on this site.