Attend and Enrich: Enhanced Visual Prompt for Zero-Shot Learning
- URL: http://arxiv.org/abs/2406.03032v2
- Date: Tue, 10 Dec 2024 04:37:06 GMT
- Title: Attend and Enrich: Enhanced Visual Prompt for Zero-Shot Learning
- Authors: Man Liu, Huihui Bai, Feng Li, Chunjie Zhang, Yunchao Wei, Tat-Seng Chua, Yao Zhao,
- Abstract summary: We propose AENet, which endows semantic information into the visual prompt to distill semantic-enhanced prompt for visual representation enrichment.
AENet comprises two key steps: 1) exploring the concept-harmonized tokens for the visual and attribute modalities, grounded on the modal-sharing token that represents consistent visual-semantic concepts; and 2) yielding semantic-enhanced prompt via the visual residual refinement unit with attribute consistency supervision.
- Score: 114.59476118365266
- License:
- Abstract: Zero-shot learning (ZSL) endeavors to transfer knowledge from seen categories to recognize unseen categories, which mostly relies on the semantic-visual interactions between image and attribute tokens. Recently, prompt learning has emerged in ZSL and demonstrated significant potential as it allows the zero-shot transfer of diverse visual concepts to downstream tasks. However, current methods explore the fixed adaption of learnable prompt on seen domains, which makes them over-emphasize the primary visual features observed during training, limiting their generalization capabilities to unseen domains. In this work, we propose AENet, which endows semantic information into the visual prompt to distill semantic-enhanced prompt for visual representation enrichment, enabling effective knowledge transfer for ZSL. AENet comprises two key steps: 1) exploring the concept-harmonized tokens for the visual and attribute modalities, grounded on the modal-sharing token that represents consistent visual-semantic concepts; and 2) yielding semantic-enhanced prompt via the visual residual refinement unit with attribute consistency supervision. These are further integrated with primary visual features to attend to semantic-related information for visual enhancement, thus strengthening transferable ability. Experimental results on three benchmarks show that our AENet outperforms existing state-of-the-art ZSL methods. The code is provided in the zip file of supplementary materials.
Related papers
- Dual Relation Mining Network for Zero-Shot Learning [48.89161627050706]
We propose a Dual Relation Mining Network (DRMN) to enable effective visual-semantic interactions and learn semantic relationship among attributes for knowledge transfer.
Specifically, we introduce a Dual Attention Block (DAB) for visual-semantic relationship mining, which enriches visual information by multi-level feature fusion.
For semantic relationship modeling, we utilize a Semantic Interaction Transformer (SIT) to enhance the generalization of attribute representations among images.
arXiv Detail & Related papers (2024-05-06T16:31:19Z) - CREST: Cross-modal Resonance through Evidential Deep Learning for Enhanced Zero-Shot Learning [48.46511584490582]
Zero-shot learning (ZSL) enables the recognition of novel classes by leveraging semantic knowledge transfer from known to unknown categories.
Real-world challenges such as distribution imbalances and attribute co-occurrence hinder the discernment of local variances in images.
We propose a bidirectional cross-modal ZSL approach CREST to overcome these challenges.
arXiv Detail & Related papers (2024-04-15T10:19:39Z) - Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning [56.65891462413187]
We propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT)
ZSLViT first introduces semantic-embedded token learning to improve the visual-semantic correspondences via semantic enhancement.
Then, we fuse low semantic-visual correspondence visual tokens to discard the semantic-unrelated visual information for visual enhancement.
arXiv Detail & Related papers (2024-04-11T12:59:38Z) - Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models [64.24227572048075]
We propose a Knowledge-Aware Prompt Tuning (KAPT) framework for vision-language models.
Our approach takes inspiration from human intelligence in which external knowledge is usually incorporated into recognizing novel categories of objects.
arXiv Detail & Related papers (2023-08-22T04:24:45Z) - SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for
Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z) - Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR)
Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner.
Experiment results show our model considerably improves upon the state of the arts in ZSAR, reaching encouraging top-1 accuracy on UCF101, HMDB51, and ActivityNet benchmark datasets.
arXiv Detail & Related papers (2022-05-03T17:39:27Z) - MSDN: Mutually Semantic Distillation Network for Zero-Shot Learning [28.330268557106912]
Key challenge of zero-shot learning (ZSL) is how to infer the latent semantic knowledge between visual and attribute features on seen classes.
We propose a Mutually Semantic Distillation Network (MSDN), which progressively distills the intrinsic semantic representations between visual and attribute features.
arXiv Detail & Related papers (2022-03-07T05:27:08Z) - Zero-Shot Learning Based on Knowledge Sharing [0.0]
Zero-Shot Learning (ZSL) is an emerging research that aims to solve the classification problems with very few training data.
This paper introduces knowledge sharing (KS) to enrich the representation of semantic features.
Based on KS, we apply a generative adversarial network to generate pseudo visual features from semantic features that are very close to the real visual features.
arXiv Detail & Related papers (2021-02-26T06:43:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.