Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning
- URL: http://arxiv.org/abs/2404.07713v2
- Date: Mon, 22 Jul 2024 10:09:39 GMT
- Title: Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning
- Authors: Shiming Chen, Wenjin Hou, Salman Khan, Fahad Shahbaz Khan,
- Abstract summary: We propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT).
ZSLViT first introduces semantic-embedded token learning to improve the visual-semantic correspondences via semantic enhancement.
Then, it fuses visual tokens with low semantic-visual correspondence to discard semantic-unrelated visual information for visual enhancement.
- Score: 56.65891462413187
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-shot learning (ZSL) recognizes unseen classes by conducting visual-semantic interactions that transfer semantic knowledge from seen classes to unseen ones, supported by semantic information (e.g., attributes). However, existing ZSL methods simply extract visual features with a pre-trained network backbone (i.e., a CNN or ViT), which fails to learn matched visual-semantic correspondences for representing semantic-related visual features because it lacks the guidance of semantic information, resulting in undesirable visual-semantic interactions. To tackle this issue, we propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT). ZSLViT mainly considers two properties throughout the network: i) explicitly discovering semantic-related visual representations, and ii) discarding semantic-unrelated visual information. Specifically, we first introduce semantic-embedded token learning, which improves visual-semantic correspondences via semantic enhancement and explicitly discovers semantic-related visual tokens with semantic-guided token attention. Then, we fuse visual tokens with low semantic-visual correspondence to discard semantic-unrelated visual information for visual enhancement. These two operations are integrated into successive encoders to progressively learn semantic-related visual representations for accurate visual-semantic interactions in ZSL. Extensive experiments show that ZSLViT achieves significant performance gains on three popular benchmark datasets, i.e., CUB, SUN, and AWA2. Code is available at: https://github.com/shiming-chen/ZSLViT .
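The two operations described in the abstract (semantic-guided token attention and low-correspondence token fusion) can be illustrated with a minimal PyTorch sketch. This is an illustrative approximation under assumed design choices (cosine scoring against a projected attribute vector, a fixed keep ratio, score-weighted fusion), not the authors' implementation; the module name and all hyperparameters are hypothetical, and the official code is at the linked repository.

```python
# Minimal sketch (NOT the authors' code) of semantic-guided token attention
# (keep the tokens that correspond to the semantics) and token fusion
# (merge the low-correspondence tokens into one summary token).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticGuidedTokenSelection(nn.Module):
    def __init__(self, embed_dim: int, semantic_dim: int, keep_ratio: float = 0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Project the attribute/semantic vector into the visual token space.
        self.semantic_proj = nn.Linear(semantic_dim, embed_dim)

    def forward(self, tokens: torch.Tensor, semantics: torch.Tensor) -> torch.Tensor:
        # tokens:    (B, N, D) patch tokens from a ViT encoder block
        # semantics: (B, S)    attribute vector (e.g., 312-dim for CUB)
        B, N, D = tokens.shape
        query = self.semantic_proj(semantics)                        # (B, D)

        # Semantic-guided scores: cosine similarity between each visual
        # token and the projected semantic query.
        scores = torch.einsum(
            "bnd,bd->bn", F.normalize(tokens, dim=-1), F.normalize(query, dim=-1)
        )                                                            # (B, N)

        # Keep the top-k semantic-related tokens...
        num_keep = max(1, int(self.keep_ratio * N))
        keep_idx = scores.topk(num_keep, dim=1).indices              # (B, k)
        kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

        # ...and fuse the remaining (semantic-unrelated) tokens into a
        # single token, weighted by their softmaxed scores.
        keep_mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
        keep_mask.scatter_(1, keep_idx, True)
        drop_scores = scores.masked_fill(keep_mask, float("-inf"))
        weights = drop_scores.softmax(dim=1).unsqueeze(-1)           # (B, N, 1)
        fused = (weights * tokens).sum(dim=1, keepdim=True)          # (B, 1, D)

        return torch.cat([kept, fused], dim=1)                       # (B, k + 1, D)
```

In the paper, operations of this kind are integrated into successive encoder blocks, so the token set is refined progressively as semantic-unrelated tokens are fused away.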
Related papers
- ZeroMamba: Exploring Visual State Space Model for Zero-Shot Learning [28.52949450389388]
Zero-shot learning (ZSL) aims to recognize unseen classes by transferring semantic knowledge from seen classes to unseen ones.
We propose ZeroMamba, a parameter-efficient framework, to advance ZSL.
ZeroMamba comprises three key components: Semantic-aware Local Projection (SLP), Global Representation Learning (GRL), and Semantic Fusion (SeF).
arXiv Detail & Related papers (2024-08-27T08:39:47Z)
- Dual Relation Mining Network for Zero-Shot Learning [48.89161627050706]
We propose a Dual Relation Mining Network (DRMN) to enable effective visual-semantic interactions and learn semantic relationships among attributes for knowledge transfer.
Specifically, we introduce a Dual Attention Block (DAB) for visual-semantic relationship mining, which enriches visual information by multi-level feature fusion.
For semantic relationship modeling, we utilize a Semantic Interaction Transformer (SIT) to enhance the generalization of attribute representations among images.
arXiv Detail & Related papers (2024-05-06T16:31:19Z)
- Progressive Semantic-Visual Mutual Adaption for Generalized Zero-Shot Learning [74.48337375174297]
Generalized Zero-Shot Learning (GZSL) identifies unseen categories by knowledge transferred from the seen domain.
We deploy the dual semantic-visual transformer module (DSVTM) to progressively model the correspondences between prototypes and visual features.
DSVTM devises an instance-motivated semantic encoder that learns instance-centric prototypes to adapt to different images, enabling unmatched semantic-visual pairs to be recast as matched ones.
arXiv Detail & Related papers (2023-03-27T15:21:43Z)
- Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR).
Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner.
Experiment results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets.
arXiv Detail & Related papers (2022-05-03T17:39:27Z)
- MSDN: Mutually Semantic Distillation Network for Zero-Shot Learning [28.330268557106912]
A key challenge of zero-shot learning (ZSL) is how to infer the latent semantic knowledge between visual and attribute features on seen classes.
We propose a Mutually Semantic Distillation Network (MSDN), which progressively distills the intrinsic semantic representations between visual and attribute features.
arXiv Detail & Related papers (2022-03-07T05:27:08Z)
- TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning [119.43299939907685]
Zero-shot learning (ZSL) tackles the novel class recognition problem by transferring semantic knowledge from seen classes to unseen ones.
Existing attention-based models learn only inferior region features in a single image because they rely solely on unidirectional attention.
We propose a cross attribute-guided Transformer network, termed TransZero++, to refine visual features and learn accurate attribute localization for semantic-augmented visual embedding representations.
arXiv Detail & Related papers (2021-12-16T05:49:51Z)
- Visual-Semantic Transformer for Scene Text Recognition [5.323568551229187]
We propose to model semantic and visual information jointly with a Visual-Semantic Transformer (VST).
The VST first explicitly extracts primary semantic information from visual feature maps.
The semantic information is then joined with the visual feature maps to form a pseudo multi-domain sequence.
arXiv Detail & Related papers (2021-12-02T02:59:56Z)
- Semantic Disentangling Generalized Zero-Shot Learning [50.259058462272435]
Generalized Zero-Shot Learning (GZSL) aims to recognize images from both seen and unseen categories.
In this paper, we propose a novel feature disentangling approach based on an encoder-decoder architecture.
The proposed model aims to distill quality semantic-consistent representations that capture intrinsic features of seen images.
arXiv Detail & Related papers (2021-01-20T05:46:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.