ZeroMamba: Exploring Visual State Space Model for Zero-Shot Learning
- URL: http://arxiv.org/abs/2408.14868v1
- Date: Tue, 27 Aug 2024 08:39:47 GMT
- Title: ZeroMamba: Exploring Visual State Space Model for Zero-Shot Learning
- Authors: Wenjin Hou, Dingjie Fu, Kun Li, Shiming Chen, Hehe Fan, Yi Yang
- Abstract summary: Zero-shot learning (ZSL) aims to recognize unseen classes by transferring semantic knowledge from seen classes to unseen ones.
We propose a parameter-efficient ZSL framework called ZeroMamba to advance ZSL.
ZeroMamba comprises three key components: Semantic-aware Local Projection (SLP), Global Representation Learning (GRL), and Semantic Fusion (SeF).
- Score: 28.52949450389388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-shot learning (ZSL) aims to recognize unseen classes by transferring semantic knowledge from seen classes to unseen ones, guided by semantic information. To this end, existing works have demonstrated remarkable performance by utilizing global visual features from Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs) for visual-semantic interactions. Due to the limited receptive fields of CNNs and the quadratic complexity of ViTs, however, these visual backbones achieve suboptimal visual-semantic interactions. In this paper, motivated by the visual state space model (i.e., Vision Mamba), which is capable of capturing long-range dependencies and modeling complex visual dynamics, we propose a parameter-efficient ZSL framework called ZeroMamba to advance ZSL. Our ZeroMamba comprises three key components: Semantic-aware Local Projection (SLP), Global Representation Learning (GRL), and Semantic Fusion (SeF). Specifically, SLP integrates semantic embeddings to map visual features to local semantic-related representations, while GRL encourages the model to learn global semantic representations. SeF combines these two semantic representations to enhance the discriminability of semantic features. We incorporate these designs into Vision Mamba, forming an end-to-end ZSL framework. As a result, the learned semantic representations are better suited for classification. Through extensive experiments on four prominent ZSL benchmarks, ZeroMamba demonstrates superior performance, significantly outperforming the state-of-the-art (i.e., CNN-based and ViT-based) methods under both conventional ZSL (CZSL) and generalized ZSL (GZSL) settings. Code is available at: https://anonymous.4open.science/r/ZeroMamba.
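To make the described pipeline more concrete, below is a minimal, hypothetical PyTorch sketch of how the three components could be wired on top of a visual backbone. The class names, dimensions, the attention-style pooling used for SLP, and the compatibility-score classifier are all illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn


class ZeroMambaSketch(nn.Module):
    """Hypothetical sketch of the SLP / GRL / SeF pipeline; all dims are assumptions."""

    def __init__(self, visual_dim=768, semantic_dim=312, num_classes=200):
        super().__init__()
        # Semantic-aware Local Projection (SLP), here reduced to a linear map
        # from patch-level visual features into the class-semantic space.
        self.slp = nn.Linear(visual_dim, semantic_dim)
        # Global Representation Learning (GRL): projects a pooled global
        # feature into the same semantic space.
        self.grl = nn.Sequential(
            nn.Linear(visual_dim, visual_dim),
            nn.ReLU(inplace=True),
            nn.Linear(visual_dim, semantic_dim),
        )
        # Semantic Fusion (SeF): a learnable mixing weight between the two views.
        self.fuse_weight = nn.Parameter(torch.tensor(0.0))
        # Class semantic prototypes (e.g. attribute vectors), fixed per dataset.
        self.register_buffer("class_embeddings", torch.randn(num_classes, semantic_dim))

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, visual_dim) produced by a visual backbone
        # (Vision Mamba in the paper; any token-producing backbone works here).
        patch_sem = self.slp(patch_tokens)                           # (B, N, semantic_dim)
        # Semantic-aware pooling: weight patches by their best compatibility with
        # any class prototype (a loose stand-in for SLP's use of semantic embeddings).
        attn = (patch_sem @ self.class_embeddings.t()).max(dim=-1).values
        attn = attn.softmax(dim=1).unsqueeze(-1)                     # (B, N, 1)
        local_sem = (attn * patch_sem).sum(dim=1)                    # (B, semantic_dim)
        global_sem = self.grl(patch_tokens.mean(dim=1))              # (B, semantic_dim)
        w = torch.sigmoid(self.fuse_weight)
        fused = w * local_sem + (1.0 - w) * global_sem               # SeF
        # Compatibility scores against class prototypes serve as logits.
        return fused @ self.class_embeddings.t()                     # (B, num_classes)


if __name__ == "__main__":
    model = ZeroMambaSketch()
    tokens = torch.randn(2, 196, 768)   # dummy backbone output
    print(model(tokens).shape)          # torch.Size([2, 200])
```

In this sketch, classification is done by scoring the fused semantic representation against per-class semantic prototypes, which is the standard embedding-based ZSL recipe and consistent with the abstract's claim that the learned semantic representations are used directly for classification.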
Related papers
- Epsilon: Exploring Comprehensive Visual-Semantic Projection for Multi-Label Zero-Shot Learning [23.96220607033524]
This paper investigates the challenging problem of zero-shot learning in the multi-label scenario (MLZSL), where a model is trained to recognize multiple unseen classes within a sample based on seen classes and auxiliary knowledge.
We propose a novel and comprehensive visual-semantic framework for MLZSL, dubbed Epsilon, to fully make use of such properties.
arXiv Detail & Related papers (2024-08-22T09:45:24Z)
- Visual-Augmented Dynamic Semantic Prototype for Generative Zero-Shot Learning [56.16593809016167]
We propose a novel Visual-Augmented Dynamic Semantic prototype method (termed VADS) to boost the generator to learn accurate semantic-visual mapping.
VADS consists of two modules: (1) the Visual-aware Domain Knowledge Learning module (VDKL), which learns the local bias and global prior of the visual features; these replace pure Gaussian noise to provide richer prior noise information; (2) the Vision-Oriented Semantic Updation module (VOSU), which updates the semantic prototype according to the visual representations of the samples.
arXiv Detail & Related papers (2024-04-23T07:39:09Z)
- Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning [56.65891462413187]
We propose a progressive semantic-guided vision transformer for zero-shot learning, dubbed ZSLViT.
ZSLViT first introduces semantic-embedded token learning to improve the visual-semantic correspondences via semantic enhancement.
Then, ZSLViT fuses visual tokens with low semantic-visual correspondence to discard semantic-unrelated visual information for visual enhancement.
arXiv Detail & Related papers (2024-04-11T12:59:38Z)
- Prompting Language-Informed Distribution for Compositional Zero-Shot Learning [73.49852821602057]
The compositional zero-shot learning (CZSL) task aims to recognize unseen compositional visual concepts.
We propose a model that prompts the language-informed distribution, termed PLID, for this task.
Experimental results on MIT-States, UT-Zappos, and C-GQA datasets show the superior performance of the PLID to the prior arts.
arXiv Detail & Related papers (2023-05-23T18:00:22Z)
- FREE: Feature Refinement for Generalized Zero-Shot Learning [86.41074134041394]
Generalized zero-shot learning (GZSL) has achieved significant progress, with many efforts dedicated to overcoming the problems of visual-semantic domain gap and seen-unseen bias.
Most existing methods directly use feature extraction models trained on ImageNet alone, ignoring the cross-dataset bias between ImageNet and GZSL benchmarks.
We propose a simple yet effective GZSL method, termed feature refinement for generalized zero-shot learning (FREE), to tackle this problem.
arXiv Detail & Related papers (2021-07-29T08:11:01Z)
- What Remains of Visual Semantic Embeddings [0.618778092044887]
We introduce the tiered-ImageNet split to the ZSL task to avoid the structural flaws in the standard ImageNet benchmark.
We build a unified framework for ZSL with contrastive learning as pre-training, which guarantees no semantic information leakage.
Our work enables fair evaluation of visual semantic embedding models in a ZSL setting where semantic inference is decisive.
arXiv Detail & Related papers (2021-07-26T06:55:11Z)
- Goal-Oriented Gaze Estimation for Zero-Shot Learning [62.52340838817908]
We introduce a novel goal-oriented gaze estimation module (GEM) to improve the discriminative attribute localization.
We aim to predict the actual human gaze location to obtain the visual attention regions for recognizing a novel object guided by attribute descriptions.
This work highlights the promising benefits of collecting human gaze datasets and of automatic gaze estimation algorithms for high-level computer vision tasks.
arXiv Detail & Related papers (2021-03-05T02:14:57Z)
- Zero-Shot Learning Based on Knowledge Sharing [0.0]
Zero-Shot Learning (ZSL) is an emerging research area that aims to solve classification problems with very little training data.
This paper introduces knowledge sharing (KS) to enrich the representation of semantic features.
Based on KS, we apply a generative adversarial network to generate, from semantic features, pseudo visual features that closely approximate the real visual features.
arXiv Detail & Related papers (2021-02-26T06:43:29Z)
- Information Bottleneck Constrained Latent Bidirectional Embedding for Zero-Shot Learning [59.58381904522967]
We propose a novel embedding-based generative model with a tight visual-semantic coupling constraint.
We learn a unified latent space that calibrates the embedded parametric distributions of both visual and semantic spaces.
Our method can be easily extended to transductive ZSL setting by generating labels for unseen images.
arXiv Detail & Related papers (2020-09-16T03:54:12Z)
- Leveraging Seen and Unseen Semantic Relationships for Generative Zero-Shot Learning [14.277015352910674]
We propose a generative model, LsrGAN, that explicitly performs knowledge transfer by incorporating a novel Semantic Regularized Loss (SR-Loss).
Experiments on seven benchmark datasets demonstrate the superiority of LsrGAN over previous state-of-the-art approaches.
arXiv Detail & Related papers (2020-07-19T01:25:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.