Generalized Zero-Shot Learning using Multimodal Variational Auto-Encoder
with Semantic Concepts
- URL: http://arxiv.org/abs/2106.14082v1
- Date: Sat, 26 Jun 2021 20:08:37 GMT
- Title: Generalized Zero-Shot Learning using Multimodal Variational Auto-Encoder
with Semantic Concepts
- Authors: Nihar Bendre, Kevin Desai and Peyman Najafirad
- Abstract summary: Recent techniques try to learn a cross-modal mapping between the semantic space and the image space.
We propose a Multimodal Variational Auto-Encoder (M-VAE) which can learn the shared latent space of image features and the semantic space.
Our results show that our proposed model outperforms the current state-of-the-art approaches for generalized zero-shot learning.
- Score: 0.9054540533394924
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the ever-increasing amount of data, a central challenge in multimodal
learning is the limited availability of labelled samples. For the task of
classification, techniques such as meta-learning, zero-shot learning, and
few-shot learning showcase the ability to learn information about novel classes
based on prior knowledge. Recent techniques try to learn a cross-modal mapping
between the semantic space and the image space. However, they tend to ignore
the local and global semantic knowledge. To overcome this problem, we propose a
Multimodal Variational Auto-Encoder (M-VAE) which can learn the shared latent
space of image features and the semantic space. In our approach we concatenate
multimodal data to a single embedding before passing it to the VAE for learning
the latent space. We propose the use of a multi-modal loss during the
reconstruction of the feature embedding through the decoder. Our approach is
capable of correlating modalities and exploiting the local and global semantic
knowledge for novel sample predictions. Our experimental results using an MLP
classifier on four benchmark datasets show that our proposed model outperforms
the current state-of-the-art approaches for generalized zero-shot learning.
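The pipeline described in the abstract (concatenate the modalities, encode them into a shared latent space, then reconstruct with a multi-modal loss) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the layer sizes, the exact form of the loss, or its weights, so the dimensions, the squared-error reconstruction terms, and the names `mvae_loss`, `w_img`, `w_sem`, and `beta` are all assumptions made for illustration. The sketch runs only a forward pass with random weights and omits training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes (not from the paper): image features, semantic attributes,
# hidden layer, and shared latent space.
IMG_DIM, SEM_DIM, HID_DIM, LATENT_DIM = 2048, 85, 256, 64


def init_linear(n_in, n_out):
    """Small random weights and a zero bias for one dense layer."""
    return rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out)


params = {
    "enc": init_linear(IMG_DIM + SEM_DIM, HID_DIM),
    "mu": init_linear(HID_DIM, LATENT_DIM),
    "logvar": init_linear(HID_DIM, LATENT_DIM),
    "dec_h": init_linear(LATENT_DIM, HID_DIM),
    "dec_out": init_linear(HID_DIM, IMG_DIM + SEM_DIM),
}


def linear(x, name):
    """Apply the dense layer stored under `name`."""
    w, b = params[name]
    return x @ w + b


def mvae_loss(img_feat, sem_feat, w_img=1.0, w_sem=1.0, beta=1.0):
    """Forward pass of a hypothetical multimodal VAE and its loss terms."""
    # 1. Concatenate both modalities into a single embedding.
    x = np.concatenate([img_feat, sem_feat], axis=1)

    # 2. Encode to the parameters of a diagonal Gaussian over the latent space.
    h = np.maximum(linear(x, "enc"), 0.0)  # ReLU
    mu, logvar = linear(h, "mu"), linear(h, "logvar")

    # 3. Reparameterization trick: z = mu + sigma * eps.
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

    # 4. Decode, then split the reconstruction back into its two modalities.
    recon = linear(np.maximum(linear(z, "dec_h"), 0.0), "dec_out")
    recon_img, recon_sem = recon[:, :IMG_DIM], recon[:, IMG_DIM:]

    # 5. Assumed multi-modal loss: one squared-error term per modality
    #    plus the KL divergence to a standard normal prior.
    loss_img = np.mean(np.sum((recon_img - img_feat) ** 2, axis=1))
    loss_sem = np.mean(np.sum((recon_sem - sem_feat) ** 2, axis=1))
    kl = np.mean(-0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1))

    total = w_img * loss_img + w_sem * loss_sem + beta * kl
    return total, {"img": loss_img, "sem": loss_sem, "kl": kl, "z": z}


# Usage with random stand-ins for a batch of 4 image/attribute pairs.
img = rng.random((4, IMG_DIM))
sem = rng.random((4, SEM_DIM))
total, parts = mvae_loss(img, sem)
print(parts["z"].shape, float(total))
```

Keeping a separate reconstruction term per modality is one plausible reading of "multi-modal loss": it lets the two modalities be weighted independently, so the much larger image feature vector does not dominate the shorter semantic vector. In a generalized zero-shot setting, the latent codes `z` would then serve as inputs to a downstream classifier such as the MLP mentioned in the abstract.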
Related papers
- TOMCAT: Test-time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning [35.14123452166428]
Compositional Zero-Shot Learning aims to recognize novel attribute-object compositions based on the knowledge learned from seen ones.
Existing methods suffer from performance degradation caused by the distribution shift of label space at test time.
We propose a novel approach that accumulates comprehensive knowledge in both textual and visual modalities to update multimodal prototypes at test time.
arXiv Detail & Related papers (2025-10-23T03:20:29Z)
- Connecting Giants: Synergistic Knowledge Transfer of Large Multimodal Models for Few-Shot Learning [61.73934102302588]
Few-shot learning addresses the challenge of classifying novel classes with limited training samples.
We propose a novel framework, Synergistic Knowledge Transfer, which effectively transfers diverse and complementary knowledge from large multimodal models.
We show that SynTrans, even when paired with a simple few-shot vision encoder, significantly outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2025-10-13T08:06:23Z)
- Can multimodal representation learning by alignment preserve modality-specific information? [2.0816054646359805]
Multimodal representation learning techniques leverage the spatial alignment between satellite data from different modalities acquired over the same geographic area.
We show, under simplifying assumptions, when alignment strategies fundamentally lead to an information loss.
We hope to support new developments in contrastive learning for the combination of multimodal satellite data.
arXiv Detail & Related papers (2025-09-22T16:06:10Z)
- A Zero-shot Learning Method Based on Large Language Models for Multi-modal Knowledge Graph Embedding [8.56384109338971]
Zero-shot learning (ZL) is crucial for tasks involving unseen categories, such as natural language processing, image classification, and cross-lingual transfer.
We propose ZSLLM, a framework for zero-shot embedding learning of MMKGs using large language models (LLMs).
arXiv Detail & Related papers (2025-03-10T11:38:21Z)
- Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning [79.46570165281084]
We propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods.
MulKI achieves this through four stages, including Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections.
Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks.
arXiv Detail & Related papers (2024-11-11T07:36:19Z)
- Reinforcement Learning Based Multi-modal Feature Fusion Network for Novel Class Discovery [47.28191501836041]
In this paper, we employ a Reinforcement Learning framework to simulate the cognitive processes of humans.
We also deploy a Member-to-Leader Multi-Agent framework to extract and fuse features from multi-modal information.
We demonstrate the performance of our approach in both the 3D and 2D domains by employing the OS-MN40, OS-MN40-Miss, and Cifar10 datasets.
arXiv Detail & Related papers (2023-08-26T07:55:32Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- Multi-View Class Incremental Learning [57.14644913531313]
Multi-view learning (MVL) has gained great success in integrating information from multiple perspectives of a dataset to improve downstream task performance.
This paper investigates a novel paradigm called multi-view class incremental learning (MVCIL), where a single model incrementally classifies new classes from a continual stream of views.
arXiv Detail & Related papers (2023-06-16T08:13:41Z)
- Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph Propagation [68.13453771001522]
We propose a multimodal intensive ZSL framework that matches regions of images with corresponding semantic embeddings.
We conduct extensive experiments and evaluate our model on large-scale real-world data.
arXiv Detail & Related papers (2023-06-14T13:07:48Z)
- Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z)
- Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos [69.61522804742427]
This paper proposes a self-supervised training framework that learns a common multimodal embedding space.
We extend the concept of instance-level contrastive learning with a multimodal clustering step to capture semantic similarities across modalities.
The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains.
arXiv Detail & Related papers (2021-04-26T15:55:01Z)