AVGZSLNet: Audio-Visual Generalized Zero-Shot Learning by Reconstructing
Label Features from Multi-Modal Embeddings
- URL: http://arxiv.org/abs/2005.13402v3
- Date: Mon, 23 Nov 2020 06:13:16 GMT
- Title: AVGZSLNet: Audio-Visual Generalized Zero-Shot Learning by Reconstructing
Label Features from Multi-Modal Embeddings
- Authors: Pratik Mazumder, Pravendra Singh, Kranti Kumar Parida, Vinay P.
Namboodiri
- Abstract summary: We propose a novel approach for generalized zero-shot learning in a multi-modal setting.
We use the semantic relatedness of text embeddings as a means for zero-shot learning by aligning audio and video embeddings with the corresponding class label text feature space.
- Score: 37.3282534461213
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel approach for generalized zero-shot learning
in a multi-modal setting, where we have novel classes of audio/video during
testing that are not seen during training. We use the semantic relatedness of
text embeddings as a means for zero-shot learning by aligning audio and video
embeddings with the corresponding class label text feature space. Our approach
uses a cross-modal decoder and a composite triplet loss. The cross-modal
decoder enforces a constraint that the class label text features can be
reconstructed from the audio and video embeddings of data points. This helps
the audio and video embeddings to move closer to the class label text
embedding. The composite triplet loss makes use of the audio, video, and text
embeddings. It brings embeddings from the same class closer and pushes apart
embeddings from different classes in a multi-modal setting, which helps the
network perform better on the multi-modal zero-shot learning task.
Importantly, our multi-modal zero-shot learning approach works even if a
modality is missing at test time. We test our approach on the generalized
zero-shot classification and retrieval tasks and show that our approach
outperforms other models in the presence of a single modality as well as in the
presence of multiple modalities. We validate our approach through comparisons
with previous approaches and various ablation studies.
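The abstract describes two training signals: a cross-modal decoder that reconstructs the class-label text features from the audio and video embeddings, and a composite triplet loss over audio, video, and text embeddings. The PyTorch sketch below is only an illustration of how such components could fit together; the layer sizes, margin, and equal loss weighting are assumptions, not the authors' implementation or hyperparameters.

```python
# Minimal sketch (not the authors' code) of the two components described in the
# abstract: a cross-modal decoder that reconstructs class-label text features
# from audio/video embeddings, and a composite triplet loss over audio, video,
# and text embeddings. All dimensions, the margin, and the equal loss weighting
# are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVGZSLSketch(nn.Module):
    def __init__(self, audio_dim=128, video_dim=512, text_dim=300, emb_dim=64):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.audio_enc = nn.Linear(audio_dim, emb_dim)
        self.video_enc = nn.Linear(video_dim, emb_dim)
        self.text_enc = nn.Linear(text_dim, emb_dim)
        # Cross-modal decoder: maps audio/video embeddings back to the
        # class-label text feature space.
        self.decoder = nn.Linear(emb_dim, text_dim)

    def forward(self, audio, video, text):
        return self.audio_enc(audio), self.video_enc(video), self.text_enc(text)

def composite_loss(model, audio, video, text, neg_audio, neg_video, neg_text, margin=1.0):
    """Reconstruction constraint plus cross-modal triplet terms (illustrative)."""
    a, v, t = model(audio, video, text)
    a_neg, v_neg, t_neg = model(neg_audio, neg_video, neg_text)

    # Cross-modal decoder constraint: the class-label text features should be
    # reconstructable from both the audio and the video embeddings, which pulls
    # those embeddings toward the text embedding of their class.
    recon = F.mse_loss(model.decoder(a), text) + F.mse_loss(model.decoder(v), text)

    # Composite triplet loss: bring same-class embeddings closer and push
    # different-class embeddings apart, mixing modalities as anchor/positive/negative.
    triplet = nn.TripletMarginLoss(margin=margin)
    trip = (triplet(t, a, a_neg) + triplet(t, v, v_neg) +
            triplet(a, t, t_neg) + triplet(v, t, t_neg))

    return recon + trip  # the paper may weight these terms differently
```

At test time, classification can then proceed by comparing the projected audio and/or video embeddings against the text embeddings of both seen and unseen class labels (e.g. by nearest-neighbour search), which is what allows the model to handle a missing modality and unseen classes.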
Related papers
- PALM: Few-Shot Prompt Learning for Audio Language Models [1.6177972328875514]
Audio-Language Models (ALMs) have recently achieved remarkable success in zero-shot audio recognition tasks.
We propose a novel method, Prompt Learning in Audio Language Models (PALM), which optimizes the feature space of the text encoder branch.
We demonstrate the effectiveness of our approach on 11 audio recognition datasets, and compare the results with three baselines in a few-shot learning setup.
arXiv Detail & Related papers (2024-09-29T22:06:07Z)
- Audio-visual Generalized Zero-shot Learning the Easy Way [20.60905505473906]
We introduce EZ-AVGZL, which aligns audio-visual embeddings with transformed text representations.
We conduct extensive experiments on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL benchmarks.
arXiv Detail & Related papers (2024-07-18T01:57:16Z)
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment [57.15449072423539]
We propose a training system, Open-modality Speech Recognition (OpenSR).
OpenSR enables modality transfer from one to any in three different settings.
It achieves highly competitive zero-shot performance compared to the existing few-shot and full-shot lip-reading methods.
arXiv Detail & Related papers (2023-06-10T11:04:10Z)
- Temporal and cross-modal attention for audio-visual zero-shot learning [38.02396786726476]
Generalised zero-shot learning for video classification requires understanding the relations between the audio and visual information.
We propose a multi-modal and Temporal Cross-attention Framework for audio-visual generalised zero-shot learning.
We show that our proposed framework, which ingests temporal features, yields state-of-the-art performance on the UCF-GZSL, VGGSound-GZSL, and ActivityNet-GZSL benchmarks for (generalised) zero-shot learning.
arXiv Detail & Related papers (2022-07-20T15:19:30Z)
- Clover: Towards A Unified Video-Language Alignment and Fusion Model [154.1070559563592]
We introduce Clover, a Correlated Video-Language pre-training method.
It improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task.
Clover establishes new state-of-the-art results on multiple downstream tasks.
arXiv Detail & Related papers (2022-07-16T09:38:52Z)
- Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z)
- Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language [38.02396786726476]
We propose to learn multi-modal representations from audio-visual data using cross-modal attention.
In our generalised audio-visual zero-shot learning setting, we include all the training classes in the test-time search space.
Due to the lack of a unified benchmark in this domain, we introduce a (generalised) zero-shot learning benchmark on three audio-visual datasets.
arXiv Detail & Related papers (2022-03-07T18:52:13Z)
- Distilling Audio-Visual Knowledge by Compositional Contrastive Learning [51.20935362463473]
We learn a compositional embedding that closes the cross-modal semantic gap.
We establish a new, comprehensive multi-modal distillation benchmark on three video datasets.
arXiv Detail & Related papers (2021-04-22T09:31:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.