Audio-visual Generalised Zero-shot Learning with Cross-modal Attention
and Language
- URL: http://arxiv.org/abs/2203.03598v1
- Date: Mon, 7 Mar 2022 18:52:13 GMT
- Title: Audio-visual Generalised Zero-shot Learning with Cross-modal Attention
and Language
- Authors: Otniel-Bogdan Mercea, Lukas Riesch, A. Sophia Koepke, Zeynep Akata
- Abstract summary: We propose to learn multi-modal representations from audio-visual data using cross-modal attention.
In our generalised audio-visual zero-shot learning setting, we include all the training classes in the test-time search space.
Due to the lack of a unified benchmark in this domain, we introduce a (generalised) zero-shot learning benchmark on three audio-visual datasets.
- Score: 38.02396786726476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning to classify video data from classes not included in the training
data, i.e. video-based zero-shot learning, is challenging. We conjecture that
the natural alignment between the audio and visual modalities in video data
provides a rich training signal for learning discriminative multi-modal
representations. Focusing on the relatively underexplored task of audio-visual
zero-shot learning, we propose to learn multi-modal representations from
audio-visual data using cross-modal attention and exploit textual label
embeddings for transferring knowledge from seen classes to unseen classes.
Taking this one step further, in our generalised audio-visual zero-shot
learning setting, we include all the training classes in the test-time search
space which act as distractors and increase the difficulty while making the
setting more realistic. Due to the lack of a unified benchmark in this domain,
we introduce a (generalised) zero-shot learning benchmark on three audio-visual
datasets of varying sizes and difficulty, VGGSound, UCF, and ActivityNet,
ensuring that the unseen test classes do not appear in the dataset used for
supervised training of the backbone deep models. Comparing multiple relevant
and recent methods, we demonstrate that our proposed AVCA model achieves
state-of-the-art performance on all three datasets. Code and data will be
available at https://github.com/ExplainableML/AVCA-GZSL.
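The abstract names the key ingredients, cross-modal attention between the audio and visual streams, textual label embeddings as the transfer mechanism, and a test-time search space containing both seen and unseen classes, without spelling them out. The PyTorch sketch below is a minimal illustration of that recipe, not the released AVCA implementation: the module layout, embedding sizes, pooling, and the names CrossModalAttention and gzsl_classify are assumptions made for this example.

```python
# Minimal sketch (not the official AVCA code): cross-modal attention between
# audio and visual features, followed by classification against textual label
# embeddings over the union of seen and unseen classes (generalised ZSL).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAttention(nn.Module):
    """Illustrative cross-modal attention block; all dimensions are assumptions."""

    def __init__(self, dim_audio=512, dim_video=512, dim_embed=300, num_heads=4):
        super().__init__()
        # Project both modalities into a shared space whose size matches the
        # textual label embeddings (300-d here is an arbitrary choice).
        self.audio_proj = nn.Linear(dim_audio, dim_embed)
        self.video_proj = nn.Linear(dim_video, dim_embed)
        # Each modality attends to the other one (cross-modal attention).
        self.audio_from_video = nn.MultiheadAttention(dim_embed, num_heads, batch_first=True)
        self.video_from_audio = nn.MultiheadAttention(dim_embed, num_heads, batch_first=True)

    def forward(self, audio, video):
        # audio: (B, T_a, dim_audio), video: (B, T_v, dim_video)
        a = self.audio_proj(audio)
        v = self.video_proj(video)
        # Audio queries attend to video keys/values, and vice versa.
        a_att, _ = self.audio_from_video(a, v, v)
        v_att, _ = self.video_from_audio(v, a, a)
        # Pool over time and fuse into a single video-level embedding.
        fused = 0.5 * (a_att.mean(dim=1) + v_att.mean(dim=1))
        return F.normalize(fused, dim=-1)


def gzsl_classify(fused, class_text_emb):
    """Predict classes by cosine similarity to textual label embeddings.

    class_text_emb covers the union of seen and unseen classes, so the seen
    (training) classes act as distractors, as in the generalised setting.
    """
    class_text_emb = F.normalize(class_text_emb, dim=-1)  # (C, dim_embed)
    logits = fused @ class_text_emb.t()                   # (B, C)
    return logits.argmax(dim=-1)


# Usage with random stand-ins for real features and label embeddings.
model = CrossModalAttention()
audio, video = torch.randn(2, 10, 512), torch.randn(2, 16, 512)
all_class_text = torch.randn(42, 300)   # seen + unseen class label embeddings
print(gzsl_classify(model(audio, video), all_class_text))
```

In this sketch, class_text_emb would hold textual embeddings (e.g. word embeddings of the class names) for all seen and unseen classes, so that seen classes act as distractors; restricting it to the unseen classes alone recovers the standard zero-shot evaluation.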
Related papers
- Unified Video-Language Pre-training with Synchronized Audio [21.607860535968356]
We propose an enhanced framework for Video-Language pre-training with Synchronized Audio.
Our framework learns tri-modal representations in a unified self-supervised transformer.
Our model, pre-trained on only 0.9M data, achieves improved results against state-of-the-art baselines.
arXiv Detail & Related papers (2024-05-12T07:59:46Z)
- Class-Incremental Grouping Network for Continual Audio-Visual Learning [42.284785756540806]
We propose a class-incremental grouping network (CIGN) that can learn category-wise semantic features to achieve continual audio-visual learning.
We conduct extensive experiments on VGGSound-Instruments, VGGSound-100, and VGG-Sound Sources benchmarks.
Our experimental results demonstrate that the CIGN achieves state-of-the-art audio-visual class-incremental learning performance.
arXiv Detail & Related papers (2023-09-11T07:36:16Z)
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
Our design is asymmetric w.r.t. the two modalities' pretext tasks, driven by the inherent differences between video and audio.
RAVEn surpasses all self-supervised methods on visual speech recognition.
arXiv Detail & Related papers (2022-12-12T21:04:06Z)
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- Temporal and cross-modal attention for audio-visual zero-shot learning [38.02396786726476]
Generalised zero-shot learning for video classification requires understanding the relations between the audio and visual information.
We propose a multi-modal and Temporal Cross-attention Framework (TCaF) for audio-visual generalised zero-shot learning.
We show that our proposed framework, which ingests temporal features, yields state-of-the-art performance on the UCF, VGGSound, and ActivityNet benchmarks for (generalised) zero-shot learning.
arXiv Detail & Related papers (2022-07-20T15:19:30Z)
- Distilling Audio-Visual Knowledge by Compositional Contrastive Learning [51.20935362463473]
We learn a compositional embedding that closes the cross-modal semantic gap.
We establish a new, comprehensive multi-modal distillation benchmark on three video datasets.
arXiv Detail & Related papers (2021-04-22T09:31:20Z)
- Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence, and that self-supervised models trained on our automatically constructed data achieve downstream performance similar to models trained on existing video datasets of comparable scale.
arXiv Detail & Related papers (2021-01-26T14:27:47Z)
- AVLnet: Learning Audio-Visual Language Representations from Instructional Videos [69.56522471911396]
We introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs.
We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks.
Our code, data, and trained models will be released at avlnet.csail.mit.edu.
arXiv Detail & Related papers (2020-06-16T14:38:03Z)
- AVGZSLNet: Audio-Visual Generalized Zero-Shot Learning by Reconstructing Label Features from Multi-Modal Embeddings [37.3282534461213]
We propose a novel approach for generalized zero-shot learning in a multi-modal setting.
We use the semantic relatedness of text embeddings as a means for zero-shot learning by aligning audio and video embeddings with the corresponding class label text feature space, as sketched below.
arXiv Detail & Related papers (2020-05-27T14:58:34Z)
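The AVGZSLNet entry above describes its mechanism only in words: audio and video embeddings are aligned with the class-label text feature space so that unseen classes can be recognised through their text embeddings. The snippet below is a generic, heavily simplified sketch of such a text-anchored alignment loss; the function name and the plain cosine objective are assumptions for illustration and do not reproduce AVGZSLNet's actual reconstruction-based objective.

```python
# Generic sketch of text-anchored audio-visual alignment for zero-shot
# learning (illustrative only; not AVGZSLNet's actual objective or code).
import torch
import torch.nn.functional as F


def text_alignment_loss(audio_emb, video_emb, label_text_emb):
    """Pull each clip's audio and video embeddings toward the text embedding
    of its ground-truth class. All inputs are (B, D) tensors, one row per clip.

    After such training, unseen classes can be recognised at test time by
    nearest-neighbour search against their class-label text embeddings.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    label_text_emb = F.normalize(label_text_emb, dim=-1)
    # Cosine-distance terms for both modalities against the class text feature.
    loss_audio = (1.0 - (audio_emb * label_text_emb).sum(dim=-1)).mean()
    loss_video = (1.0 - (video_emb * label_text_emb).sum(dim=-1)).mean()
    return loss_audio + loss_video


# Usage with random stand-ins for real audio, video, and class-text features.
audio = torch.randn(8, 300)
video = torch.randn(8, 300)
text = torch.randn(8, 300)
print(text_alignment_loss(audio, video, text))
```

AVGZSLNet itself, as its title states, additionally reconstructs label features from the multi-modal embeddings; the cosine term above only captures the shared idea of matching both modalities to the label text space.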
This list is automatically generated from the titles and abstracts of the papers in this site.