Distilling Audio-Visual Knowledge by Compositional Contrastive Learning
- URL: http://arxiv.org/abs/2104.10955v1
- Date: Thu, 22 Apr 2021 09:31:20 GMT
- Title: Distilling Audio-Visual Knowledge by Compositional Contrastive Learning
- Authors: Yanbei Chen, Yongqin Xian, A. Sophia Koepke, Ying Shan, Zeynep Akata
- Abstract summary: We learn a compositional embedding that closes the cross-modal semantic gap.
We establish a new, comprehensive multi-modal distillation benchmark on three video datasets.
- Score: 51.20935362463473
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Having access to multi-modal cues (e.g. vision and audio) enables some
cognitive tasks to be performed faster than learning from a single modality.
In this work, we propose to transfer knowledge across heterogeneous modalities,
even though these data modalities may not be semantically correlated. Rather
than directly aligning the representations of different modalities, we compose
audio, image, and video representations across modalities to uncover richer
multi-modal knowledge. Our main idea is to learn a compositional embedding that
closes the cross-modal semantic gap and captures the task-relevant semantics,
which facilitates pulling together representations across modalities by
compositional contrastive learning. We establish a new, comprehensive
multi-modal distillation benchmark on three video datasets: UCF101,
ActivityNet, and VGGSound. Moreover, we demonstrate that our model
significantly outperforms a variety of existing knowledge distillation methods
in transferring audio-visual knowledge to improve video representation
learning. Code is released here: https://github.com/yanbeic/CCL.
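To make the main idea more concrete, below is a minimal, hypothetical PyTorch sketch of compositional contrastive distillation. It is not the authors' released implementation (see the repository linked above): a teacher modality feature (e.g. audio) is composed with the student video feature into a joint embedding, and an InfoNCE-style loss pulls each video embedding toward its own composition within a batch. All module names, dimensions, and the fusion scheme are illustrative assumptions.

```python
# Minimal, illustrative sketch of compositional contrastive distillation.
# NOT the authors' implementation (see https://github.com/yanbeic/CCL);
# module names, dimensions, and the fusion scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompositionalEmbedding(nn.Module):
    """Compose a teacher modality feature with the student video feature."""

    def __init__(self, teacher_dim: int, student_dim: int, embed_dim: int = 256):
        super().__init__()
        self.proj_t = nn.Linear(teacher_dim, embed_dim)   # project teacher (audio/image)
        self.proj_s = nn.Linear(student_dim, embed_dim)   # project student (video)
        self.fuse = nn.Sequential(                        # simple learned fusion
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, f_teacher: torch.Tensor, f_student: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.proj_t(f_teacher), self.proj_s(f_student)], dim=-1)
        return F.normalize(self.fuse(z), dim=-1)          # unit-norm compositional embedding


def contrastive_loss(z_comp: torch.Tensor, z_student: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: each student embedding should match its own composition."""
    z_student = F.normalize(z_student, dim=-1)
    logits = z_student @ z_comp.t() / tau                 # (B, B) similarity matrix
    targets = torch.arange(z_comp.size(0), device=z_comp.device)
    return F.cross_entropy(logits, targets)


# Toy usage with random tensors standing in for teacher/student network outputs.
B, d_audio, d_video = 8, 512, 1024
f_audio = torch.randn(B, d_audio)                         # audio-teacher features
f_video = torch.randn(B, d_video, requires_grad=True)     # student video features
composer = CompositionalEmbedding(d_audio, d_video)
student_head = nn.Linear(d_video, 256)                    # hypothetical student projection head
loss = contrastive_loss(composer(f_audio, f_video), student_head(f_video))
loss.backward()
```

In a full distillation setup the audio/image teachers would be frozen and only the student video network and the composition module would receive gradients; the random tensors above merely show the shapes and the loss computation.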
Related papers
- SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos [77.55518265996312]
We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos.
Our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree.
arXiv Detail & Related papers (2024-04-08T05:19:28Z)
- Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph Propagation [68.13453771001522]
We propose a multimodal intensive ZSL framework that matches regions of images with corresponding semantic embeddings.
We conduct extensive experiments and evaluate our model on large-scale real-world data.
arXiv Detail & Related papers (2023-06-14T13:07:48Z)
- Accommodating Audio Modality in CLIP for Multimodal Processing [48.83906067348211]
We extend the Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.
Specifically, we apply inter-modal and intra-modal contrastive learning to explore the correlation between audio and other modalities.
Our proposed CLIP4VLA model is validated on different downstream tasks, including video retrieval and video captioning.
arXiv Detail & Related papers (2023-03-12T06:57:01Z)
- Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language [38.02396786726476]
We propose to learn multi-modal representations from audio-visual data using cross-modal attention.
In our generalised audio-visual zero-shot learning setting, we include all the training classes in the test-time search space.
Due to the lack of a unified benchmark in this domain, we introduce a (generalised) zero-shot learning benchmark on three audio-visual datasets.
arXiv Detail & Related papers (2022-03-07T18:52:13Z)
- Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos [10.478479158063982]
We propose a novel framework to effectively leverage the audio modality in vision-specific annotated videos for action recognition.
We build a semantic audio-video label dictionary (SAVLD) that maps each video label to its K most relevant audio labels (see the sketch after this list).
We also present a new two-stream video Transformer for efficiently modeling the visual modalities.
arXiv Detail & Related papers (2022-03-06T17:31:06Z)
- Leveraging Uni-Modal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition [23.239078852797817]
We leverage uni-modal self-supervised learning to promote multimodal audio-visual speech recognition (AVSR).
In particular, we first train audio and visual encoders on a large-scale uni-modal dataset, then we integrate components of both encoders into a larger multimodal framework.
Our model is experimentally validated on both word-level and sentence-level AVSR tasks.
arXiv Detail & Related papers (2022-02-24T15:12:17Z)
- TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation [35.93516937521393]
We introduce TriBERT -- a transformer-based architecture inspired by ViLBERT.
TriBERT enables contextual feature learning across three modalities: vision, pose, and audio.
We show that the learned TriBERT representations are generic and significantly improve performance on other audio-visual tasks.
arXiv Detail & Related papers (2021-10-26T04:50:42Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied to video, video-text, image, and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z)
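As referenced in the Learnable Irrelevant Modality Dropout entry above, a SAVLD-style label dictionary can be pictured as a simple top-K lookup. The sketch below is a toy illustration under assumed inputs: the label names are made up and the similarity matrix is random stand-in data, not the semantic relevance measure used in the paper.

```python
# Toy sketch of a SAVLD-style dictionary: map each video label to its K most
# relevant audio labels. The similarity scores here are random placeholders;
# the paper's actual relevance measure is not reproduced.
import numpy as np

video_labels = ["playing guitar", "mowing lawn", "typing"]            # hypothetical video classes
audio_labels = ["guitar strumming", "lawn mower", "keyboard clicks",
                "dog barking", "applause"]                            # hypothetical audio classes

rng = np.random.default_rng(0)
similarity = rng.random((len(video_labels), len(audio_labels)))       # stand-in semantic similarity

K = 2
savld = {
    video: [audio_labels[j] for j in np.argsort(-similarity[i])[:K]]  # keep the K highest-scoring
    for i, video in enumerate(video_labels)
}
print(savld)
```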