Audio Representation Learning by Distilling Video as Privileged Information
- URL: http://arxiv.org/abs/2302.02845v1
- Date: Mon, 6 Feb 2023 15:09:34 GMT
- Title: Audio Representation Learning by Distilling Video as Privileged Information
- Authors: Amirhossein Hajavi, Ali Etemad
- Abstract summary: We propose a novel approach for deep audio representation learning using audio-visual data when the video modality is absent at inference.
We adopt teacher-student knowledge distillation under the framework of learning using privileged information (LUPI).
We show considerable improvements over audio-only recognition as well as prior works that use LUPI.
- Score: 25.71206255965502
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Deep audio representation learning using multi-modal audio-visual
data often leads to better performance than uni-modal approaches. However, in
real-world scenarios both modalities are not always available at inference
time, which degrades the performance of models trained for multi-modal
inference. In this work, we propose a novel approach for deep audio
representation learning using audio-visual data when the video modality is
absent at inference. For this purpose, we adopt teacher-student knowledge
distillation under the framework of learning using privileged information
(LUPI). Whereas previous LUPI methods use soft labels generated by the
teacher, our method uses embeddings learned by the teacher to train the
student network. We integrate our method into two different settings:
sequential data, where the features are divided into multiple segments over
time, and non-sequential data, where the entire features are treated as one
whole segment. In the non-sequential setting, both the teacher and student
networks consist of an encoder component and a task header. We use the
embeddings produced by the encoder component of the teacher to train the
encoder of the student, while the task header of the student is trained using
ground-truth labels. In the sequential setting, the networks have an
additional aggregation component placed between the encoder and the task
header. We use two sets of embeddings, produced by the encoder and
aggregation components of the teacher, to train the student. As in the
non-sequential setting, the task header of the student network is trained
using ground-truth labels. We test our framework on two audio-visual tasks,
namely speaker recognition and speech emotion recognition, and show
considerable improvements over audio-only recognition as well as prior works
that use LUPI.
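To make the distillation setup concrete, the sketch below illustrates the non-sequential setting described above: a frozen audio-visual teacher supplies embeddings that supervise the student's audio-only encoder, while the student's task header is trained on ground-truth labels. This is a minimal PyTorch sketch under stated assumptions; the layer sizes, the MSE embedding loss, the loss weight alpha, and all module names are illustrative choices rather than details from the paper. The sequential setting would additionally match the teacher's aggregation-level embeddings with a second embedding loss.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps input features to a fixed-size embedding (hypothetical architecture)."""
    def __init__(self, in_dim, emb_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

class TaskHeader(nn.Module):
    """Predicts task labels (e.g. speaker or emotion class) from an embedding."""
    def __init__(self, emb_dim, n_classes):
        super().__init__()
        self.fc = nn.Linear(emb_dim, n_classes)

    def forward(self, z):
        return self.fc(z)

# Teacher: assumed pre-trained on audio-visual features; frozen during distillation.
teacher_enc = Encoder(in_dim=1024, emb_dim=256).eval()

# Student: audio-only encoder plus task header, trained jointly.
student_enc = Encoder(in_dim=512, emb_dim=256)
student_head = TaskHeader(emb_dim=256, n_classes=10)

emb_loss = nn.MSELoss()            # match student embeddings to teacher embeddings
task_loss = nn.CrossEntropyLoss()  # train the task header on ground-truth labels
optimizer = torch.optim.Adam(
    list(student_enc.parameters()) + list(student_head.parameters()), lr=1e-4)

def train_step(audio_feats, av_feats, labels, alpha=1.0):
    # av_feats is the privileged audio-visual input: available only at training time.
    with torch.no_grad():
        z_teacher = teacher_enc(av_feats)
    z_student = student_enc(audio_feats)
    logits = student_head(z_student)
    loss = emb_loss(z_student, z_teacher) + alpha * task_loss(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference only student_enc and student_head are used, so the video modality is never required once training is complete.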
Related papers
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language [38.02396786726476]
We propose to learn multi-modal representations from audio-visual data using cross-modal attention.
In our generalised audio-visual zero-shot learning setting, we include all the training classes in the test-time search space.
Due to the lack of a unified benchmark in this domain, we introduce a (generalised) zero-shot learning benchmark on three audio-visual datasets.
arXiv Detail & Related papers (2022-03-07T18:52:13Z)
- Learning music audio representations via weak language supervision [14.335950077921435]
We design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks.
Weak supervision is provided in the form of noisy natural language descriptions conveying the overall musical content of the track.
We demonstrate the usefulness of our approach by comparing the performance of audio representations produced by the same audio backbone with different training strategies.
arXiv Detail & Related papers (2021-12-08T10:30:52Z)
- Masking Modalities for Cross-modal Video Retrieval [93.10669981708878]
A common strategy for pre-training video encoders is to use the accompanying speech as weak supervision.
We propose to pre-train a video encoder using all the available video modalities as supervision, namely appearance, sound, and transcribed speech.
We show the superior performance of our "modality masking" pre-training approach for video retrieval on the How2R, YouCook2 and Condensed Movies datasets.
arXiv Detail & Related papers (2021-11-01T23:55:04Z)
- Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [20.316239155843963]
We propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks.
On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which is a new state-of-the-art on this dataset.
arXiv Detail & Related papers (2021-10-14T12:32:40Z)
- Multi-task Voice-Activated Framework using Self-supervised Learning [0.9864260997723973]
Self-supervised learning methods such as wav2vec 2.0 have shown promising results in learning speech representations from unlabelled and untranscribed speech data.
We propose a general-purpose framework for adapting a pre-trained wav2vec 2.0 model for different voice-activated tasks.
arXiv Detail & Related papers (2021-10-03T19:28:57Z)
- Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds [118.54908665440826]
Humans can robustly recognize and localize objects by using visual and/or auditory cues.
This work develops an approach for scene understanding purely based on sounds.
The co-existence of visual and audio cues is leveraged for supervision transfer.
arXiv Detail & Related papers (2021-09-06T22:24:00Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representations by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Distilling Audio-Visual Knowledge by Compositional Contrastive Learning [51.20935362463473]
We learn a compositional embedding that closes the cross-modal semantic gap.
We establish a new, comprehensive multi-modal distillation benchmark on three video datasets.
arXiv Detail & Related papers (2021-04-22T09:31:20Z)
- Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds [106.87299276189458]
Humans can robustly recognize and localize objects by integrating visual and auditory cues.
This work develops an approach for dense semantic labelling of sound-making objects, purely based on sounds.
arXiv Detail & Related papers (2020-03-09T15:49:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.