Audiovisual Masked Autoencoders
- URL: http://arxiv.org/abs/2212.05922v3
- Date: Thu, 4 Jan 2024 16:52:34 GMT
- Authors: Mariana-Iuliana Georgescu, Eduardo Fonseca, Radu Tudor Ionescu, Mario
Lucic, Cordelia Schmid, Anurag Arnab
- Abstract summary: We show that we can achieve significant improvements on audiovisual downstream classification tasks.
We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens.
- Score: 93.22646144125457
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Can we leverage the audiovisual information already present in video to
improve self-supervised representation learning? To answer this question, we
study various pretraining architectures and objectives within the masked
autoencoding framework, motivated by the success of similar methods in natural
language and image understanding. We show that we can achieve significant
improvements on audiovisual downstream classification tasks, surpassing the
state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our
audiovisual pretraining scheme for multiple unimodal downstream tasks using a
single audiovisual pretrained model. We additionally demonstrate the
transferability of our representations, achieving state-of-the-art audiovisual
results on Epic Kitchens without pretraining specifically for this dataset.
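The masked-autoencoding objective described above can be illustrated with a toy sketch: video and audio patch tokens are concatenated into one sequence, a large fraction is randomly masked, and a reconstruction loss is computed on the masked positions only. This is a minimal illustration, not the paper's implementation; the token counts, dimensions, and the identity encoder/decoder are placeholder assumptions (a real model would use Transformer encoders and decoders).

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(num_tokens, mask_ratio, rng):
    """Return indices of visible and masked tokens (MAE-style random masking)."""
    num_visible = int(num_tokens * (1 - mask_ratio))
    perm = rng.permutation(num_tokens)
    return perm[:num_visible], perm[num_visible:]

# Toy "patch tokens": 196 video patches and 64 audio-spectrogram patches,
# each embedded in a 32-dim space (all sizes here are illustrative).
video_tokens = rng.standard_normal((196, 32))
audio_tokens = rng.standard_normal((64, 32))
tokens = np.concatenate([video_tokens, audio_tokens])  # joint audiovisual sequence

visible_idx, masked_idx = random_mask(len(tokens), mask_ratio=0.75, rng=rng)

# Stand-in encoder/decoder: identity maps, since the point here is the
# objective, not the architecture.
encoded = tokens[visible_idx]           # the encoder sees visible tokens only
reconstruction = np.zeros_like(tokens)  # the decoder predicts every position
reconstruction[visible_idx] = encoded

# MAE loss: mean squared error computed on the *masked* positions only.
loss = np.mean((reconstruction[masked_idx] - tokens[masked_idx]) ** 2)
```

With a 75% mask ratio, 65 of the 260 tokens are visible and the loss is driven entirely by the 195 positions the encoder never saw, which is what makes the pretext task non-trivial.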
Related papers
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- MAViL: Masked Audio-Video Learners [68.61844803682145]
We present Masked Audio-Video learners (MAViL) to train audio-visual representations.
Pre-training with MAViL enables the model to perform well in audio-visual classification and retrieval tasks.
For the first time, a self-supervised audio-visual model outperforms models that use external supervision on these benchmarks.
arXiv Detail & Related papers (2022-12-15T18:59:59Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (predicting informative audio attributes) with visual self-supervision (generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
- Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.