MAViL: Masked Audio-Video Learners
- URL: http://arxiv.org/abs/2212.08071v2
- Date: Mon, 17 Jul 2023 05:44:35 GMT
- Title: MAViL: Masked Audio-Video Learners
- Authors: Po-Yao Huang, Vasu Sharma, Hu Xu, Chaitanya Ryali, Haoqi Fan, Yanghao
Li, Shang-Wen Li, Gargi Ghosh, Jitendra Malik, Christoph Feichtenhofer
- Abstract summary: We present Masked Audio-Video Learners (MAViL) to train audio-visual representations.
Pre-training with MAViL enables the model to perform well in audio-visual classification and retrieval tasks.
For the first time, a self-supervised audio-visual model outperforms ones that use external supervision on the AudioSet and VGGSound benchmarks.
- Score: 68.61844803682145
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Masked Audio-Video Learners (MAViL) to train audio-visual
representations. Our approach learns with three complementary forms of
self-supervision: (1) reconstruction of masked audio and video input data, (2)
intra- and inter-modal contrastive learning with masking, and (3) self-training
by reconstructing joint audio-video contextualized features learned from the
first two objectives. Pre-training with MAViL not only enables the model to
perform well in audio-visual classification and retrieval tasks but also
improves representations of each modality in isolation, without using
information from the other modality for fine-tuning or inference. Empirically,
MAViL sets a new state-of-the-art on AudioSet (53.1 mAP) and VGGSound (67.1%
accuracy). For the first time, a self-supervised audio-visual model outperforms
ones that use external supervision on these benchmarks.
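The three objectives in the abstract lend themselves to a compact illustration. The block below is a minimal, hedged PyTorch sketch of how (1) masked reconstruction, (2) a contrastive term computed on masked inputs, and (3) self-training against contextualized teacher features might be combined into a single loss. The tiny Transformer encoders, the 80% masking ratio, the equal loss weights, and the names `MAViLSketch`, `random_mask`, and `info_nce` are assumptions for illustration only, not the paper's released implementation; for brevity, only the inter-modal contrastive term is shown.

```python
# A minimal, illustrative sketch of a MAViL-style combined objective.
# Assumptions (not from the paper): the tiny Transformer encoders, the 80%
# masking ratio, equal loss weights, and the names MAViLSketch, random_mask,
# and info_nce. Only the inter-modal contrastive term is shown.
import torch
import torch.nn as nn
import torch.nn.functional as F


def random_mask(x, mask_token, mask_ratio):
    """Replace a random subset of token embeddings with a learned mask token.
    Returns the corrupted sequence and a {0,1} mask (1 = masked position)."""
    mask = (torch.rand(x.shape[:2], device=x.device) < mask_ratio).float()
    x_masked = x * (1.0 - mask).unsqueeze(-1) + mask_token * mask.unsqueeze(-1)
    return x_masked, mask


def info_nce(z1, z2, temperature=0.07):
    """Symmetric InfoNCE between two batches of pooled, paired embeddings."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    labels = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


class MAViLSketch(nn.Module):
    """Toy stand-in operating on pre-extracted audio/video patch features."""

    def __init__(self, dim=256, patch_dim=768, mask_ratio=0.8):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed_a = nn.Linear(patch_dim, dim)
        self.embed_v = nn.Linear(patch_dim, dim)
        self.mask_tok_a = nn.Parameter(torch.zeros(1, 1, dim))
        self.mask_tok_v = nn.Parameter(torch.zeros(1, 1, dim))

        def make_encoder():
            layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=2)

        self.enc_a, self.enc_v, self.enc_av = make_encoder(), make_encoder(), make_encoder()
        self.dec_a = nn.Linear(dim, patch_dim)   # reconstruct raw audio patches
        self.dec_v = nn.Linear(dim, patch_dim)   # reconstruct raw video patches
        self.proj_a = nn.Linear(dim, dim)
        self.proj_v = nn.Linear(dim, dim)

    def forward(self, audio_patches, video_patches, teacher_feats=None):
        # Embed raw patches, then mask a large fraction of each modality.
        ea_m, mask_a = random_mask(self.embed_a(audio_patches), self.mask_tok_a, self.mask_ratio)
        ev_m, mask_v = random_mask(self.embed_v(video_patches), self.mask_tok_v, self.mask_ratio)

        # Per-modality encoders plus a fusion encoder over the concatenation.
        ha, hv = self.enc_a(ea_m), self.enc_v(ev_m)
        hav = self.enc_av(torch.cat([ha, hv], dim=1))

        # Objective 1: reconstruct the raw patch values at masked positions.
        rec_a = F.mse_loss(self.dec_a(ha), audio_patches, reduction="none").mean(-1)
        rec_v = F.mse_loss(self.dec_v(hv), video_patches, reduction="none").mean(-1)
        loss_rec = (rec_a * mask_a).sum() / mask_a.sum().clamp(min=1.0) \
                 + (rec_v * mask_v).sum() / mask_v.sum().clamp(min=1.0)

        # Objective 2: contrastive alignment of pooled audio and video embeddings
        # (inter-modal only here; an intra-modal term would contrast two
        # differently masked views of the same modality).
        loss_con = info_nce(self.proj_a(ha.mean(1)), self.proj_v(hv.mean(1)))

        # Objective 3: self-training, regressing the joint (fused) features onto
        # contextualized targets from a teacher, e.g. an EMA copy of the model.
        loss_distill = hav.new_zeros(())
        if teacher_feats is not None:
            loss_distill = F.smooth_l1_loss(hav, teacher_feats)

        return loss_rec + loss_con + loss_distill


# Toy usage with random tensors standing in for spectrogram / video patches.
audio = torch.randn(2, 64, 768)   # (batch, audio patches, patch feature dim)
video = torch.randn(2, 49, 768)   # (batch, video patches, patch feature dim)
loss = MAViLSketch()(audio, video)
loss.backward()
```

The sketch simply sums the three terms; in practice the reconstruction, contrastive, and self-training losses would be weighted, and the teacher features refreshed as training progresses.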
Related papers
- Diffusion Models as Masked Audio-Video Learners [27.22726553443404]
Masked Audio-Video Learners (MAViL) has emerged as a state-of-the-art audio-video pre-training framework.
In this paper, we study the potential synergy between diffusion models and MAViL.
arXiv Detail & Related papers (2023-10-05T23:00:27Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information while simultaneously performing lightweight domain adaptation.
We show that the added visual components can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- Audiovisual Masked Autoencoders [93.22646144125457]
We show that we can achieve significant improvements on audiovisual downstream classification tasks.
We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens.
arXiv Detail & Related papers (2022-12-09T17:34:53Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
To integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens (a hedged sketch of this kind of objective appears after this list).
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
We propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE).
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
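The VATLM entry above describes optimizing a masked prediction task of unified tokens. Below is a hedged, self-contained sketch of what such an objective can look like once audio, video, and text have already been mapped to discrete IDs in one shared vocabulary; the vocabulary size, the 15% masking ratio, and the class name `MaskedTokenPredictor` are illustrative assumptions, not VATLM's actual architecture or tokenizer.

```python
# Hedged sketch of a masked unified-token prediction objective, in the spirit of
# the VATLM entry above (not the authors' code). Vocabulary size, masking ratio,
# and the name MaskedTokenPredictor are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedTokenPredictor(nn.Module):
    """Predict the IDs of masked positions in a sequence of unified tokens,
    i.e. discrete tokens drawn from one shared vocabulary regardless of whether
    they originally came from audio, video, or text."""

    def __init__(self, vocab_size=2048, dim=256, mask_ratio=0.15):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_id = vocab_size          # reserve one extra ID as the [MASK] token
        self.embed = nn.Embedding(vocab_size + 1, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Randomly replace a fraction of token IDs with [MASK].
        mask = torch.rand(tokens.shape, device=tokens.device) < self.mask_ratio
        corrupted = tokens.masked_fill(mask, self.mask_id)

        # Encode the corrupted sequence and score the shared vocabulary.
        logits = self.head(self.encoder(self.embed(corrupted)))

        # Cross-entropy only at masked positions; everything else is ignored.
        targets = tokens.masked_fill(~mask, -100)      # -100 = ignore_index
        return F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=-100)


# Toy usage: random token IDs standing in for quantized audio/video/text units.
tokens = torch.randint(0, 2048, (2, 100))
loss = MaskedTokenPredictor()(tokens)
loss.backward()
```

Because the vocabulary is shared across modalities, the same cross-entropy head serves all three input streams; modality-specific tokenizers would be needed upstream to produce the IDs.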