Masking Modalities for Cross-modal Video Retrieval
- URL: http://arxiv.org/abs/2111.01300v2
- Date: Wed, 3 Nov 2021 12:36:48 GMT
- Title: Masking Modalities for Cross-modal Video Retrieval
- Authors: Valentin Gabeur, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia
Schmid
- Abstract summary: A common strategy for pre-training video encoders is to use the accompanying speech as weak supervision.
We propose to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech.
We show the superior performance of our "modality masking" pre-training approach for video retrieval on the How2R, YouCook2 and Condensed Movies datasets.
- Score: 93.10669981708878
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-training on large-scale unlabelled datasets has shown impressive
performance improvements in the fields of computer vision and natural language
processing. Given the advent of large-scale instructional video datasets, a
common strategy for pre-training video encoders is to use the accompanying
speech as weak supervision. However, as speech is used to supervise the
pre-training, it is never seen by the video encoder, which does not learn to
process that modality. We address this drawback of current pre-training
methods, which fail to exploit the rich cues in spoken language. Our proposal
is to pre-train a video encoder using all the available video modalities as
supervision, namely, appearance, sound, and transcribed speech. We mask an
entire modality in the input and predict it using the other two modalities.
This encourages each modality to collaborate with the others, and our video
encoder learns to process appearance and audio as well as speech. We show the
superior performance of our "modality masking" pre-training approach for video
retrieval on the How2R, YouCook2 and Condensed Movies datasets.
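As a rough illustration of the modality-masking objective described in the abstract, the sketch below shows the core idea in PyTorch: hide one entire modality and predict it from the other two. It assumes pre-extracted appearance, audio and speech features with made-up dimensions; all class, function and parameter names, the pooling scheme and the in-batch contrastive loss are hypothetical stand-ins rather than the paper's actual architecture or training recipe.

```python
# Hedged sketch of modality-masking pre-training: hide one entire modality and
# predict it from the other two. Feature dims, pooling and loss are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

MODALITIES = ["appearance", "audio", "speech"]

class MaskedModalityEncoder(nn.Module):
    def __init__(self, feat_dims, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        # Project each modality's pre-extracted features into a shared space.
        self.proj = nn.ModuleDict(
            {m: nn.Linear(feat_dims[m], d_model) for m in MODALITIES})
        # Learned token that stands in for a fully masked modality.
        self.mask_token = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)
        # Learned per-modality embeddings so the encoder can tell streams apart.
        self.mod_embed = nn.ParameterDict(
            {m: nn.Parameter(torch.randn(1, 1, d_model) * 0.02) for m in MODALITIES})
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, feats, masked):
        """feats[m]: (B, T_m, feat_dims[m]); `masked` names the modality to hide."""
        tokens = []
        for m in MODALITIES:
            x = self.proj[m](feats[m])
            if m == masked:
                # Replace the entire modality with mask tokens.
                x = self.mask_token.expand(x.size(0), x.size(1), -1)
            tokens.append(x + self.mod_embed[m])
        out = self.encoder(torch.cat(tokens, dim=1))
        # Pool the positions belonging to the masked modality as its prediction.
        start = sum(feats[m].size(1) for m in MODALITIES[:MODALITIES.index(masked)])
        end = start + feats[masked].size(1)
        return out[:, start:end].mean(dim=1)            # (B, d_model)

def masked_modality_loss(model, feats, masked):
    """In-batch contrastive loss: the prediction made from the two visible
    modalities should match the masked modality's own projected features."""
    pred = model(feats, masked)                                  # (B, d)
    target = model.proj[masked](feats[masked]).mean(dim=1)       # (B, d)
    logits = pred @ target.t() / pred.size(-1) ** 0.5            # (B, B)
    labels = torch.arange(pred.size(0))
    return F.cross_entropy(logits, labels)

# One illustrative pre-training step: cycle the masked modality so the encoder
# learns to process (and predict) appearance, audio and speech alike.
feat_dims = {"appearance": 2048, "audio": 128, "speech": 768}    # made-up dims
model = MaskedModalityEncoder(feat_dims)
feats = {m: torch.randn(4, 16, d) for m, d in feat_dims.items()} # B=4, T=16
loss = sum(masked_modality_loss(model, feats, m) for m in MODALITIES)
loss.backward()
```

The sketch only conveys the "mask one modality, predict it from the other two" idea; the paper's actual pre-training and retrieval fine-tuning details differ.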
Related papers
- video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models [27.54879344983513]
Video-SALMONN can understand not only visual frame sequences, audio events and music, but speech as well.
Video-SALMONN demonstrates remarkable video comprehension and reasoning abilities on tasks that other av-LLMs have not addressed.
arXiv Detail & Related papers (2024-06-22T01:36:11Z)
- Unified Video-Language Pre-training with Synchronized Audio [21.607860535968356]
We propose an enhanced framework for Video-Language pre-training with Synchronized Audio.
Our framework learns tri-modal representations in a unified self-supervised transformer.
Our model, pre-trained on only 0.9M data, achieves improved results over state-of-the-art baselines.
arXiv Detail & Related papers (2024-05-12T07:59:46Z)
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
- End-to-end Generative Pretraining for Multimodal Video Captioning [82.79187814057313]
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos.
Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly.
Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks.
arXiv Detail & Related papers (2022-01-20T16:16:21Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.