MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
- URL: http://arxiv.org/abs/2201.02639v1
- Date: Fri, 7 Jan 2022 19:00:21 GMT
- Title: MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
- Authors: Rowan Zellers and Jiasen Lu and Ximing Lu and Youngjae Yu and Yanpeng Zhao and Mohammadreza Salehi and Aditya Kusupati and Jack Hessel and Ali Farhadi and Yejin Choi
- Abstract summary: We introduce MERLOT Reserve, a model that represents videos jointly over time.
We replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet.
Our objective learns faster than alternatives, and performs well at scale.
- Score: 90.1857707251566
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As humans, we navigate the world through all our senses, using perceptual
input from each one to correct the others. We introduce MERLOT Reserve, a model
that represents videos jointly over time -- through a new training objective
that learns from audio, subtitles, and video frames. Given a video, we replace
snippets of text and audio with a MASK token; the model learns by choosing the
correct masked-out snippet. Our objective learns faster than alternatives, and
performs well at scale: we pretrain on 20 million YouTube videos.
Empirical results show that MERLOT Reserve learns strong representations
about videos through all constituent modalities. When finetuned, it sets a new
state-of-the-art on both VCR and TVQA, outperforming prior work by 5% and 7%
respectively. Ablations show that both tasks benefit from audio pretraining --
even VCR, a QA task centered around images (without sound). Moreover, our
objective enables out-of-the-box prediction, revealing strong multimodal
commonsense understanding. In a fully zero-shot setting, our model obtains
competitive results on four video understanding tasks, even outperforming
supervised approaches on the recently proposed Situated Reasoning (STAR)
benchmark.
We analyze why incorporating audio leads to better vision-language
representations, suggesting significant opportunities for future research. We
conclude by discussing ethical and societal implications of multimodal
pretraining.
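
The training objective sketched in the abstract (mask out a text or audio snippet, then have the model pick the correct snippet from a set of candidates) is essentially a contrastive matching loss. The snippet below is a minimal, illustrative sketch of that pattern, not the authors' implementation: the toy encoders, dimensions, temperature, and random inputs are assumptions made purely for the example.

```python
# Minimal sketch of the contrastive masked-snippet objective described above:
# a text/audio snippet is replaced by MASK, and the model is trained to pick
# the matching snippet out of in-batch candidates. The encoders, dimensions,
# temperature, and random inputs are illustrative assumptions, not details
# taken from the paper.
import torch
import torch.nn.functional as F


class JointEncoder(torch.nn.Module):
    """Toy stand-in for the joint encoder over frames and MASKed subtitles."""

    def __init__(self, dim=64):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, frames, tokens_with_mask):
        # The real model is a transformer over frame, word, and audio inputs;
        # here we simply mean-pool both streams and project.
        pooled = frames.mean(dim=1) + tokens_with_mask.mean(dim=1)
        return self.proj(pooled)  # one vector per MASK position: (batch, dim)


class SnippetEncoder(torch.nn.Module):
    """Toy stand-in for the encoder of candidate text/audio snippets."""

    def __init__(self, dim=64):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, snippets):
        return self.proj(snippets.mean(dim=1))  # (batch, dim)


def contrastive_mask_loss(mask_reprs, snippet_reprs, temperature=0.05):
    """InfoNCE over the batch: each MASK representation must match its own
    masked-out snippet, with every other snippet acting as a negative."""
    mask_reprs = F.normalize(mask_reprs, dim=-1)
    snippet_reprs = F.normalize(snippet_reprs, dim=-1)
    logits = mask_reprs @ snippet_reprs.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0))  # diagonal entries are positives
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    batch, dim = 8, 64
    frames = torch.randn(batch, 4, dim)    # toy frame features
    tokens = torch.randn(batch, 16, dim)   # toy subtitle tokens incl. MASK
    snippets = torch.randn(batch, 6, dim)  # toy masked-out text/audio spans

    joint, snip = JointEncoder(dim), SnippetEncoder(dim)
    loss = contrastive_mask_loss(joint(frames, tokens), snip(snippets))
    loss.backward()
    print(f"contrastive snippet-matching loss: {loss.item():.3f}")
```

Because the objective is a scoring rule over candidate snippets, the same setup supports the zero-shot use described above: encode each candidate answer (as text or audio) with the snippet encoder and return the candidate whose representation scores highest against the MASK position.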
Related papers
- video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models [27.54879344983513]
Video-SALMONN can understand not only visual frame sequences, audio events, and music, but also speech.
Video-SALMONN demonstrates remarkable video comprehension and reasoning abilities on tasks that are beyond the reach of other audio-visual LLMs.
arXiv Detail & Related papers (2024-06-22T01:36:11Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
Our design is asymmetric with respect to the two modalities, driven by the inherent differences between video and audio.
RAVEn surpasses all self-supervised methods on visual speech recognition.
arXiv Detail & Related papers (2022-12-12T21:04:06Z)
- Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [20.316239155843963]
We propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks.
On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which is a new state-of-the-art on this dataset.
arXiv Detail & Related papers (2021-10-14T12:32:40Z)
- MERLOT: Multimodal Neural Script Knowledge Models [74.05631672657452]
We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech.
MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets.
On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%.
arXiv Detail & Related papers (2021-06-04T17:57:39Z)
- AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches exploit only the visual information while neglecting the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (predicting informative audio attributes) with visual self-supervision (generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.