MERLOT: Multimodal Neural Script Knowledge Models
- URL: http://arxiv.org/abs/2106.02636v1
- Date: Fri, 4 Jun 2021 17:57:39 GMT
- Title: MERLOT: Multimodal Neural Script Knowledge Models
- Authors: Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park,
Jize Cao, Ali Farhadi, Yejin Choi
- Abstract summary: We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech.
MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets.
On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%.
- Score: 74.05631672657452
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As humans, we understand events in the visual world contextually, performing
multimodal reasoning across time to make inferences about the past, present,
and future. We introduce MERLOT, a model that learns multimodal script
knowledge by watching millions of YouTube videos with transcribed speech -- in
an entirely label-free, self-supervised manner. By pretraining with a mix of
both frame-level (spatial) and video-level (temporal) objectives, our model not
only learns to match images to temporally corresponding words, but also to
contextualize what is happening globally over time. As a result, MERLOT
exhibits strong out-of-the-box representations of temporal commonsense, and
achieves state-of-the-art performance on 12 different video QA datasets when
finetuned. It also transfers well to the world of static images, allowing
models to reason about the dynamic context behind visual scenes. On Visual
Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy,
outperforming state-of-the-art models of similar size by over 3%, even those
that make heavy use of auxiliary supervised data (like object bounding boxes).
Ablation analyses demonstrate the complementary importance of: 1) training on
videos versus static images; 2) scaling the magnitude and diversity of the
pretraining video corpus; and 3) using diverse objectives that encourage
full-stack multimodal reasoning, from the recognition to cognition level.
Related papers
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos.
It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips.
Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
arXiv Detail & Related papers (2024-10-14T17:59:58Z) - Video In-context Learning [46.40277880351059]
In this paper, we study video in-context learning, where the model starts from an existing video clip and generates diverse potential future sequences.
To achieve this, we provide a clear definition of the task, and train an autoregressive Transformer on video datasets.
We design various evaluation metrics, including both objective and subjective measures, to demonstrate the visual quality and semantic accuracy of generation results.
arXiv Detail & Related papers (2024-07-10T04:27:06Z) - Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision,
Language, Audio, and Action [46.76487873983082]
Unified-IO 2 is the first autoregressive multimodal model capable of understanding and generating image, text, audio, and action.
We train our model from scratch on a large multimodal pre-training corpus from diverse sources.
With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark.
arXiv Detail & Related papers (2023-12-28T17:57:06Z) - Pre-training Contextualized World Models with In-the-wild Videos for
Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z) - Understanding ME? Multimodal Evaluation for Fine-grained Visual
Commonsense [98.70218717851665]
It is unclear whether the models really understand the visual scene and underlying commonsense knowledge due to limited evaluation data resources.
We present a Multimodal Evaluation (ME) pipeline to automatically generate question-answer pairs to test models' understanding of the visual scene, text, and related knowledge.
We then take a step further to show that training with the ME data boosts the model's performance in standard VCR evaluation.
arXiv Detail & Related papers (2022-11-10T21:44:33Z) - Audio-visual Generalised Zero-shot Learning with Cross-modal Attention
and Language [38.02396786726476]
We propose to learn multi-modal representations from audio-visual data using cross-modal attention.
In our generalised audio-visual zero-shot learning setting, we include all the training classes in the test-time search space.
Due to the lack of a unified benchmark in this domain, we introduce a (generalised) zero-shot learning benchmark on three audio-visual datasets.
arXiv Detail & Related papers (2022-03-07T18:52:13Z) - MERLOT Reserve: Neural Script Knowledge through Vision and Language and
Sound [90.1857707251566]
We introduce MERLOT Reserve, a model that represents videos jointly over time.
We replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet.
Our objective learns faster than alternatives, and performs well at scale.
arXiv Detail & Related papers (2022-01-07T19:00:21Z) - Where and When: Space-Time Attention for Audio-Visual Explanations [42.093794819606444]
We propose a novel space-time attention network that uncovers the synergistic dynamics of audio and visual data over both space and time.
Our model is capable of predicting the audio-visual video events, while justifying its decision by localizing where the relevant visual cues appear.
arXiv Detail & Related papers (2021-05-04T14:16:55Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal
Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model is also comprised of dual-level attention (word/object and frame level), multi-head self-cross-integration for different sources (video and dense captions), and which pass more relevant information to gates.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.