Look Before you Speak: Visually Contextualized Utterances
- URL: http://arxiv.org/abs/2012.05710v2
- Date: Mon, 29 Mar 2021 01:54:02 GMT
- Title: Look Before you Speak: Visually Contextualized Utterances
- Authors: Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
- Abstract summary: We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
- Score: 88.58909442073858
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While most conversational AI systems focus on textual dialogue only,
conditioning utterances on visual context (when it's available) can lead to
more realistic conversations. Unfortunately, a major challenge for
incorporating visual context into conversational dialogue is the lack of
large-scale labeled datasets. We provide a solution in the form of a new
visually conditioned Future Utterance Prediction task. Our task involves
predicting the next utterance in a video, using both visual frames and
transcribed speech as context. By exploiting the large number of instructional
videos online, we train a model to solve this task at scale, without the need
for manual annotations. Leveraging recent advances in multimodal learning, our
model consists of a novel co-attentional multimodal video transformer, and when
trained on both textual and visual context, outperforms baselines that use
textual inputs alone. Further, we demonstrate that our model trained for this
task on unlabelled videos achieves state-of-the-art performance on a number of
downstream VideoQA benchmarks such as MSRVTT-QA, MSVD-QA, ActivityNet-QA and
How2QA.
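To make the setup concrete, below is a minimal, hypothetical PyTorch sketch of a co-attentional fusion block in which transcribed-speech tokens attend to video frames and vice versa, and the fused context is used to score candidate next utterances. The dimensions, layer count, pooling, and dot-product scoring head are illustrative assumptions, not the architecture or training objective reported in the paper.

```python
# Hypothetical sketch of co-attentional fusion for future utterance prediction.
# Dimensions, depth, pooling, and the scoring head are assumptions for
# illustration, not the model described in the paper.
import torch
import torch.nn as nn


class CoAttentionBlock(nn.Module):
    """Text attends to video frames and video attends to text tokens."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.text_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, video: torch.Tensor):
        # Cross-attention in both directions, with residual connections.
        t, _ = self.text_to_video(text, video, video)
        v, _ = self.video_to_text(video, text, text)
        return self.norm_t(text + t), self.norm_v(video + v)


class FutureUtterancePredictor(nn.Module):
    """Scores candidate next utterances against fused video+speech context."""

    def __init__(self, dim: int = 512, layers: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList([CoAttentionBlock(dim) for _ in range(layers)])
        self.score = nn.Linear(dim, dim)

    def forward(self, text_ctx, video_ctx, candidate_utts):
        # text_ctx:       (B, T_text, D) embeddings of transcribed speech context
        # video_ctx:      (B, T_vid, D)  embeddings of sampled video frames
        # candidate_utts: (B, C, D)      pooled embeddings of candidate utterances
        for block in self.blocks:
            text_ctx, video_ctx = block(text_ctx, video_ctx)
        context = torch.cat([text_ctx, video_ctx], dim=1).mean(dim=1)  # (B, D)
        # Dot-product score between the fused context and each candidate.
        return torch.einsum("bd,bcd->bc", self.score(context), candidate_utts)


if __name__ == "__main__":
    model = FutureUtterancePredictor()
    scores = model(torch.randn(2, 16, 512),   # speech context tokens
                   torch.randn(2, 32, 512),   # frame features
                   torch.randn(2, 100, 512))  # candidate utterances
    print(scores.shape)  # torch.Size([2, 100])
```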
Related papers
- The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval [36.516226519328015]
Video-language tasks necessitate spatial and temporal comprehension and require significant compute.
This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval.
We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions.
arXiv Detail & Related papers (2024-06-26T06:59:09Z)
- Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties [13.938281516499119]
We implement Emergent In-context Learning on Videos (EILEV), a novel training paradigm that induces in-context learning over video and text.
Our results, analysis, and EILEV-trained models yield numerous insights about the emergence of in-context learning over video and text.
arXiv Detail & Related papers (2023-11-28T18:53:06Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality (CLIP image representations and the scaling of language models) do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- End-to-end Generative Pretraining for Multimodal Video Captioning [82.79187814057313]
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos.
Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly.
Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks.
arXiv Detail & Related papers (2022-01-20T16:16:21Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features (a minimal sketch of this idea appears below the list).
arXiv Detail & Related papers (2021-12-08T18:58:16Z)
- Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training [79.88705563918413]
We propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training.
VICTOR is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions.
arXiv Detail & Related papers (2021-04-19T15:58:45Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state of the art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
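The "Prompting Visual-Language Models for Efficient Video Understanding" entry above describes encoding temporal information with lightweight Transformers stacked on frame-wise visual features. Below is a minimal, hypothetical PyTorch sketch of that idea; the feature dimension, depth, maximum frame count, and mean pooling are illustrative assumptions rather than values from the paper.

```python
# Hypothetical sketch: aggregate per-frame features from a frozen image-text
# encoder with a small temporal Transformer. All hyperparameters are
# assumptions for illustration.
import torch
import torch.nn as nn


class TemporalAggregator(nn.Module):
    """Lightweight Transformer stacked on top of per-frame visual features."""

    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.pos = nn.Parameter(torch.zeros(1, 64, dim))  # up to 64 frames

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D), one feature vector per sampled frame,
        # e.g. from a frozen image-text encoder.
        x = frame_feats + self.pos[:, : frame_feats.size(1)]
        x = self.encoder(x)
        return x.mean(dim=1)  # (B, D) clip-level representation


if __name__ == "__main__":
    frames = torch.randn(2, 16, 512)            # 2 clips, 16 frames each
    video_embedding = TemporalAggregator()(frames)
    print(video_embedding.shape)                # torch.Size([2, 512])
```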
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.