Connecting Vision and Language with Video Localized Narratives
- URL: http://arxiv.org/abs/2302.11217v1
- Date: Wed, 22 Feb 2023 09:04:00 GMT
- Title: Connecting Vision and Language with Video Localized Narratives
- Authors: Paul Voigtlaender, Soravit Changpinyo, Jordi Pont-Tuset, Radu Soricut, and Vittorio Ferrari
- Abstract summary: We propose Video Localized Narratives, a new form of multimodal video annotations connecting vision and language.
In the original Localized Narratives, annotators speak and move their mouse simultaneously on an image, thus grounding each word with a mouse trace segment.
Our new protocol empowers annotators to tell the story of a video with Localized Narratives, capturing even complex events involving multiple actors interacting with each other and with several passive objects.
- Score: 54.094554472715245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose Video Localized Narratives, a new form of multimodal video
annotations connecting vision and language. In the original Localized
Narratives, annotators speak and move their mouse simultaneously on an image,
thus grounding each word with a mouse trace segment. However, this is
challenging on a video. Our new protocol empowers annotators to tell the story
of a video with Localized Narratives, capturing even complex events involving
multiple actors interacting with each other and with several passive objects.
We annotated 20k videos of the OVIS, UVO, and Oops datasets, totalling 1.7M
words. Based on this data, we also construct new benchmarks for the video
narrative grounding and video question-answering tasks, and provide reference
results from strong baseline models. Our annotations are available at
https://google.github.io/video-localized-narratives/.
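
The released annotations ground each narrated word to a mouse trace segment drawn over specific video frames. As a minimal sketch of how such word-level groundings could be represented and queried, using illustrative field and class names rather than the actual released file format (documented at the URL above), consider:

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical, simplified representation of a Video Localized Narrative:
# each spoken word is grounded by a mouse trace segment, i.e. a sequence of
# (x, y) points drawn over particular video frames while the word was spoken.
# Field names are illustrative assumptions, not the released schema.

@dataclass
class TraceSegment:
    frame_index: int                   # video frame the trace was drawn on
    points: List[Tuple[float, float]]  # normalized (x, y) mouse positions

@dataclass
class GroundedWord:
    word: str
    start_time: float                  # seconds into the narration audio
    end_time: float
    traces: List[TraceSegment]

@dataclass
class VideoNarrative:
    video_id: str
    actor_name: str                    # the actor this part of the narrative describes
    words: List[GroundedWord]

def words_for_frame(narrative: VideoNarrative, frame_index: int) -> List[str]:
    """Return all narrated words whose trace touches the given frame."""
    return [
        gw.word
        for gw in narrative.words
        if any(seg.frame_index == frame_index for seg in gw.traces)
    ]
```

Given such a structure, words_for_frame would return every word the annotator spoke while tracing over a particular frame, which is the kind of word-to-location link the video narrative grounding benchmark builds on.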
Related papers
- Learning Video Context as Interleaved Multimodal Sequences [40.15446453928028]
MovieSeq is a multimodal language model developed to address the wide range of challenges in understanding video contexts.
Our core idea is to represent videos as interleaved multimodal sequences, either by linking external knowledge databases or using offline models.
To demonstrate its effectiveness, we validate MovieSeq's performance on six datasets.
arXiv Detail & Related papers (2024-07-31T17:23:57Z)
- Multilingual Synopses of Movie Narratives: A Dataset for Vision-Language Story Understanding [19.544839928488972]
We construct a large-scale multilingual video story dataset named Multilingual Synopses of Movie Narratives (M-SYMON).
M-SYMON contains 13,166 movie summary videos in 7 languages, as well as manual annotations of fine-grained video-text correspondences for 101.5 hours of video.
Training on the human-annotated data from SyMoN outperforms the SOTA methods by 15.7 and 16.2 percentage points on Clip Accuracy and Sentence IoU scores, respectively.
arXiv Detail & Related papers (2024-06-18T22:44:50Z)
- DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human-annotated dataset for evaluating the ability of visual-language models to generate descriptions for real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z)
- VideoXum: Cross-modal Visual and Textural Summarization of Videos [54.0985975755278]
We propose a new joint video and text summarization task.
The goal is to generate both a shortened video clip along with the corresponding textual summary from a long video.
The generated shortened video clip and text narratives should be semantically well aligned.
arXiv Detail & Related papers (2023-03-21T17:51:23Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- Narration Generation for Cartoon Videos [35.814965300322015]
We propose a new task, narration generation, which complements videos with narration texts to be interjected at several places.
We collect a new dataset from the animated television series Peppa Pig.
arXiv Detail & Related papers (2021-01-17T23:23:09Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with aligned subtitles as the premise, paired with a natural language hypothesis based on the video content, a model must infer whether the hypothesis is entailed or contradicted by the clip.
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
arXiv Detail & Related papers (2020-03-25T20:39:05Z)
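
As a minimal illustration of the inference setup the VIOLIN entry above describes, an example pairs a clip-plus-subtitles premise with a hypothesis and a binary entailment label. The field names below are illustrative assumptions, not the released VIOLIN schema:

```python
from dataclasses import dataclass

# Sketch of one video-and-language inference example: the premise is a video
# clip plus its aligned subtitles, the hypothesis is a natural-language
# statement, and the label says whether the hypothesis is entailed (True)
# or contradicted (False) by the clip. Field names are hypothetical.

@dataclass
class InferenceExample:
    video_clip_id: str
    subtitles: str      # subtitle text aligned with the clip (part of the premise)
    hypothesis: str     # statement to verify against the clip
    entailed: bool      # True = entailed, False = contradicted
```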