Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding
- URL: http://arxiv.org/abs/2203.05711v4
- Date: Wed, 5 Apr 2023 02:09:02 GMT
- Title: Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding
- Authors: Yidan Sun, Qin Chao, Yangfeng Ji and Boyang Li
- Abstract summary: We release a video-language story dataset, Synopses of Movie Narratives (SyMoN), containing 5,193 video summaries of popular movies and TV series with a total length of 869 hours.
SyMoN captures naturalistic storytelling videos made by human creators and intended for a human audience.
- Score: 13.52545041750095
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite recent advances in AI, story understanding remains an open and
under-investigated problem. We collect, preprocess, and publicly release a
video-language story dataset, Synopses of Movie Narratives (SyMoN), containing
5,193 video summaries of popular movies and TV series with a total length of
869 hours. SyMoN captures naturalistic storytelling videos made by human
creators and intended for a human audience. As a prototypical and naturalistic
story dataset, SyMoN features high coverage of multimodal story events and
abundant mental-state descriptions. Its use of storytelling techniques causes
cross-domain semantic gaps that provide appropriate challenges to existing
models. We establish benchmarks on video-text retrieval and zero-shot alignment
on movie summary videos, which showcase the importance of in-domain data and
long-term memory in story understanding. With SyMoN, we hope to lay the
groundwork for progress in multimodal story understanding.
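The video-text retrieval benchmark mentioned in the abstract can be illustrated with a minimal sketch. Note that this is not the paper's actual model or evaluation code: the embeddings, dimensions, and setup below are hypothetical stand-ins showing the standard cosine-similarity Recall@k protocol commonly used for this task.

```python
import numpy as np

def recall_at_k(video_emb, text_emb, k=5):
    """Recall@k for text-to-video retrieval via cosine similarity.

    video_emb, text_emb: (N, D) arrays; row i of text_emb is assumed
    to be the ground-truth match for row i of video_emb.
    """
    # L2-normalize so the dot product equals cosine similarity
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = t @ v.T  # (N, N) text-to-video similarity matrix
    # For each text query, rank all videos by similarity (descending)
    ranks = np.argsort(-sims, axis=1)
    # A query is a hit if its ground-truth video (index i) is in the top k
    hits = (ranks[:, :k] == np.arange(len(t))[:, None]).any(axis=1)
    return hits.mean()

# Sanity check with hypothetical random embeddings: identical video and
# text embeddings should retrieve perfectly.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 64))
print(recall_at_k(emb, emb, k=1))  # → 1.0
```

In practice the two embedding matrices would come from separately trained video and text encoders; the ranking and Recall@k computation stay the same.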
Related papers
- MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence [55.977597688114514]
MovieDreamer is a novel hierarchical framework that integrates the strengths of autoregressive models with diffusion-based rendering.
We present experiments across various movie genres, demonstrating that our approach achieves superior visual and narrative quality.
arXiv Detail & Related papers (2024-07-23T17:17:05Z)
- MoPS: Modular Story Premise Synthesis for Open-Ended Automatic Story Generation [50.01780173691132]
We introduce Modular Story Premise Synthesis (MoPS)
MoPS breaks down story premises into modules like background and persona for automated design and generation.
Thorough evaluations demonstrate that our synthesized premises excel in diversity, fascination, completeness, and originality.
arXiv Detail & Related papers (2024-06-09T08:31:14Z)
- Movie101v2: Improved Movie Narration Benchmark [53.54176725112229]
We develop a large-scale, bilingual movie narration dataset, Movie101v2.
Taking into account the essential difficulties in achieving applicable movie narration, we break the long-term goal into three progressive stages.
Our findings reveal that applicable movie narration generation remains a challenging goal requiring further research.
arXiv Detail & Related papers (2024-04-20T13:15:27Z)
- VideoXum: Cross-modal Visual and Textural Summarization of Videos [54.0985975755278]
We propose a new joint video and text summarization task.
The goal is to generate both a shortened video clip along with the corresponding textual summary from a long video.
The generated shortened video clip and text narratives should be semantically well aligned.
arXiv Detail & Related papers (2023-03-21T17:51:23Z)
- Connecting Vision and Language with Video Localized Narratives [54.094554472715245]
We propose Video Localized Narratives, a new form of multimodal video annotations connecting vision and language.
In the original Localized Narratives, annotators speak and move their mouse simultaneously on an image, thus grounding each word with a mouse trace segment.
Our new protocol empowers annotators to tell the story of a video with Localized Narratives, capturing even complex events involving multiple actors interacting with each other and with several passive objects.
arXiv Detail & Related papers (2023-02-22T09:04:00Z)
- NarraSum: A Large-Scale Dataset for Abstractive Narrative Summarization [26.80378373420446]
NarraSum is a large-scale narrative summarization dataset.
It contains 122K narrative documents, which are collected from plot descriptions of movies and TV episodes with diverse genres, and their corresponding abstractive summaries.
Experiments show that there is a large performance gap between humans and the state-of-the-art summarization models on NarraSum.
arXiv Detail & Related papers (2022-12-02T22:51:51Z)
- TVRecap: A Dataset for Generating Stories with Character Descriptions [43.198875830024825]
TVRecap is a story generation dataset whose task is to generate detailed TV show episode recaps from a brief summary and documents describing the characters involved.
We create TVRecap from fan-contributed websites, which allows us to collect 26k episode recaps with 1868.7 tokens on average.
arXiv Detail & Related papers (2021-09-18T05:02:29Z)
- VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with subtitles aligned as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
arXiv Detail & Related papers (2020-03-25T20:39:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences of its use.