Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding
- URL: http://arxiv.org/abs/2203.05711v4
- Date: Wed, 5 Apr 2023 02:09:02 GMT
- Title: Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding
- Authors: Yidan Sun, Qin Chao, Yangfeng Ji and Boyang Li
- Abstract summary: We release a video-language story dataset, Synopses of Movie Narratives (SyMoN), containing 5,193 video summaries of popular movies and TV series with a total length of 869 hours.
SyMoN captures naturalistic storytelling videos made by human creators and intended for a human audience.
- Score: 13.52545041750095
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite recent advances in AI, story understanding remains an open and
under-investigated problem. We collect, preprocess, and publicly release a
video-language story dataset, Synopses of Movie Narratives (SyMoN), containing
5,193 video summaries of popular movies and TV series with a total length of
869 hours. SyMoN captures naturalistic storytelling videos made by human
creators and intended for a human audience. As a prototypical and naturalistic
story dataset, SyMoN features high coverage of multimodal story events and
abundant mental-state descriptions. Its use of storytelling techniques causes
cross-domain semantic gaps that provide appropriate challenges to existing
models. We establish benchmarks on video-text retrieval and zero-shot alignment
on movie summary videos, which showcase the importance of in-domain data and
long-term memory in story understanding. With SyMoN, we hope to lay the
groundwork for progress in multimodal story understanding.
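The video-text retrieval benchmark mentioned in the abstract can be illustrated with a minimal sketch. Note that this is not the paper's actual model or evaluation code: the embeddings, dimensions, and setup below are hypothetical stand-ins showing the standard cosine-similarity Recall@k protocol commonly used for this task.

```python
import numpy as np

def recall_at_k(video_emb, text_emb, k=5):
    """Recall@k for text-to-video retrieval via cosine similarity.

    video_emb, text_emb: (N, D) arrays; row i of text_emb is assumed
    to be the ground-truth match for row i of video_emb.
    """
    # L2-normalize so the dot product equals cosine similarity
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = t @ v.T  # (N, N) text-to-video similarity matrix
    # For each text query, rank all videos by similarity (descending)
    ranks = np.argsort(-sims, axis=1)
    # A query is a hit if its ground-truth video (index i) is in the top k
    hits = (ranks[:, :k] == np.arange(len(t))[:, None]).any(axis=1)
    return hits.mean()

# Sanity check with hypothetical random embeddings: identical video and
# text embeddings should retrieve perfectly.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 64))
print(recall_at_k(emb, emb, k=1))  # → 1.0
```

In practice the two embedding matrices would come from separately trained video and text encoders; the ranking and Recall@k computation stay the same.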
Related papers
- MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence [55.977597688114514]
MovieDreamer is a novel hierarchical framework that integrates the strengths of autoregressive models with diffusion-based rendering.
We present experiments across various movie genres, demonstrating that our approach achieves superior visual and narrative quality.
arXiv Detail & Related papers (2024-07-23T17:17:05Z)
- MoPS: Modular Story Premise Synthesis for Open-Ended Automatic Story Generation [50.01780173691132]
We introduce Modular Story Premise Synthesis (MoPS)
MoPS breaks down story premises into modules like background and persona for automated design and generation.
Thorough evaluations demonstrate that our synthesized premises excel in diversity, fascination, completeness, and originality.
arXiv Detail & Related papers (2024-06-09T08:31:14Z)
- Movie101v2: Improved Movie Narration Benchmark [53.54176725112229]
We develop a large-scale, bilingual movie narration dataset, Movie101v2.
Taking into account the essential difficulties in achieving applicable movie narration, we break the long-term goal into three progressive stages.
Our findings reveal that applicable movie narration generation remains a challenging goal requiring further research.
arXiv Detail & Related papers (2024-04-20T13:15:27Z)
- VideoXum: Cross-modal Visual and Textural Summarization of Videos [54.0985975755278]
We propose a new joint video and text summarization task.
The goal is to generate both a shortened video clip along with the corresponding textual summary from a long video.
The generated shortened video clip and text narratives should be semantically well aligned.
arXiv Detail & Related papers (2023-03-21T17:51:23Z)
- Connecting Vision and Language with Video Localized Narratives [54.094554472715245]
We propose Video Localized Narratives, a new form of multimodal video annotations connecting vision and language.
In the original Localized Narratives, annotators speak and move their mouse simultaneously on an image, thus grounding each word with a mouse trace segment.
Our new protocol empowers annotators to tell the story of a video with Localized Narratives, capturing even complex events involving multiple actors interacting with each other and with several passive objects.
arXiv Detail & Related papers (2023-02-22T09:04:00Z)
- NarraSum: A Large-Scale Dataset for Abstractive Narrative Summarization [26.80378373420446]
NarraSum is a large-scale narrative summarization dataset.
It contains 122K narrative documents, which are collected from plot descriptions of movies and TV episodes with diverse genres, and their corresponding abstractive summaries.
Experiments show that there is a large performance gap between humans and the state-of-the-art summarization models on NarraSum.
arXiv Detail & Related papers (2022-12-02T22:51:51Z)
- TVRecap: A Dataset for Generating Stories with Character Descriptions [43.198875830024825]
TVRecap is a story generation dataset whose task is to generate detailed TV show episode recaps from a brief summary and documents describing the characters involved.
We create TVRecap from fan-contributed websites, which allows us to collect 26k episode recaps with 1868.7 tokens on average.
arXiv Detail & Related papers (2021-09-18T05:02:29Z)
- VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with subtitles aligned as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
arXiv Detail & Related papers (2020-03-25T20:39:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences of its use.