Movie101v2: Improved Movie Narration Benchmark
- URL: http://arxiv.org/abs/2404.13370v1
- Date: Sat, 20 Apr 2024 13:15:27 GMT
- Title: Movie101v2: Improved Movie Narration Benchmark
- Authors: Zihao Yue, Yepeng Zhang, Ziheng Wang, Qin Jin
- Abstract summary: We develop a large-scale, bilingual movie narration dataset, Movie101v2.
Taking into account the essential difficulties in achieving applicable movie narration, we break the long-term goal into three progressive stages.
Our findings reveal that achieving applicable movie narration generation is a fascinating goal that requires thorough research.
- Score: 53.54176725112229
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic movie narration aims to create video-aligned plot descriptions to assist visually impaired audiences. It differs from standard video captioning in that it requires not only describing key visual details but also inferring the plots developed across multiple movie shots, thus posing unique and ongoing challenges. To advance the development of automatic movie narrating systems, we first revisit the limitations of existing datasets and develop a large-scale, bilingual movie narration dataset, Movie101v2. Second, taking into account the essential difficulties in achieving applicable movie narration, we break the long-term goal into three progressive stages and tentatively focus on the initial stages featuring understanding within individual clips. We also introduce a new narration assessment to align with our staged task goals. Third, using our new dataset, we baseline several leading large vision-language models, including GPT-4V, and conduct in-depth investigations into the challenges current models face for movie narration generation. Our findings reveal that achieving applicable movie narration generation is a fascinating goal that requires thorough research.
Related papers
- MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence [55.977597688114514]
MovieDreamer is a novel hierarchical framework that integrates the strengths of autoregressive models with diffusion-based rendering.
We present experiments across various movie genres, demonstrating that our approach achieves superior visual and narrative quality.
arXiv Detail & Related papers (2024-07-23T17:17:05Z)
- Select and Summarize: Scene Saliency for Movie Script Summarization
We introduce a scene saliency dataset that consists of human-annotated salient scenes for 100 movies.
We propose a two-stage abstractive summarization approach which first identifies the salient scenes in a script and then generates a summary using only those scenes.
arXiv Detail & Related papers (2024-04-04T16:16:53Z)
- MovieFactory: Automatic Movie Creation from Text using Large Generative Models for Language and Images [92.13079696503803]
We present MovieFactory, a framework to generate cinematic-picture (3072×1280), film-style (multi-scene), and multi-modality (sounding) movies.
Our approach empowers users to create captivating movies with smooth transitions using simple text inputs.
arXiv Detail & Related papers (2023-06-12T17:31:23Z)
- Movie101: A New Movie Understanding Benchmark [47.24519006577205]
We construct a large-scale Chinese movie benchmark, named Movie101.
We propose a new metric called Movie Narration Score (MNScore) for movie narrating evaluation.
For both tasks, our proposed methods effectively leverage external knowledge and outperform carefully designed baselines.
arXiv Detail & Related papers (2023-05-20T08:43:51Z)
- Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding [13.52545041750095]
We release a video-language story dataset, Synopses of Movie Narratives (SyMoN), containing 5,193 video summaries of popular movies and TV series with a total length of 869 hours.
SyMoN captures naturalistic storytelling videos made by human creators and intended for a human audience.
arXiv Detail & Related papers (2022-03-11T01:45:33Z)
- Movie Summarization via Sparse Graph Construction [65.16768855902268]
We propose a model that identifies TP scenes by building a sparse movie graph that represents relations between scenes and is constructed using multimodal information.
According to human judges, the summaries created by our approach are more informative and complete, and receive higher ratings, than the outputs of sequence-based models and general-purpose summarization algorithms.
arXiv Detail & Related papers (2020-12-14T13:54:34Z)
- Condensed Movies: Story Based Retrieval with Contextual Embeddings [83.73479493450009]
We create the Condensed Movies dataset (CMD) consisting of the key scenes from over 3K movies.
The dataset is scalable, obtained automatically from YouTube, and is freely available for anybody to download and use.
We provide a deep network baseline for text-to-video retrieval on our dataset, combining character, speech and visual cues into a single video embedding.
arXiv Detail & Related papers (2020-05-08T17:55:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.