Movie101: A New Movie Understanding Benchmark
- URL: http://arxiv.org/abs/2305.12140v2
- Date: Tue, 27 Jun 2023 11:42:44 GMT
- Title: Movie101: A New Movie Understanding Benchmark
- Authors: Zihao Yue, Qi Zhang, Anwen Hu, Liang Zhang, Ziheng Wang and Qin Jin
- Abstract summary: We construct a large-scale Chinese movie benchmark, named Movie101.
We propose a new metric called Movie Narration Score (MNScore) for movie narrating evaluation.
For both tasks, our proposed methods leverage external knowledge effectively and outperform carefully designed baselines.
- Score: 47.24519006577205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To help the visually impaired enjoy movies, automatic movie narrating systems
are expected to narrate accurate, coherent, and role-aware plots when there are
no speaking lines of actors. Existing works benchmark this challenge as a
normal video captioning task via some simplifications, such as removing role
names and evaluating narrations with n-gram-based metrics, which makes it
difficult for automatic systems to meet the needs of real application
scenarios. To narrow this gap, we construct a large-scale Chinese movie
benchmark, named Movie101. Closer to real scenarios, the Movie Clip Narrating
(MCN) task in our benchmark asks models to generate role-aware narration
paragraphs for complete movie clips where no actors are speaking. External
knowledge, such as role information and movie genres, is also provided for
better movie understanding. Besides, we propose a new metric called Movie
Narration Score (MNScore) for movie narrating evaluation, which achieves the
best correlation with human evaluation. Our benchmark also supports the
Temporal Narration Grounding (TNG) task to investigate clip localization given
text descriptions. For both tasks, our proposed methods leverage external
knowledge effectively and outperform carefully designed baselines. The dataset
and code are released at https://github.com/yuezih/Movie101.
Related papers
- Multilingual Synopses of Movie Narratives: A Dataset for Story Understanding [19.544839928488972]
We construct a large-scale multilingual video story dataset named Multilingual Synopses of Movie Narratives (M-SYMON).
M-SYMON contains 13,166 movie summary videos from 7 languages, as well as manual annotation of fine-grained video-text correspondences for 101.5 hours of video.
Training on the human-annotated data from SyMoN outperforms the SOTA methods by 15.7 and 16.2 percentage points on Clip Accuracy and Sentence IoU scores, respectively.
arXiv Detail & Related papers (2024-06-18T22:44:50Z)
- Movie101v2: Improved Movie Narration Benchmark [53.54176725112229]
We develop a large-scale, bilingual movie narration dataset, Movie101v2.
Taking into account the essential difficulties in achieving applicable movie narration, we break the long-term goal into three progressive stages.
Our findings reveal that achieving applicable movie narration generation is a fascinating goal that requires thorough research.
arXiv Detail & Related papers (2024-04-20T13:15:27Z)
- Select and Summarize: Scene Saliency for Movie Script Summarization [11.318175666743656]
We introduce a scene saliency dataset that consists of human-annotated salient scenes for 100 movies.
We propose a two-stage abstractive summarization approach which first identifies the salient scenes in the script and then generates a summary using only those scenes.
arXiv Detail & Related papers (2024-04-04T16:16:53Z)
- HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [77.02631712558251]
We propose to leverage the capability of large language models (LLMs) to obtain fine-grained video descriptions aligned with videos.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
Our evaluation shows that the resulting captions significantly improve performance on many different benchmark datasets for text-video retrieval.
arXiv Detail & Related papers (2023-10-07T19:32:55Z)
- Movie Genre Classification by Language Augmentation and Shot Sampling [20.119729119879466]
We propose a Movie genre Classification method based on Language augmentatIon and shot samPling (Movie-CLIP).
Movie-CLIP mainly consists of two parts: a language augmentation module to recognize language elements from the input audio, and a shot sampling module to select representative shots from the entire video.
We evaluate our method on the MovieNet and Condensed Movies datasets, achieving approximately 6-9% improvement in mean Average Precision (mAP) over the baselines.
arXiv Detail & Related papers (2022-03-24T18:15:12Z)
- Multilevel profiling of situation and dialogue-based deep networks for movie genre classification using movie trailers [7.904790547594697]
We propose a novel multi-modal movie genre classification framework based on situation, dialogue, and metadata.
We develop the English movie trailer dataset (EMTD), which contains 2000 Hollywood movie trailers belonging to five popular genres.
arXiv Detail & Related papers (2021-09-14T07:33:56Z)
- Movie Summarization via Sparse Graph Construction [65.16768855902268]
We propose a model that identifies turning point (TP) scenes by building a sparse movie graph that represents relations between scenes and is constructed using multimodal information.
According to human judges, the summaries created by our approach are more informative and complete, and receive higher ratings, than the outputs of sequence-based models and general-purpose summarization algorithms.
arXiv Detail & Related papers (2020-12-14T13:54:34Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Condensed Movies: Story Based Retrieval with Contextual Embeddings [83.73479493450009]
We create the Condensed Movies dataset (CMD) consisting of the key scenes from over 3K movies.
The dataset is scalable, obtained automatically from YouTube, and is freely available for anybody to download and use.
We provide a deep network baseline for text-to-video retrieval on our dataset, combining character, speech and visual cues into a single video embedding.
arXiv Detail & Related papers (2020-05-08T17:55:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.