Condensed Movies: Story Based Retrieval with Contextual Embeddings
- URL: http://arxiv.org/abs/2005.04208v2
- Date: Thu, 22 Oct 2020 23:42:02 GMT
- Title: Condensed Movies: Story Based Retrieval with Contextual Embeddings
- Authors: Max Bain, Arsha Nagrani, Andrew Brown, Andrew Zisserman
- Abstract summary: We create the Condensed Movies dataset (CMD) consisting of the key scenes from over 3K movies.
The dataset is scalable, obtained automatically from YouTube, and is freely available for anybody to download and use.
We provide a deep network baseline for text-to-video retrieval on our dataset, combining character, speech and visual cues into a single video embedding.
- Score: 83.73479493450009
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Our objective in this work is long-range understanding of the narrative
structure of movies. Instead of considering the entire movie, we propose to
learn from the 'key scenes' of the movie, providing a condensed look at the
full storyline. To this end, we make the following three contributions: (i) We
create the Condensed Movies Dataset (CMD) consisting of the key scenes from
over 3K movies: each key scene is accompanied by a high level semantic
description of the scene, character face-tracks, and metadata about the movie.
The dataset is scalable, obtained automatically from YouTube, and is freely
available for anybody to download and use. It is also an order of magnitude
larger than existing movie datasets in the number of movies; (ii) We provide a
deep network baseline for text-to-video retrieval on our dataset, combining
character, speech and visual cues into a single video embedding; and finally
(iii) We demonstrate how the addition of context from other video clips
improves retrieval performance.
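
To make contributions (ii) and (iii) concrete, below is a minimal PyTorch sketch of this kind of retrieval baseline: per-cue features are projected into a shared space, fused with learned weights into a single video embedding, and scored against a text embedding by cosine similarity. The dimensions, the gating scheme, and the naive neighbour-averaging "context" step are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch of a multi-cue text-to-video retrieval baseline.
# Assumptions: precomputed character/speech/visual features per clip;
# dimensions and the gating scheme are illustrative, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiCueVideoEncoder(nn.Module):
    def __init__(self, char_dim=512, speech_dim=768, visual_dim=2048, embed_dim=256):
        super().__init__()
        # One projection per cue: character (face-tracks), speech, visual.
        self.proj = nn.ModuleDict({
            "char":   nn.Linear(char_dim, embed_dim),
            "speech": nn.Linear(speech_dim, embed_dim),
            "visual": nn.Linear(visual_dim, embed_dim),
        })
        # Learned scalar weights decide how much each cue contributes.
        self.gate = nn.Parameter(torch.zeros(3))

    def forward(self, char, speech, visual):
        cues = [self.proj["char"](char),
                self.proj["speech"](speech),
                self.proj["visual"](visual)]
        w = torch.softmax(self.gate, dim=0)
        video = sum(wi * F.normalize(c, dim=-1) for wi, c in zip(w, cues))
        return F.normalize(video, dim=-1)  # single fused video embedding

def retrieval_scores(text_emb, video_embs):
    # Cosine similarity between one text query and all candidate clips.
    return F.normalize(text_emb, dim=-1) @ video_embs.T

def contextualize(video_embs, i, window=1):
    # Naive stand-in for contribution (iii): mix a clip's embedding with
    # its neighbouring clips from the same movie.
    lo, hi = max(0, i - window), min(len(video_embs), i + window + 1)
    return F.normalize(video_embs[lo:hi].mean(dim=0), dim=-1)
```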
Related papers
- Movie101v2: Improved Movie Narration Benchmark [53.54176725112229]
We develop a large-scale, bilingual movie narration dataset, Movie101v2.
Taking into account the essential difficulties in achieving applicable movie narration, we break the long-term goal into three progressive stages.
Our findings reveal that achieving applicable movie narration generation is a fascinating goal that requires thorough research.
arXiv Detail & Related papers (2024-04-20T13:15:27Z)
- Select and Summarize: Scene Saliency for Movie Script Summarization [11.318175666743656]
We introduce a scene saliency dataset that consists of human-annotated salient scenes for 100 movies.
We propose a two-stage abstractive summarization approach which first identifies the salient scenes in script and then generates a summary using only those scenes.
arXiv Detail & Related papers (2024-04-04T16:16:53Z)
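
A hedged sketch of the two-stage pipeline just described, assuming per-scene saliency scores are already available (the paper learns them); the Hugging Face summarization pipeline is a placeholder for the paper's actual summarizer.

```python
# Two-stage script summarization sketch: select salient scenes, then
# summarize only those. Scorer and summarizer are placeholder assumptions.
from transformers import pipeline

def summarize_script(scenes, saliency_scores, k=10, max_len=200):
    # Stage 1: keep the k most salient scenes, preserving script order.
    top = sorted(range(len(scenes)),
                 key=lambda i: saliency_scores[i], reverse=True)[:k]
    salient = [scenes[i] for i in sorted(top)]
    # Stage 2: abstractive summary over the selected scenes only.
    summarizer = pipeline("summarization")
    return summarizer(" ".join(salient), max_length=max_len)[0]["summary_text"]
```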
- Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.
arXiv Detail & Related papers (2023-03-29T02:33:54Z)
- MovieCLIP: Visual Scene Recognition in Movies [38.90153620199725]
Existing visual scene datasets in movies have limited taxonomies and don't consider the visual scene transition within movie clips.
In this work, we address the problem of visual scene recognition in movies by first automatically curating a new and extensive movie-centric taxonomy.
Instead of manual annotations which can be expensive, we use CLIP to weakly label 1.12 million shots from 32K movie clips based on our proposed taxonomy.
arXiv Detail & Related papers (2022-10-20T07:38:56Z)
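
A minimal sketch of the CLIP weak-labeling step described above, assuming one keyframe per shot; the five-label taxonomy and the prompt template are toy stand-ins for the paper's movie-centric taxonomy.

```python
# Zero-shot weak labeling of shot keyframes with CLIP.
# The taxonomy and prompt template below are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

TAXONOMY = ["kitchen", "office", "street", "forest", "car interior"]  # toy labels

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def weak_label(keyframe: Image.Image) -> str:
    prompts = [f"a photo of a {c} scene" for c in TAXONOMY]
    inputs = processor(text=prompts, images=keyframe,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, num_labels)
    return TAXONOMY[logits.argmax(dim=-1).item()]
```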
- Movies2Scenes: Using Movie Metadata to Learn Scene Representation [8.708989357658501]
We propose a novel contrastive learning approach that uses movie metadata to learn a general-purpose scene representation.
Specifically, we use movie metadata to define a measure of movie similarity, and use it during contrastive learning to limit our search for positive scene-pairs.
Our learned scene representation consistently outperforms existing state-of-the-art methods on a diverse set of tasks evaluated using multiple benchmark datasets.
arXiv Detail & Related papers (2022-02-22T03:31:33Z)
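
A rough sketch of how metadata can gate positive pairs in contrastive learning, as the entry above describes; the shared-genre rule and the InfoNCE-style loss are illustrative assumptions, not the paper's actual similarity measure.

```python
# Metadata-constrained contrastive learning sketch: scenes may only be
# positives if their movies are "similar" under a metadata rule.
import torch
import torch.nn.functional as F

def movie_similar(meta_a, meta_b, min_shared_genres=2):
    # Toy similarity rule: movies sharing enough genres count as similar.
    return len(set(meta_a["genres"]) & set(meta_b["genres"])) >= min_shared_genres

def metadata_info_nce(scene_embs, movie_meta, temperature=0.07):
    """scene_embs: (N, d) scene embeddings; movie_meta: list of N metadata dicts."""
    z = F.normalize(scene_embs, dim=-1)
    sim = z @ z.T / temperature
    n = len(movie_meta)
    # Positive mask: scene pairs from metadata-similar movies (excluding self).
    pos = torch.tensor([[i != j and movie_similar(movie_meta[i], movie_meta[j])
                         for j in range(n)] for i in range(n)])
    diag = torch.eye(n, dtype=torch.bool)
    denom = torch.logsumexp(sim.masked_fill(diag, float("-inf")),
                            dim=1, keepdim=True)
    log_prob = sim - denom
    return -(log_prob * pos).sum() / pos.sum().clamp(min=1)
```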
- Movie Summarization via Sparse Graph Construction [65.16768855902268]
We propose a model that identifies turning point (TP) scenes by building a sparse movie graph that represents relations between scenes and is constructed using multimodal information.
According to human judges, the summaries created by our approach are more informative and complete, and receive higher ratings, than the outputs of sequence-based models and general-purpose summarization algorithms.
arXiv Detail & Related papers (2020-12-14T13:54:34Z)
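
A loose, similarity-only illustration of the sparse-graph idea above (the actual model learns the graph from multimodal features): connect each scene to its k most similar scenes and rank scenes by connectivity.

```python
# Sparse scene-graph sketch: top-k similarity edges, then a degree-style
# centrality as a crude proxy for a scene's narrative importance.
import numpy as np

def salient_scenes(scene_feats, k=3, top_n=5):
    """scene_feats: (N, d) array of per-scene features."""
    z = scene_feats / np.linalg.norm(scene_feats, axis=1, keepdims=True)
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)
    adj = np.zeros_like(sim)
    for i in range(len(sim)):
        nbrs = np.argsort(sim[i])[-k:]      # keep each scene's k strongest edges
        adj[i, nbrs] = sim[i, nbrs]
    centrality = adj.sum(axis=1) + adj.sum(axis=0)
    return np.argsort(centrality)[-top_n:][::-1]
```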
- QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z)
- A Local-to-Global Approach to Multi-modal Movie Scene Segmentation [95.34033481442353]
We build a large-scale video dataset MovieScenes, which contains 21K annotated scene segments from 150 movies.
We propose a local-to-global scene segmentation framework, which integrates multi-modal information across three levels, i.e. clip, segment, and movie.
Our experiments show that the proposed network is able to segment a movie into scenes with high accuracy, consistently outperforming previous methods.
arXiv Detail & Related papers (2020-04-06T13:58:08Z)
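
To make the local-to-global idea in the last entry above concrete, here is a simplified single-cue sketch: boundaries are scored locally from adjacent-clip dissimilarity, and the final cuts are chosen globally over the whole movie. The real framework fuses multi-modal information at clip, segment, and movie level; this stand-in uses one cue only.

```python
# Simplified local-to-global scene segmentation sketch (single cue).
import torch
import torch.nn.functional as F

def segment_movie(clip_embs, num_scenes=10):
    """clip_embs: (N, d) per-clip embeddings; returns scene-start clip indices."""
    z = F.normalize(clip_embs, dim=-1)
    # Local cue: dissimilarity between consecutive clips.
    boundary_score = 1 - (z[:-1] * z[1:]).sum(dim=-1)       # shape (N-1,)
    # Global decision: keep the strongest boundaries movie-wide.
    cuts = torch.topk(boundary_score, k=num_scenes - 1).indices
    return torch.sort(cuts).values + 1
```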
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.