A Local-to-Global Approach to Multi-modal Movie Scene Segmentation
- URL: http://arxiv.org/abs/2004.02678v3
- Date: Tue, 28 Apr 2020 14:30:05 GMT
- Title: A Local-to-Global Approach to Multi-modal Movie Scene Segmentation
- Authors: Anyi Rao, Linning Xu, Yu Xiong, Guodong Xu, Qingqiu Huang, Bolei Zhou,
Dahua Lin
- Abstract summary: We build a large-scale video dataset MovieScenes, which contains 21K annotated scene segments from 150 movies.
We propose a local-to-global scene segmentation framework, which integrates multi-modal information across three levels, i.e. clip, segment, and movie.
Our experiments show that the proposed network is able to segment a movie into scenes with high accuracy, consistently outperforming previous methods.
- Score: 95.34033481442353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene, as the crucial unit of storytelling in movies, contains complex
activities of actors and their interactions in a physical environment.
Identifying the composition of scenes serves as a critical step towards
semantic understanding of movies. This is very challenging -- compared to the
videos studied in conventional vision problems, e.g. action recognition, as
scenes in movies usually contain much richer temporal structures and more
complex semantic information. Towards this goal, we scale up the scene
segmentation task by building a large-scale video dataset MovieScenes, which
contains 21K annotated scene segments from 150 movies. We further propose a
local-to-global scene segmentation framework, which integrates multi-modal
information across three levels, i.e. clip, segment, and movie. This framework
is able to distill complex semantics from hierarchical temporal structures over
a long movie, providing top-down guidance for scene segmentation. Our
experiments show that the proposed network is able to segment a movie into
scenes with high accuracy, consistently outperforming previous methods. We also
found that pretraining on our MovieScenes can bring significant improvements to
the existing approaches.
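For intuition, the sketch below illustrates how multi-modal clip features might be aggregated across the clip, segment, and movie levels to score scene boundaries. It is a minimal sketch in PyTorch: the module names, feature dimensions, fusion by summation, and the boundary head are illustrative assumptions, not the paper's actual LGSS architecture.

```python
# Minimal sketch of local-to-global scene-boundary scoring (assumptions:
# PyTorch, toy dimensions, sum-fusion). Not the paper's actual model.
import torch
import torch.nn as nn


class LocalToGlobalSegmenter(nn.Module):
    def __init__(self, modality_dims=(2048, 512, 512), hidden=256):
        super().__init__()
        # Clip level: project each modality (e.g. visual, audio, actor cues)
        # to a shared space and fuse by summation.
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in modality_dims])
        # Segment level: a BiLSTM captures local temporal context around
        # each candidate boundary.
        self.segment_ctx = nn.LSTM(hidden, hidden, batch_first=True,
                                   bidirectional=True)
        # Movie level: a global average provides top-down guidance.
        self.boundary_head = nn.Linear(2 * hidden + hidden, 1)

    def forward(self, modality_feats):
        # modality_feats: list of tensors, each of shape (num_clips, dim_m)
        clip = torch.stack([p(f) for p, f in zip(self.proj, modality_feats)]).sum(0)
        seg, _ = self.segment_ctx(clip.unsqueeze(0))        # (1, N, 2*hidden)
        movie = clip.mean(dim=0, keepdim=True).expand(clip.size(0), -1)
        logits = self.boundary_head(torch.cat([seg.squeeze(0), movie], dim=-1))
        return logits.squeeze(-1)                           # per-clip boundary scores


if __name__ == "__main__":
    feats = [torch.randn(32, d) for d in (2048, 512, 512)]  # 32 clips, 3 modalities
    print(LocalToGlobalSegmenter()(feats).shape)            # torch.Size([32])
```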
Related papers
- Select and Summarize: Scene Saliency for Movie Script Summarization [11.318175666743656]
We introduce a scene saliency dataset that consists of human-annotated salient scenes for 100 movies.
We propose a two-stage abstractive summarization approach that first identifies the salient scenes in a script and then generates a summary using only those scenes.
arXiv Detail & Related papers (2024-04-04T16:16:53Z)
- Multi-Modal Domain Adaptation Across Video Scenes for Temporal Video Grounding [59.599378814835205]
Temporal Video Grounding (TVG) aims to localize the temporal boundary of a specific segment in an untrimmed video based on a given language query.
We introduce a novel AMDA method to adaptively adjust the model's scene-related knowledge by incorporating insights from the target data.
arXiv Detail & Related papers (2023-12-21T07:49:27Z)
- MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning [54.73173491543553]
MoviePuzzle is a novel challenge that targets visual narrative reasoning and holistic movie understanding.
To tackle this challenge, we put forth the MoviePuzzle task, which amplifies the temporal feature learning and structure learning of video models.
Our approach outperforms existing state-of-the-art methods on the MoviePuzzle benchmark.
arXiv Detail & Related papers (2023-06-04T03:51:54Z)
- Scene Consistency Representation Learning for Video Scene Segmentation [26.790491577584366]
We propose an effective Self-Supervised Learning (SSL) framework to learn better shot representations from long-term videos.
We present an SSL scheme to achieve scene consistency, and explore a range of data augmentation and shuffling methods to boost model generalizability.
Our method achieves state-of-the-art performance on the task of Video Scene Segmentation.
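As a rough, non-authoritative illustration of the scene-consistency idea summarized above, the sketch below treats temporally adjacent shots as positives in an InfoNCE-style objective and uses local shuffling as one augmentation; the actual positive-pair definition, augmentations, and loss in the paper may differ.

```python
# Illustrative sketch only (not the authors' code): adjacent shots act as
# scene-consistent positives; local shuffling is one assumed augmentation.
import torch
import torch.nn.functional as F


def shuffle_within_window(shots, window=4):
    """Augmentation: permute shots inside small local windows."""
    out = shots.clone()
    for start in range(0, shots.size(0), window):
        idx = torch.randperm(min(window, shots.size(0) - start)) + start
        out[start:start + idx.numel()] = shots[idx]
    return out


def scene_consistency_loss(shot_emb, temperature=0.1):
    """InfoNCE-style loss pulling each shot toward its temporal neighbour."""
    emb = F.normalize(shot_emb, dim=-1)                     # (N, D)
    sim = emb @ emb.t() / temperature                       # (N, N)
    mask = torch.eye(emb.size(0), dtype=torch.bool)
    sim = sim.masked_fill(mask, float('-inf'))              # exclude self-pairs
    targets = torch.arange(emb.size(0))
    targets = torch.where(targets + 1 < emb.size(0), targets + 1, targets - 1)
    return F.cross_entropy(sim, targets)                    # neighbour = positive


if __name__ == "__main__":
    shots = torch.randn(16, 128)                            # 16 shot embeddings
    print(float(scene_consistency_loss(shuffle_within_window(shots))))
```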
arXiv Detail & Related papers (2022-05-11T13:31:15Z)
- Movies2Scenes: Using Movie Metadata to Learn Scene Representation [8.708989357658501]
We propose a novel contrastive learning approach that uses movie metadata to learn a general-purpose scene representation.
Specifically, we use movie metadata to define a measure of movie similarity, and use it during contrastive learning to limit our search for positive scene-pairs.
Our learned scene representation consistently outperforms existing state-of-the-art methods on a diverse set of tasks evaluated using multiple benchmark datasets.
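A hedged sketch of that idea follows: movie metadata (here, a hypothetical set of tags such as genres) defines a similarity measure, and positive scene pairs are sampled only from sufficiently similar movies before applying a standard InfoNCE loss. The similarity measure, threshold, and loss details are assumptions, not the paper's implementation.

```python
# Illustrative sketch only: metadata-constrained positive sampling for
# contrastive scene-representation learning. Details are assumptions.
import random
import torch
import torch.nn.functional as F


def movie_similarity(meta_a, meta_b):
    """Assumed similarity: Jaccard overlap of metadata tags (e.g. genres)."""
    a, b = set(meta_a), set(meta_b)
    return len(a & b) / max(len(a | b), 1)


def sample_positive(scene_count, metadata, anchor_idx, threshold=0.5):
    """Pick a positive scene only from movies with similar-enough metadata."""
    candidates = [i for i in range(scene_count)
                  if i != anchor_idx
                  and movie_similarity(metadata[anchor_idx], metadata[i]) >= threshold]
    return random.choice(candidates) if candidates else anchor_idx


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Standard InfoNCE over one positive and a list of negative embeddings."""
    anchor = F.normalize(anchor, dim=-1)
    cands = F.normalize(torch.stack([positive] + negatives), dim=-1)
    logits = (cands @ anchor).unsqueeze(0) / temperature     # (1, 1 + num_neg)
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))


if __name__ == "__main__":
    metadata = [{"drama", "crime"}, {"drama"}, {"comedy"}]   # toy per-scene movie tags
    print(sample_positive(3, metadata, anchor_idx=0))        # picks scene 1
```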
arXiv Detail & Related papers (2022-02-22T03:31:33Z)
- A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning based approaches have been dedicated to video segmentation and delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z)
- A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus [31.387948069111893]
We show how to identify a short segment in a long video that semantically matches a text query.
To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-trimmed frame level.
We conduct extensive experiments to evaluate our model on moment localization in video corpus on ActivityNet Captions and TVR datasets.
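To make the coarse-to-fine idea concrete, here is a minimal sketch of matching a text query against a video at both a clip level and a frame level; the encoders, pooling, and scoring rule are illustrative assumptions rather than the HAMMER model itself.

```python
# Illustrative sketch only: score frames against a query by combining
# coarse clip-level and fine frame-level similarity. Not the HAMMER model.
import torch
import torch.nn as nn


class HierarchicalMatcher(nn.Module):
    def __init__(self, dim=256, frames_per_clip=16):
        super().__init__()
        self.frames_per_clip = frames_per_clip
        self.frame_enc = nn.GRU(dim, dim, batch_first=True)   # fine level
        self.clip_enc = nn.GRU(dim, dim, batch_first=True)    # coarse level

    def forward(self, frame_feats, query_emb):
        # frame_feats: (num_frames, dim), num_frames a multiple of frames_per_clip
        # query_emb: (dim,) text-query embedding
        f, _ = self.frame_enc(frame_feats.unsqueeze(0))
        f = f.squeeze(0)                                       # (num_frames, dim)
        clips = f.view(-1, self.frames_per_clip, f.size(-1)).mean(dim=1)
        c, _ = self.clip_enc(clips.unsqueeze(0))
        c = c.squeeze(0)                                       # (num_clips, dim)
        clip_score = (c @ query_emb).repeat_interleave(self.frames_per_clip)
        frame_score = f @ query_emb
        return clip_score + frame_score                        # (num_frames,)


if __name__ == "__main__":
    scores = HierarchicalMatcher()(torch.randn(64, 256), torch.randn(256))
    print(int(scores.argmax()))   # frame best matching the query
```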
arXiv Detail & Related papers (2020-11-18T02:42:36Z)
- Toward Accurate Person-level Action Recognition in Videos of Crowded Scenes [131.9067467127761]
We focus on improving action recognition by fully utilizing scene information and collecting new data.
Specifically, we adopt a strong human detector to detect the spatial location of each person in each frame.
We then apply action recognition models to learn the temporal information from video frames on both the HIE dataset and new data with diverse scenes collected from the internet.
arXiv Detail & Related papers (2020-10-16T13:08:50Z)
- Condensed Movies: Story Based Retrieval with Contextual Embeddings [83.73479493450009]
We create the Condensed Movies dataset (CMD) consisting of the key scenes from over 3K movies.
The dataset is scalable, obtained automatically from YouTube, and is freely available for anybody to download and use.
We provide a deep network baseline for text-to-video retrieval on our dataset, combining character, speech and visual cues into a single video embedding.
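As a loose illustration of such a baseline (the fusion scheme and similarity here are assumptions, not the paper's exact model), per-cue embeddings for character, speech, and visual streams can be combined with learned weights into one video embedding and ranked against a text-query embedding:

```python
# Illustrative sketch only: weighted fusion of character, speech, and visual
# cue embeddings into a single video embedding for text-to-video retrieval.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiCueVideoEmbedder(nn.Module):
    def __init__(self, num_cues=3):
        super().__init__()
        # One learnable weight per cue (character, speech, visual).
        self.cue_weights = nn.Parameter(torch.ones(num_cues))

    def forward(self, cue_embs):
        # cue_embs: (num_cues, dim) for one video
        w = torch.softmax(self.cue_weights, dim=0)
        return F.normalize((w.unsqueeze(-1) * cue_embs).sum(dim=0), dim=-1)


def retrieve(text_emb, per_video_cues, embedder):
    """Rank candidate videos by cosine similarity to the text embedding."""
    videos = torch.stack([embedder(c) for c in per_video_cues])   # (V, dim)
    scores = videos @ F.normalize(text_emb, dim=-1)
    return scores.argsort(descending=True)


if __name__ == "__main__":
    embedder = MultiCueVideoEmbedder()
    cues = [torch.randn(3, 256) for _ in range(5)]     # 5 candidate videos
    print(retrieve(torch.randn(256), cues, embedder))  # ranked video indices
```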
arXiv Detail & Related papers (2020-05-08T17:55:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.