MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning
- URL: http://arxiv.org/abs/2306.02252v2
- Date: Wed, 14 Jun 2023 10:11:38 GMT
- Title: MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning
- Authors: Jianghui Wang, Yuxuan Wang, Dongyan Zhao, Zilong Zheng
- Abstract summary: MoviePuzzle is a novel challenge that targets visual narrative reasoning and holistic movie understanding.
The MoviePuzzle task amplifies the temporal feature learning and structure learning of video models.
Our approach outperforms existing state-of-the-art methods on the MoviePuzzle benchmark.
- Score: 54.73173491543553
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce MoviePuzzle, a novel challenge that targets visual narrative
reasoning and holistic movie understanding. Despite the notable progress that
has been witnessed in the realm of video understanding, most prior works fail
to present tasks and models to address holistic video understanding and the
innate visual narrative structures existing in long-form videos. To tackle this
quandary, we put forth the MoviePuzzle task, which amplifies the temporal feature
learning and structure learning of video models by reshuffling the shot, frame,
and clip layers of movie segments in the presence of video-dialogue
information. We start by establishing a carefully refined dataset based on
MovieNet by dissecting movies into hierarchical layers and randomly permuting
the orders. Besides benchmarking MoviePuzzle against prior art on movie
understanding, we devise a Hierarchical Contrastive Movie Clustering (HCMC)
model that considers the underlying structure and visual semantic orders for
movie reordering. Specifically, through a pairwise and contrastive learning
approach, we train models to predict the correct order of each layer. This
equips them with the knack for deciphering the visual narrative structure of
movies and handling the disorder lurking in video data. Experiments show that
our approach outperforms existing state-of-the-art methods on the MoviePuzzle
benchmark, underscoring its efficacy.
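The two ideas in the abstract, permuting a movie segment at several hierarchical levels and recovering an order from pairwise predictions, can be illustrated with a small sketch. This is not the authors' HCMC model or their dataset pipeline. The nested-list layout (segment > clip > shot > frame id), the function names, and the `before` score matrix are all hypothetical, and the decoding step is a simple win-count ranking chosen only for illustration.

```python
import random
from typing import List, Sequence, Tuple

# Assumed layout: a segment is a list of clips, a clip is a list of shots,
# and a shot is a list of frame identifiers.
Shot = List[str]
Clip = List[Shot]
Segment = List[Clip]


def shuffle_with_order(items: Sequence, rng: random.Random) -> Tuple[list, List[int]]:
    """Return shuffled items plus, for each new position, the original index."""
    perm = list(range(len(items)))
    rng.shuffle(perm)
    return [items[i] for i in perm], perm


def shuffle_hierarchy(segment: Segment, seed: int = 0) -> Tuple[Segment, List[int]]:
    """Permute frames within shots, shots within clips, and clips within the segment.

    Returns the shuffled segment and the permutation applied at the clip level.
    """
    rng = random.Random(seed)
    shuffled_clips = []
    for clip in segment:
        # Shuffle frames inside each shot, then shuffle the shots themselves.
        shots = [shuffle_with_order(shot, rng)[0] for shot in clip]
        shots, _ = shuffle_with_order(shots, rng)
        shuffled_clips.append(shots)
    return shuffle_with_order(shuffled_clips, rng)


def order_from_pairwise(before: List[List[float]]) -> List[int]:
    """Rank items by how strongly they are predicted to precede the others.

    before[i][j] is a hypothetical model score that item i comes before item j.
    """
    n = len(before)
    wins = [sum(before[i][j] for j in range(n) if j != i) for i in range(n)]
    return sorted(range(n), key=lambda i: -wins[i])


segment = [[["f0", "f1"], ["f2", "f3"]], [["f4", "f5"]]]
shuffled, clip_perm = shuffle_hierarchy(segment, seed=1)

# Pairwise scores consistent with the true order 2, 0, 1.
scores = [[0, 1, 0], [0, 0, 0], [1, 1, 0]]
recovered = order_from_pairwise(scores)
```

In a learned setting the `before` scores would come from a trained pairwise model; here they are hand-written so that the ranking step can be checked in isolation.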
Related papers
- DiscoGraMS: Enhancing Movie Screen-Play Summarization using Movie Character-Aware Discourse Graph [6.980991481207376]
We introduce DiscoGraMS, a novel resource that represents movie scripts as a movie character-aware discourse graph (CaD Graph).
The model aims to preserve all salient information, offering a more comprehensive and faithful representation of the screenplay's content.
arXiv Detail & Related papers (2024-10-18T17:56:11Z)
- MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence [62.72540590546812]
MovieDreamer is a novel hierarchical framework that integrates the strengths of autoregressive models with diffusion-based rendering.
We present experiments across various movie genres, demonstrating that our approach achieves superior visual and narrative quality.
arXiv Detail & Related papers (2024-07-23T17:17:05Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- Learning from Untrimmed Videos: Self-Supervised Video Representation Learning with Hierarchical Consistency [60.756222188023635]
We propose to learn representations by leveraging more abundant information in unsupervised videos.
HiCo not only generates stronger representations on untrimmed videos but also improves the representation quality when applied to trimmed videos.
arXiv Detail & Related papers (2022-04-06T18:04:54Z)
- Video as Conditional Graph Hierarchy for Multi-Granular Question Answering [80.94367625007352]
We argue that while video is presented in frame sequence, the visual elements are not sequential but rather hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
arXiv Detail & Related papers (2021-12-12T10:35:19Z)
- Highlight Timestamp Detection Model for Comedy Videos via Multimodal Sentiment Analysis [1.6181085766811525]
We propose a multimodal structure to obtain state-of-the-art performance in this field.
We select several benchmarks for multimodal video understanding and apply the most suitable model to find the best performance.
arXiv Detail & Related papers (2021-05-28T08:39:19Z)
- Movie Summarization via Sparse Graph Construction [65.16768855902268]
We propose a model that identifies TP scenes by building a sparse movie graph that represents relations between scenes and is constructed using multimodal information.
According to human judges, the summaries created by our approach are more informative and complete, and receive higher ratings, than the outputs of sequence-based models and general-purpose summarization algorithms.
arXiv Detail & Related papers (2020-12-14T13:54:34Z)