Movie Genre Classification by Language Augmentation and Shot Sampling
- URL: http://arxiv.org/abs/2203.13281v2
- Date: Tue, 7 Nov 2023 19:29:12 GMT
- Title: Movie Genre Classification by Language Augmentation and Shot Sampling
- Authors: Zhongping Zhang, Yiwen Gu, Bryan A. Plummer, Xin Miao, Jiayi Liu,
Huayan Wang
- Abstract summary: We propose a Movie genre Classification method based on Language augmentatIon and shot samPling (Movie-CLIP).
Movie-CLIP mainly consists of two parts: a language augmentation module to recognize language elements from the input audio, and a shot sampling module to select representative shots from the entire video.
We evaluate our method on MovieNet and Condensed Movies datasets, achieving an approximate 6-9% improvement in mean Average Precision (mAP) over the baselines.
- Score: 20.119729119879466
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-based movie genre classification has garnered considerable attention
due to its various applications in recommendation systems. Prior work has
typically addressed this task by adapting models from traditional video
classification tasks, such as action recognition or event detection. However,
these models often neglect language elements (e.g., narrations or
conversations) present in videos, which can implicitly convey high-level
semantics of movie genres, like storylines or background context. Additionally,
existing approaches are primarily designed to encode the entire content of the
input video, leading to inefficiencies in predicting movie genres. Movie genre
prediction may require only a few shots to accurately determine the genres,
rendering a comprehensive understanding of the entire video unnecessary. To
address these challenges, we propose a Movie genre Classification method based
on Language augmentatIon and shot samPling (Movie-CLIP). Movie-CLIP mainly
consists of two parts: a language augmentation module to recognize language
elements from the input audio, and a shot sampling module to select
representative shots from the entire video. We evaluate our method on MovieNet
and Condensed Movies datasets, achieving an approximate 6-9% improvement in mean
Average Precision (mAP) over the baselines. We also generalize Movie-CLIP to
the scene boundary detection task, achieving a 1.1% improvement in Average
Precision (AP) over the state-of-the-art. We release our implementation at
github.com/Zhongping-Zhang/Movie-CLIP.
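For a concrete picture of the two-module design described in the abstract, the snippet below is a minimal, hypothetical PyTorch sketch, not the authors' released code (see the repository above for that). It assumes shot-level visual features and an audio-derived transcript embedding have already been extracted by CLIP-style encoders and an ASR model; the module names, feature dimension, genre count, and top-k value are illustrative assumptions. It also shows how mean Average Precision (mAP) over genre labels can be computed with scikit-learn.

```python
# Hypothetical sketch of a language-augmented, shot-sampling genre classifier.
# This is NOT the official Movie-CLIP implementation; feature extraction is
# replaced by precomputed (here: random) embeddings so the example runs alone.
import torch
import torch.nn as nn
from sklearn.metrics import average_precision_score

NUM_GENRES = 21      # size of the genre vocabulary (assumption)
FEAT_DIM = 512       # CLIP-style embedding dimension (assumption)
TOP_K_SHOTS = 8      # number of representative shots to keep (assumption)


class ShotSampler(nn.Module):
    """Scores every shot, keeps the top-k, and weights them by their softmaxed scores."""

    def __init__(self, dim: int, k: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)
        self.k = k

    def forward(self, shot_feats: torch.Tensor) -> torch.Tensor:
        # shot_feats: (batch, num_shots, dim)
        scores = self.scorer(shot_feats).squeeze(-1)              # (batch, num_shots)
        top = scores.topk(self.k, dim=1)                          # top-k scores and indices
        idx = top.indices.unsqueeze(-1).expand(-1, -1, shot_feats.size(-1))
        picked = shot_feats.gather(1, idx)                        # (batch, k, dim)
        weights = torch.softmax(top.values, dim=1).unsqueeze(-1)  # soft weights keep the scorer trainable
        return picked * weights                                   # (batch, k, dim)


class GenreClassifier(nn.Module):
    """Fuses pooled shot features with a transcript embedding for multi-label genres."""

    def __init__(self, dim: int, num_genres: int, k: int):
        super().__init__()
        self.sampler = ShotSampler(dim, k)
        self.head = nn.Linear(2 * dim, num_genres)

    def forward(self, shot_feats, transcript_feat):
        shots = self.sampler(shot_feats).sum(dim=1)               # weighted average of selected shots
        fused = torch.cat([shots, transcript_feat], dim=-1)       # (batch, 2*dim)
        return self.head(fused)                                   # logits: (batch, num_genres)


if __name__ == "__main__":
    batch, num_shots = 4, 32
    shot_feats = torch.randn(batch, num_shots, FEAT_DIM)     # stand-in for CLIP shot features
    transcript_feat = torch.randn(batch, FEAT_DIM)           # stand-in for ASR-text embedding
    labels = torch.zeros(batch, NUM_GENRES)                  # toy multi-label genre targets
    labels[::2] = 1.0                                        # every class gets positives and negatives

    model = GenreClassifier(FEAT_DIM, NUM_GENRES, TOP_K_SHOTS)
    logits = model(shot_feats, transcript_feat)
    loss = nn.BCEWithLogitsLoss()(logits, labels)            # multi-label training objective
    loss.backward()

    # Evaluation: mean Average Precision (mAP) over genre classes, macro-averaged.
    probs = torch.sigmoid(logits).detach().numpy()
    mAP = average_precision_score(labels.numpy(), probs, average="macro")
    print(f"loss={loss.item():.3f}  mAP={mAP:.3f}")
```

Here a small learned scorer stands in for the shot sampling step; a training-free alternative based on CLIP-similarity ranking is sketched after the related papers list below.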
Related papers
- Movie Trailer Genre Classification Using Multimodal Pretrained Features [1.1743167854433303]
We introduce a novel method for movie genre classification, capitalizing on a diverse set of readily accessible pretrained models.
Our approach utilizes all video and audio frames of movie trailers without performing any temporal pooling.
Our method outperforms state-of-the-art movie genre classification models in terms of precision, recall, and mean average precision (mAP).
arXiv Detail & Related papers (2024-10-11T15:38:05Z)
- Movie101v2: Improved Movie Narration Benchmark [53.54176725112229]
Automatic movie narration aims to generate video-aligned plot descriptions to assist visually impaired audiences.
We introduce Movie101v2, a large-scale, bilingual dataset with enhanced data quality specifically designed for movie narration.
Based on our new benchmark, we baseline a range of large vision-language models, including GPT-4V, and conduct an in-depth analysis of the challenges in narration generation.
arXiv Detail & Related papers (2024-04-20T13:15:27Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings (see the sketch after this list).
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- MovieFactory: Automatic Movie Creation from Text using Large Generative Models for Language and Images [92.13079696503803]
We present MovieFactory, a framework to generate cinematic-picture (3072×1280), film-style (multi-scene), and multi-modality (sounding) movies.
Our approach empowers users to create captivating movies with smooth transitions using simple text inputs.
arXiv Detail & Related papers (2023-06-12T17:31:23Z)
- Part-level Action Parsing via a Pose-guided Coarse-to-Fine Framework [108.70949305791201]
Part-level Action Parsing (PAP) aims to not only predict the video-level action but also recognize the frame-level fine-grained actions or interactions of body parts for each person in the video.
In particular, our framework first predicts the video-level class of the input video, then localizes the body parts and predicts the part-level action.
Our framework achieves state-of-the-art performance, outperforming existing methods by over 31.10% in ROC score.
arXiv Detail & Related papers (2022-03-09T01:30:57Z)
- Multilevel profiling of situation and dialogue-based deep networks for movie genre classification using movie trailers [7.904790547594697]
We propose a novel multi-modal movie genre classification framework based on situation, dialogue, and metadata.
We develop the English movie trailer dataset (EMTD), which contains 2000 Hollywood movie trailers belonging to five popular genres.
arXiv Detail & Related papers (2021-09-14T07:33:56Z)
- CLIP-It! Language-Guided Video Summarization [96.69415453447166]
This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization.
We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another.
Our model can be extended to the unsupervised setting by training without ground-truth supervision.
arXiv Detail & Related papers (2021-07-01T17:59:27Z)
- Video Moment Localization using Object Evidence and Reverse Captioning [1.1549572298362785]
We address the problem of language-based temporal localization of moments in untrimmed videos.
The current state-of-the-art model, MAC, addresses it by mining activity concepts from both video and language modalities.
We propose the "Multi-faceted Video Moment Localizer" (MML), which extends the MAC model by introducing visual object evidence.
arXiv Detail & Related papers (2020-06-18T03:45:49Z)
- A Local-to-Global Approach to Multi-modal Movie Scene Segmentation [95.34033481442353]
We build a large-scale video dataset MovieScenes, which contains 21K annotated scene segments from 150 movies.
We propose a local-to-global scene segmentation framework, which integrates multi-modal information across three levels, i.e. clip, segment, and movie.
Our experiments show that the proposed network is able to segment a movie into scenes with high accuracy, consistently outperforming previous methods.
arXiv Detail & Related papers (2020-04-06T13:58:08Z)
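The shot and frame sampling ideas that recur in the entries above (e.g., the CLIP-score-ranked sampling in the VaQuitA entry) can be illustrated with a simple ranking step: score each frame by its similarity to a text embedding and keep the top-k. The sketch below is a hypothetical, self-contained example, not taken from either paper's code; random unit-norm vectors stand in for real CLIP image and text features.

```python
# Hypothetical illustration of CLIP-score-guided frame selection.
# Real CLIP encoders are replaced by random embeddings so the snippet runs
# without any model weights.
import torch
import torch.nn.functional as F


def select_frames_by_clip_score(frame_embs: torch.Tensor,
                                text_emb: torch.Tensor,
                                k: int) -> torch.Tensor:
    """Return indices of the k frames most similar to the text embedding."""
    frame_embs = F.normalize(frame_embs, dim=-1)    # (num_frames, dim)
    text_emb = F.normalize(text_emb, dim=-1)        # (dim,)
    scores = frame_embs @ text_emb                  # cosine similarity per frame
    return scores.topk(k).indices                   # indices of the top-k frames


if __name__ == "__main__":
    num_frames, dim, k = 120, 512, 8
    frame_embs = torch.randn(num_frames, dim)       # stand-in for CLIP image features
    text_emb = torch.randn(dim)                     # stand-in for a CLIP text query feature
    keep = select_frames_by_clip_score(frame_embs, text_emb, k)
    print("selected frame indices:", sorted(keep.tolist()))
```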