Movie Trailer Genre Classification Using Multimodal Pretrained Features
- URL: http://arxiv.org/abs/2410.19760v1
- Date: Fri, 11 Oct 2024 15:38:05 GMT
- Title: Movie Trailer Genre Classification Using Multimodal Pretrained Features
- Authors: Serkan Sulun, Paula Viana, Matthew E. P. Davies
- Abstract summary: We introduce a novel method for movie genre classification, capitalizing on a diverse set of readily accessible pretrained models.
Our approach utilizes all video and audio frames of movie trailers without performing any temporal pooling.
Our method outperforms state-of-the-art movie genre classification models in terms of precision, recall, and mean average precision (mAP).
- Abstract: We introduce a novel method for movie genre classification, capitalizing on a diverse set of readily accessible pretrained models. These models extract high-level features related to visual scenery, objects, characters, text, speech, music, and audio effects. To intelligently fuse these pretrained features, we train small classifier models with low time and memory requirements. Employing the transformer model, our approach utilizes all video and audio frames of movie trailers without performing any temporal pooling, efficiently exploiting the correspondence between all elements, as opposed to the fixed and low number of frames typically used by traditional methods. Our approach fuses features originating from different tasks and modalities, with different dimensionalities, different temporal lengths, and complex dependencies as opposed to current approaches. Our method outperforms state-of-the-art movie genre classification models in terms of precision, recall, and mean average precision (mAP). To foster future research, we make the pretrained features for the entire MovieNet dataset, along with our genre classification code and the trained models, publicly available.
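The fusion approach described in the abstract can be sketched as follows. This is an illustrative numpy sketch, not the authors' released code: the weights are random stand-ins for trained parameters, and the single attention layer, hidden dimension, and CLS-token readout are assumptions. The key idea it illustrates is that per-modality features with different temporal lengths and dimensionalities are each projected to a common width, concatenated along the time axis without temporal pooling, and attended over jointly.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, d_model):
    # Single-head scaled dot-product self-attention; the projection
    # matrices are random, standing in for trained parameters.
    Wq, Wk, Wv = (rng.normal(0, d_model**-0.5, (d_model, d_model)) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_model))
    return attn @ v

def fuse_and_classify(features, d_model=64, n_genres=10):
    """features: dict mapping modality name -> (T_i, D_i) array of
    pretrained frame-level features with arbitrary lengths and dims."""
    tokens = []
    for name, f in sorted(features.items()):
        # Per-modality linear projection to the shared width, plus a
        # learned-in-practice modality embedding (random here).
        W = rng.normal(0, f.shape[1]**-0.5, (f.shape[1], d_model))
        mod_emb = rng.normal(0, 0.02, (d_model,))
        tokens.append(f @ W + mod_emb)
    cls = rng.normal(0, 0.02, (1, d_model))       # classification token (assumed readout)
    x = np.concatenate([cls] + tokens, axis=0)    # all frames kept: no temporal pooling
    h = self_attention(x, d_model)
    Wo = rng.normal(0, d_model**-0.5, (d_model, n_genres))
    # Sigmoid over the CLS position: independent per-genre scores,
    # since genre classification is multi-label.
    return 1.0 / (1.0 + np.exp(-(h[0] @ Wo)))

# Toy features with mismatched lengths/dims, e.g. video frames (120 x 512),
# audio frames (400 x 128), and text/speech embeddings (20 x 768).
feats = {"video": rng.normal(size=(120, 512)),
         "audio": rng.normal(size=(400, 128)),
         "text":  rng.normal(size=(20, 768))}
scores = fuse_and_classify(feats)
print(scores.shape)  # (10,)
```

Because every frame token from every modality enters the same attention operation, the model can exploit correspondences across the full sequences, in contrast to methods that pool or subsample to a fixed frame count before fusion.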
Related papers
- Movie Recommendation with Poster Attention via Multi-modal Transformer Feature Fusion [4.228539709089597]
This study proposes a multi-modal movie recommendation system that extracts features from the well-designed poster of each movie.
The efficiency of the proof-of-concept model is verified on the standard MovieLens 100K and 1M benchmark datasets.
arXiv Detail & Related papers (2024-07-12T10:44:51Z) - Purposer: Putting Human Motion Generation in Context [30.706219830149504]
We present a novel method to generate human motion to populate 3D indoor scenes.
It can be controlled with various combinations of conditioning signals such as a path in a scene, target poses, past motions, and scenes represented as 3D point clouds.
arXiv Detail & Related papers (2024-04-19T15:16:04Z) - UnLoc: A Unified Framework for Video Localization Tasks [82.59118972890262]
UnLoc is a new approach for temporal localization in untrimmed videos.
It uses pretrained image and text towers, and feeds tokens to a video-text fusion model.
We achieve state-of-the-art results on three different localization tasks with a unified approach.
arXiv Detail & Related papers (2023-08-21T22:15:20Z) - Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z) - Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z) - Movie Genre Classification by Language Augmentation and Shot Sampling [20.119729119879466]
We propose a Movie genre Classification method based on Language augmentatIon and shot samPling (Movie-CLIP)
Movie-CLIP mainly consists of two parts: a language augmentation module to recognize language elements from the input audio, and a shot sampling module to select representative shots from the entire video.
We evaluate our method on the MovieNet and Condensed Movies datasets, achieving an approximate 6-9% improvement in mean Average Precision (mAP) over the baselines.
arXiv Detail & Related papers (2022-03-24T18:15:12Z) - Leveraging Local Temporal Information for Multimodal Scene Classification [9.548744259567837]
Video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively.
Transformer models with self-attention, which are designed to obtain contextualized representations for individual tokens given a sequence of tokens, are becoming increasingly popular in many computer vision tasks.
We propose a novel self-attention block that leverages both local and global temporal relationships between the video frames to obtain better contextualized representations for the individual frames.
arXiv Detail & Related papers (2021-10-26T19:58:32Z) - ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z) - TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks [79.01176229586855]
We propose a novel supervised pretraining paradigm for clip features that considers background clips and global video information to improve temporal sensitivity.
Extensive experiments show that using features trained with our novel pretraining strategy significantly improves the performance of recent state-of-the-art methods on three tasks.
arXiv Detail & Related papers (2020-11-23T15:40:15Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates which pass more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.