A Case Study of Deep Learning Based Multi-Modal Methods for Predicting
the Age-Suitability Rating of Movie Trailers
- URL: http://arxiv.org/abs/2101.11704v1
- Date: Tue, 26 Jan 2021 17:15:35 GMT
- Title: A Case Study of Deep Learning Based Multi-Modal Methods for Predicting
the Age-Suitability Rating of Movie Trailers
- Authors: Mahsa Shafaei, Christos Smailis, Ioannis A. Kakadiaris, Thamar Solorio
- Abstract summary: We introduce a new dataset containing videos of movie trailers in English downloaded from IMDB and YouTube.
We propose a multi-modal deep learning pipeline addressing the movie trailer age-suitability rating problem.
- Score: 15.889598494755646
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we explore different approaches to combine modalities for the
problem of automated age-suitability rating of movie trailers. First, we
introduce a new dataset containing videos of movie trailers in English
downloaded from IMDB and YouTube, along with their corresponding
age-suitability rating labels. Second, we propose a multi-modal deep learning
pipeline addressing the movie trailer age-suitability rating problem. This is
the first attempt to combine video, audio, and speech information for this
problem, and our experimental results show that multi-modal approaches
significantly outperform the best monomodal and bimodal models on this task.
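As a concrete illustration of the fusion strategy, here is a minimal late-fusion sketch in PyTorch: each modality's pre-extracted embedding is projected and concatenated before a shared classifier. All dimensions, names, and the four-way rating head are illustrative assumptions, not the authors' implementation.
```python
# Late-fusion sketch for age-suitability rating (illustrative only).
# Assumes pre-extracted trailer-level embeddings per modality; all
# dimensions, names, and the 4-class head are hypothetical.
import torch
import torch.nn as nn

class LateFusionRater(nn.Module):
    def __init__(self, video_dim=2048, audio_dim=128, text_dim=768,
                 hidden=256, num_classes=4):
        super().__init__()
        # One projection head per modality.
        self.video_proj = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU())
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.text_proj = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        # Classifier over the concatenated modality representations.
        self.classifier = nn.Linear(3 * hidden, num_classes)

    def forward(self, video_emb, audio_emb, text_emb):
        fused = torch.cat([self.video_proj(video_emb),
                           self.audio_proj(audio_emb),
                           self.text_proj(text_emb)], dim=-1)
        return self.classifier(fused)

# Random stand-in features for a batch of 8 trailers.
model = LateFusionRater()
logits = model(torch.randn(8, 2048), torch.randn(8, 128), torch.randn(8, 768))
print(logits.shape)  # torch.Size([8, 4])
```
Dropping one or two branches from the concatenation yields the kind of monomodal and bimodal baselines the abstract compares against.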
Related papers
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions.
We propose a systematic captioning framework, producing multi-modality annotations for more than 27.1k hours of trailer videos.
Our dataset potentially paves the way for fine-grained large multimodal-language model training.
arXiv Detail & Related papers (2024-07-30T16:43:24Z)
- Towards Automated Movie Trailer Generation [98.9854474456265]
We introduce Trailer Generation Transformer (TGT), a deep-learning framework utilizing an encoder-decoder architecture.
TGT movie encoder is tasked with contextualizing each movie shot representation via self-attention, while the autoregressive trailer decoder predicts the feature representation of the next trailer shot.
Our TGT significantly outperforms previous methods on a comprehensive suite of metrics.
arXiv Detail & Related papers (2024-04-04T14:28:34Z)
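A minimal sketch of the encoder-decoder pattern this summary describes, assuming shot-level feature vectors; layer counts, dimensions, and names are hypothetical, not the TGT implementation:
```python
# Self-attention shot encoder + autoregressive trailer decoder
# (illustrative; not the authors' TGT implementation).
import torch
import torch.nn as nn

d = 512  # hypothetical shot-feature dimension
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True),
    num_layers=4)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True),
    num_layers=4)

movie_shots = torch.randn(1, 200, d)    # all shots of one movie
trailer_so_far = torch.randn(1, 10, d)  # trailer shots selected so far

memory = encoder(movie_shots)           # contextualize shots via self-attention
causal_mask = nn.Transformer.generate_square_subsequent_mask(10)
out = decoder(trailer_so_far, memory, tgt_mask=causal_mask)
next_shot_feat = out[:, -1]             # feature prediction for the next trailer shot
```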
arXiv Detail & Related papers (2024-04-04T14:28:34Z) - Are We on the Right Way for Evaluating Large Vision-Language Models? [92.5761176224556]
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities.
We identify two primary issues: visual content is unnecessary for many samples, and intentional data leakage exists.
We present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans.
arXiv Detail & Related papers (2024-03-29T17:59:34Z)
- Find the Cliffhanger: Multi-Modal Trailerness in Soap Operas [17.476344577463525]
We introduce a multi-modal method for predicting trailerness to assist editors in selecting trailer-worthy moments from long-form videos.
We present results on a newly introduced soap opera dataset, demonstrating that predicting trailerness is a challenging task.
arXiv Detail & Related papers (2024-01-29T11:34:36Z)
- Read, Look or Listen? What's Needed for Solving a Multimodal Dataset [7.0430001782867]
We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it.
We apply our approach to TVQA, a video question-answering dataset, and discover that most questions can be answered using a single modality, without a substantial bias towards any specific modality.
We analyze MERLOT Reserve, finding that it struggles with image-based questions compared to text- and audio-based ones, and also with auditory speaker identification.
arXiv Detail & Related papers (2023-07-06T08:02:45Z)
- Film Trailer Generation via Task Decomposition [65.16768855902268]
We model movies as graphs, where nodes are shots and edges denote semantic relations between them.
We learn these relations using joint contrastive training which leverages privileged textual information from screenplays.
An unsupervised algorithm then traverses the graph and generates trailers that human judges prefer to ones generated by competitive supervised approaches.
arXiv Detail & Related papers (2021-11-16T20:50:52Z)
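As a toy illustration of traversing a shot graph to assemble a trailer, here is a greedy walk over a hypothetical weighted adjacency map; the paper's actual unsupervised algorithm and learned edge weights differ:
```python
# Toy greedy traversal of a shot graph (illustrative; the paper's actual
# unsupervised algorithm and learned edge weights differ).
def generate_trailer(graph, start, budget=5):
    """graph: {shot_id: {neighbor_id: relation_score}}. Greedily follow the
    highest-scoring unvisited neighbor until the shot budget is spent."""
    trailer, current = [start], start
    while len(trailer) < budget:
        candidates = {n: w for n, w in graph[current].items() if n not in trailer}
        if not candidates:
            break
        current = max(candidates, key=candidates.get)
        trailer.append(current)
    return trailer

# Nodes are shots; edge weights stand in for learned semantic-relation scores.
shot_graph = {0: {1: 0.9, 2: 0.4}, 1: {0: 0.9, 3: 0.8},
              2: {0: 0.4, 3: 0.5}, 3: {1: 0.8, 2: 0.5}}
print(generate_trailer(shot_graph, start=0, budget=3))  # [0, 1, 3]
```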
- Multilevel profiling of situation and dialogue-based deep networks for movie genre classification using movie trailers [7.904790547594697]
We propose a novel multi-modal movie genre classification framework based on situation, dialogue, and metadata.
We develop the English movie trailer dataset (EMTD), which contains 2000 Hollywood movie trailers belonging to five popular genres.
arXiv Detail & Related papers (2021-09-14T07:33:56Z)
- Learning Trailer Moments in Full-Length Movies [49.74693903050302]
We leverage officially released trailers as weak supervision to learn a model that detects key moments in full-length movies.
We introduce a novel ranking network that utilizes the Co-Attention between movies and trailers as guidance to generate the training pairs.
We construct the first movie-trailer dataset, and the proposed Co-Attention assisted ranking network shows superior performance even over the supervised approach.
arXiv Detail & Related papers (2020-08-19T15:23:25Z)
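A minimal cross-attention sketch of the co-attention idea, scoring movie shots by how strongly trailer shots attend to them; the module names and dimensions are assumptions, not the paper's ranking network:
```python
# Cross-attention ("co-attention") sketch: score movie shots by how strongly
# trailer shots attend to them. Illustrative only; dimensions are hypothetical.
import torch
import torch.nn as nn

d = 256  # hypothetical shot-feature dimension
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

movie = torch.randn(1, 120, d)   # shot features for a full-length movie
trailer = torch.randn(1, 15, d)  # shot features for its official trailer

# Trailer shots query the movie shots; the attention weights indicate which
# movie moments look most "trailer-like".
_, weights = attn(query=trailer, key=movie, value=movie)  # weights: (1, 15, 120)
moment_scores = weights.mean(dim=1).squeeze(0)  # average over trailer shots
top_moments = moment_scores.topk(5).indices     # candidate key moments
print(top_moments)
```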
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.