MovieCuts: A New Dataset and Benchmark for Cut Type Recognition
- URL: http://arxiv.org/abs/2109.05569v1
- Date: Sun, 12 Sep 2021 17:36:55 GMT
- Title: MovieCuts: A New Dataset and Benchmark for Cut Type Recognition
- Authors: Alejandro Pardo, Fabian Caba Heilbron, Juan León Alcázar, Ali Thabet, Bernard Ghanem
- Abstract summary: This paper introduces the cut type recognition task, which requires modeling of multi-modal information.
We construct a large-scale dataset called MovieCuts, which contains more than 170K video clips labeled with ten cut types.
Our best model achieves 45.7% mAP, which suggests that the task is challenging and that attaining highly accurate cut type recognition is an open research problem.
- Score: 114.57935905189416
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding movies and their structural patterns is a crucial task to
decode the craft of video editing. While previous works have developed tools
for general analysis such as detecting characters or recognizing cinematography
properties at the shot level, less effort has been devoted to understanding the
most basic video edit, the Cut. This paper introduces the cut type recognition
task, which requires modeling of multi-modal information. To ignite research in
the new task, we construct a large-scale dataset called MovieCuts, which
contains more than 170K video clips labeled with ten cut types. We benchmark a
series of audio-visual approaches, including some that deal with the problem's
multi-modal and multi-label nature. Our best model achieves 45.7% mAP, which
suggests that the task is challenging and that attaining highly accurate cut
type recognition is an open research problem.
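The abstract frames cut type recognition as an audio-visual, multi-label problem evaluated with mAP. Below is a minimal sketch of that kind of setup; the feature dimensions, the late-fusion head, and the random data are illustrative assumptions, not the benchmark models or features used in the paper.

```python
# A minimal sketch (not the authors' model): a late-fusion audio-visual
# classifier for a ten-way multi-label task, evaluated with mean average
# precision (mAP). All dimensions and data here are placeholders.
import torch
import torch.nn as nn
from sklearn.metrics import average_precision_score

NUM_CUT_TYPES = 10  # MovieCuts labels clips with ten cut types

class LateFusionClassifier(nn.Module):
    """Concatenates clip-level visual and audio embeddings, then predicts
    independent per-class logits (multi-label, so sigmoid + BCE)."""
    def __init__(self, visual_dim=512, audio_dim=128, num_classes=NUM_CUT_TYPES):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(visual_dim + audio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, visual_feat, audio_feat):
        return self.head(torch.cat([visual_feat, audio_feat], dim=-1))

# Toy forward pass and mAP computation on random data.
model = LateFusionClassifier()
visual = torch.randn(32, 512)   # e.g. pooled video-backbone features
audio = torch.randn(32, 128)    # e.g. pooled audio-backbone features
labels = torch.randint(0, 2, (32, NUM_CUT_TYPES)).float()

logits = model(visual, audio)
loss = nn.BCEWithLogitsLoss()(logits, labels)  # multi-label objective

with torch.no_grad():
    scores = torch.sigmoid(logits).numpy()
# mAP: average precision per class, then the mean over the ten classes.
mAP = average_precision_score(labels.numpy(), scores, average="macro")
print(f"loss={loss.item():.3f}  mAP={mAP:.3f}")
```

Macro-averaged average precision is a natural fit here because each clip can carry more than one cut-type label and the classes are not mutually exclusive.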
Related papers
- MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval [57.891157692501345]
MultiVENT 2.0 is a large-scale, multilingual event-centric video retrieval benchmark.
It features a collection of more than 218,000 news videos and 3,906 queries targeting specific world events.
Preliminary results show that state-of-the-art vision-language models struggle significantly with this task.
arXiv Detail & Related papers (2024-10-15T13:56:34Z)
- A Comprehensive Review of Few-shot Action Recognition [64.47305887411275]
Few-shot action recognition aims to address the high cost and impracticality of manually labeling complex and variable video data.
It requires accurately classifying human actions in videos using only a few labeled examples per class.
arXiv Detail & Related papers (2024-07-20T03:53:32Z)
- Active Learning for Video Classification with Frame Level Queries [13.135234328352885]
We propose a novel active learning framework for video classification.
Our framework identifies a batch of exemplar videos, together with a set of informative frames for each video.
This involves much less manual work than watching the complete video to come up with a label.
arXiv Detail & Related papers (2023-07-10T15:47:13Z)
- MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
arXiv Detail & Related papers (2023-03-29T16:42:30Z)
- Learning to Cut by Watching Movies [114.57935905189416]
This paper focuses on a new task for computational video editing, namely the task of ranking cut plausibility.
Our key idea is to leverage content that has already been edited to learn fine-grained audiovisual patterns that trigger cuts.
We devise a model that learns to discriminate between real and artificial cuts via contrastive learning (a minimal sketch of this kind of objective appears after this list).
arXiv Detail & Related papers (2021-08-09T18:37:17Z)
- TNT: Text-Conditioned Network with Transductive Inference for Few-Shot Video Classification [26.12591949900602]
We formulate a text-based task conditioner to adapt video features to the few-shot learning task.
Our model obtains state-of-the-art performance on four challenging benchmarks in few-shot video action classification.
arXiv Detail & Related papers (2021-06-21T15:08:08Z)
- Highlight Timestamp Detection Model for Comedy Videos via Multimodal Sentiment Analysis [1.6181085766811525]
We propose a multimodal architecture that obtains state-of-the-art performance in this field.
We select several benchmarks for multimodal video understanding and apply the most suitable model to find the best performance.
arXiv Detail & Related papers (2021-05-28T08:39:19Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self- and cross-integration for different sources (video and dense captions), and gates that pass on the most relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
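As referenced in the "Learning to Cut by Watching Movies" entry above, that paper discriminates real from artificial cuts with contrastive learning. The sketch below illustrates one such objective; the scorer architecture, feature sizes, and negative-sampling scheme are illustrative assumptions, not the paper's actual method.

```python
# A minimal sketch of a contrastive cut-discrimination objective: score a
# real (professionally edited) cut against artificial cuts built by pairing
# the same left shot with random right shots, and train the scorer to rank
# the real cut highest. Everything below is a placeholder, not the paper's
# architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CutScorer(nn.Module):
    """Maps the audio-visual features of the two shots around a cut to a
    single plausibility score."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, left_shot, right_shot):
        return self.mlp(torch.cat([left_shot, right_shot], dim=-1)).squeeze(-1)

scorer = CutScorer()
feat_dim, num_negatives = 256, 7

# One real cut (left/right shot features around an actual edit point) ...
real_left, real_right = torch.randn(1, feat_dim), torch.randn(1, feat_dim)
# ... and several artificial cuts made by swapping in random right shots.
fake_right = torch.randn(num_negatives, feat_dim)

pos_score = scorer(real_left, real_right)                         # shape (1,)
neg_scores = scorer(real_left.expand(num_negatives, -1), fake_right)

# InfoNCE-style objective: the real cut should win against the negatives.
logits = torch.cat([pos_score, neg_scores]).unsqueeze(0)          # (1, 1+N)
loss = F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))
print(f"contrastive loss = {loss.item():.3f}")
```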