A Unified Framework for Shot Type Classification Based on Subject
Centric Lens
- URL: http://arxiv.org/abs/2008.03548v1
- Date: Sat, 8 Aug 2020 15:49:40 GMT
- Title: A Unified Framework for Shot Type Classification Based on Subject
Centric Lens
- Authors: Anyi Rao, Jiaze Wang, Linning Xu, Xuekun Jiang, Qingqiu Huang, Bolei
Zhou, Dahua Lin
- Abstract summary: We propose a learning framework for shot type recognition, the Subject Guidance Network (SGNet).
SGNet separates the subject and background of a shot into two streams, serving as separate guidance maps for scale and movement type classification respectively.
We build a large-scale dataset MovieShots, which contains 46K shots from 7K movie trailers with annotations of their scale and movement types.
- Score: 89.26211834443558
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Shots are key narrative elements of various videos, e.g. movies, TV series,
and user-generated videos that are thriving over the Internet. The types of
shots greatly influence how the underlying ideas, emotions, and messages are
expressed. The technique to analyze shot types is important to the
understanding of videos, which has seen increasing demand in real-world
applications in this era. Classifying shot type is challenging due to the
additional information required beyond the video content, such as the spatial
composition of a frame and camera movement. To address these issues, we propose
a learning framework Subject Guidance Network (SGNet) for shot type
recognition. SGNet separates the subject and background of a shot into two
streams, serving as separate guidance maps for scale and movement type
classification respectively. To facilitate shot type analysis and model
evaluations, we build a large-scale dataset MovieShots, which contains 46K
shots from 7K movie trailers with annotations of their scale and movement
types. Experiments show that our framework is able to recognize these two
attributes of a shot accurately, outperforming all previous methods.
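The abstract describes SGNet only at a high level, so the sketch below is not the authors' implementation. It is a minimal PyTorch illustration of the general subject-centric two-stream idea: a predicted subject map weights shared frame features, the subject-weighted features drive the scale head, and the background-weighted features drive the movement head. The module names, layer sizes, gating scheme, and single-frame input are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class TwoStreamShotClassifier(nn.Module):
    """Illustrative sketch only (not the paper's SGNet): a soft subject map
    splits shared features into subject and background streams, which feed
    separate heads for shot scale and camera-movement classification."""

    def __init__(self, num_scale_classes=5, num_movement_classes=4):
        # MovieShots annotates 5 scale types and 4 movement types;
        # the defaults here simply mirror those counts.
        super().__init__()
        # Small shared backbone over a single RGB frame (assumed 3xHxW).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Subject stream: predicts a soft subject/background map in [0, 1].
        self.subject_head = nn.Sequential(nn.Conv2d(64, 1, 1), nn.Sigmoid())
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Scale head sees subject-weighted features; movement head sees
        # background-weighted features (1 - subject map).
        self.scale_fc = nn.Linear(64, num_scale_classes)
        self.movement_fc = nn.Linear(64, num_movement_classes)

    def forward(self, frames):
        # frames: (batch, 3, H, W), one representative frame per shot
        feats = self.backbone(frames)              # (B, 64, h, w)
        subject_map = self.subject_head(feats)     # (B, 1, h, w)
        subject_feats = self.pool(feats * subject_map).flatten(1)
        background_feats = self.pool(feats * (1.0 - subject_map)).flatten(1)
        return self.scale_fc(subject_feats), self.movement_fc(background_feats)


if __name__ == "__main__":
    model = TwoStreamShotClassifier()
    scale_logits, movement_logits = model(torch.randn(2, 3, 224, 224))
    print(scale_logits.shape, movement_logits.shape)  # [2, 5] and [2, 4]
```

In the paper, movement classification additionally relies on temporal cues across frames; the single-frame input here is only a simplification to keep the sketch short.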
Related papers
- Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization [2.31529887566247]
We propose a family of efficient large language vision models (LLVMs) to boost video summarization and captioning.
We show that Shotluck Holmes achieves better performance than state-of-the-art results on the Shot2Story video captioning and summary task.
arXiv Detail & Related papers (2024-05-31T07:30:24Z)
- CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects [61.323597069037056]
Current approaches for personalizing text-to-video generation struggle to handle multiple subjects.
We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects.
arXiv Detail & Related papers (2024-01-18T13:23:51Z)
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z)
- Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought [62.619076257298204]
We motivate framing video reasoning as the sequential understanding of a small number of keyframes.
We introduce VIP, an inference-time challenge dataset designed to explore models' reasoning capabilities through video chain-of-thought.
We benchmark GPT-4, GPT-3, and VICUNA on VIP, demonstrate the performance gap in complex video reasoning tasks, and encourage future work.
arXiv Detail & Related papers (2023-05-23T10:26:42Z)
- Weakly-Supervised Action Detection Guided by Audio Narration [50.4318060593995]
We propose a model to learn from the narration supervision and utilize multimodal features, including RGB, motion flow, and ambient sound.
Our experiments show that noisy audio narration suffices to learn a good action detection model, thus reducing annotation expenses.
arXiv Detail & Related papers (2022-05-12T06:33:24Z)
- Movie Genre Classification by Language Augmentation and Shot Sampling [20.119729119879466]
We propose a Movie genre Classification method based on Language augmentatIon and shot samPling (Movie-CLIP).
Movie-CLIP mainly consists of two parts: a language augmentation module to recognize language elements from the input audio, and a shot sampling module to select representative shots from the entire video.
We evaluate our method on the MovieNet and Condensed Movies datasets, achieving an approximate 6-9% improvement in mean Average Precision (mAP) over the baselines.
arXiv Detail & Related papers (2022-03-24T18:15:12Z)
- TNT: Text-Conditioned Network with Transductive Inference for Few-Shot Video Classification [26.12591949900602]
We formulate a text-based task conditioner to adapt video features to the few-shot learning task.
Our model obtains state-of-the-art performance on four challenging benchmarks in few-shot video action classification.
arXiv Detail & Related papers (2021-06-21T15:08:08Z)
- Learning Implicit Temporal Alignment for Few-shot Video Classification [40.57508426481838]
Few-shot video classification aims to learn new video categories with only a few labeled examples.
It is particularly challenging to learn a class-invariant spatial-temporal representation in such a setting.
We propose a novel matching-based few-shot learning strategy for video sequences in this work.
arXiv Detail & Related papers (2021-05-11T07:18:57Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self- and cross-integration for different sources (video and dense captions), and gates that pass on the most relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.