A Unified Framework for Shot Type Classification Based on Subject
Centric Lens
- URL: http://arxiv.org/abs/2008.03548v1
- Date: Sat, 8 Aug 2020 15:49:40 GMT
- Title: A Unified Framework for Shot Type Classification Based on Subject
Centric Lens
- Authors: Anyi Rao, Jiaze Wang, Linning Xu, Xuekun Jiang, Qingqiu Huang, Bolei
Zhou, Dahua Lin
- Abstract summary: We propose Subject Guidance Network (SGNet), a learning framework for shot type recognition.
SGNet separates the subject and background of a shot into two streams, serving as separate guidance maps for scale and movement type classification respectively.
We build a large-scale dataset MovieShots, which contains 46K shots from 7K movie trailers with annotations of their scale and movement types.
- Score: 89.26211834443558
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Shots are key narrative elements of various videos, e.g. movies, TV series,
and user-generated videos that are thriving on the Internet. The types of
shots greatly influence how the underlying ideas, emotions, and messages are
expressed. The ability to analyze shot types is therefore important for video
understanding, which is in increasing demand in real-world applications.
Classifying shot types is challenging due to the
additional information required beyond the video content, such as the spatial
composition of a frame and camera movement. To address these issues, we propose
a learning framework, Subject Guidance Network (SGNet), for shot type
recognition. SGNet separates the subject and background of a shot into two
streams, serving as separate guidance maps for scale and movement type
classification respectively. To facilitate shot type analysis and model
evaluations, we build a large-scale dataset MovieShots, which contains 46K
shots from 7K movie trailers with annotations of their scale and movement
types. Experiments show that our framework is able to recognize these two
attributes of shot accurately, outperforming all the previous methods.
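To make the two-stream design concrete, below is a minimal sketch of a subject-guided, two-head classifier in PyTorch. It is not the authors' implementation: the backbone, the source of the subject map (e.g., an off-the-shelf subject segmenter), the class counts, and all layer sizes are illustrative assumptions; only the overall structure, a subject-guided head for scale and a background-guided head for movement, follows the abstract.

```python
# Minimal sketch (not the authors' code): a two-stream classifier in which a
# subject map guides scale classification and the complementary background
# map guides movement classification. Backbone, feature sizes, class counts,
# and the source of the subject map are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubjectGuidedShotClassifier(nn.Module):
    def __init__(self, num_scale_classes=5, num_movement_classes=4, feat_dim=64):
        super().__init__()
        # Shared spatio-temporal frame encoder (deliberately tiny).
        self.frame_encoder = nn.Sequential(
            nn.Conv3d(3, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((1, 8, 8)),
        )
        self.scale_head = nn.Sequential(      # subject-guided stream
            nn.Flatten(), nn.Linear(feat_dim * 8 * 8, num_scale_classes))
        self.movement_head = nn.Sequential(   # background-guided stream
            nn.Flatten(), nn.Linear(feat_dim * 8 * 8, num_movement_classes))

    def forward(self, frames, subject_map):
        # frames: (B, 3, T, H, W); subject_map: (B, 1, H, W) in [0, 1],
        # e.g. from an off-the-shelf subject/saliency segmenter (assumption).
        feats = self.frame_encoder(frames).squeeze(2)              # (B, C, 8, 8)
        subj = F.adaptive_avg_pool2d(subject_map, (8, 8))          # (B, 1, 8, 8)
        scale_logits = self.scale_head(feats * subj)               # subject-guided
        movement_logits = self.movement_head(feats * (1.0 - subj)) # background-guided
        return scale_logits, movement_logits
```

In training, both heads would be supervised jointly (e.g., with cross-entropy losses) on the per-shot scale and movement labels such as those provided by MovieShots.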
Related papers
- Can video generation replace cinematographers? Research on the cinematic language of generated video [31.0131670022777]
We propose a threefold approach to enhance the ability of T2V models to generate controllable cinematic language.
We introduce a cinematic language dataset that encompasses shot framing, angle, and camera movement, enabling models to learn diverse cinematic styles.
We then present CameraCLIP, a model fine-tuned on the proposed dataset that excels in understanding complex cinematic language in generated videos.
Finally, we propose CLIPLoRA, a cost-guided dynamic LoRA composition method that facilitates smooth transitions and realistic blending of cinematic language.
arXiv Detail & Related papers (2024-12-16T09:02:24Z)
- VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation [70.61101071902596]
Current generation models excel at generating short clips but still struggle with creating multi-shot, movie-like videos.
We propose VideoGen-of-Thought (VGoT), a collaborative and training-free architecture designed specifically for multi-shot video generation.
Our experiments demonstrate that VGoT surpasses existing video generation methods in producing high-quality, coherent, multi-shot videos.
arXiv Detail & Related papers (2024-12-03T08:33:50Z)
- MSG score: A Comprehensive Evaluation for Multi-Scene Video Generation [0.0]
This paper addresses the metrics required for generating multi-scene videos based on a continuous scenario.
In video generation, unlike single images, the movement of characters across frames introduces potential issues like distortion or unintended changes.
We propose a score-based evaluation benchmark that automates this process, enabling a more objective and efficient assessment of these complexities.
arXiv Detail & Related papers (2024-11-28T13:11:50Z)
- Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization [2.31529887566247]
We propose a family of efficient large language vision models (LLVMs) to boost video summarization and captioning.
We show that Shotluck Holmes achieves better performance than state-of-the-art results on the Shot2Story video captioning and summary task.
arXiv Detail & Related papers (2024-05-31T07:30:24Z)
- CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects [61.323597069037056]
Current approaches to personalizing text-to-video generation struggle to handle multiple subjects.
We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects.
arXiv Detail & Related papers (2024-01-18T13:23:51Z)
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z)
- Weakly-Supervised Action Detection Guided by Audio Narration [50.4318060593995]
We propose a model to learn from the narration supervision and utilize multimodal features, including RGB, motion flow, and ambient sound.
Our experiments show that noisy audio narration suffices to learn a good action detection model, thus reducing annotation expenses.
arXiv Detail & Related papers (2022-05-12T06:33:24Z)
- Movie Genre Classification by Language Augmentation and Shot Sampling [20.119729119879466]
We propose a Movie genre Classification method based on Language augmentatIon and shot samPling (Movie-CLIP).
Movie-CLIP mainly consists of two parts: a language augmentation module to recognize language elements from the input audio, and a shot sampling module to select representative shots from the entire video.
We evaluate our method on the MovieNet and Condensed Movies datasets, achieving an approximate 6-9% improvement in mean Average Precision (mAP) over the baselines.
arXiv Detail & Related papers (2022-03-24T18:15:12Z)
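A rough sketch of the two-module pipeline described in the Movie-CLIP entry above follows. The ASR call, text encoder, and classifier are hypothetical callables supplied by the caller, and the uniform-stride shot sampler merely stands in for whatever sampling strategy the paper actually uses.

```python
# Rough sketch (not the paper's code) of the two Movie-CLIP modules:
# (1) language augmentation from the audio track, (2) representative-shot
# sampling, followed by a simple fuse-and-classify step. transcribe_audio,
# encode, and classifier are hypothetical callables supplied by the caller.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Shot:
    start: float            # seconds
    end: float
    features: List[float]   # visual embedding for the shot (placeholder)

def augment_language(audio_path: str, transcribe_audio: Callable[[str], str]) -> List[str]:
    """Recognize language elements (here: simple keywords) from the audio."""
    transcript = transcribe_audio(audio_path)   # hypothetical ASR call
    return [w for w in transcript.lower().split() if len(w) > 3]

def sample_shots(shots: List[Shot], k: int = 8) -> List[Shot]:
    """Select k representative shots; uniform stride is an assumption."""
    if len(shots) <= k:
        return shots
    stride = len(shots) / k
    return [shots[int(i * stride)] for i in range(k)]

def classify_genre(shots: List[Shot], keywords: List[str],
                   encode: Callable[[str], List[float]],
                   classifier: Callable[[List[float]], List[float]]) -> List[float]:
    """Fuse averaged shot features with encoded language elements and classify."""
    sampled = sample_shots(shots)
    video_feat = [sum(col) / len(sampled) for col in zip(*(s.features for s in sampled))]
    text_feat = encode(" ".join(keywords))       # hypothetical text encoder
    return classifier(video_feat + text_feat)    # concatenated features -> genre scores
```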
- TNT: Text-Conditioned Network with Transductive Inference for Few-Shot Video Classification [26.12591949900602]
We formulate a text-based task conditioner to adapt video features to the few-shot learning task.
Our model obtains state-of-the-art performance on four challenging benchmarks in few-shot video action classification.
arXiv Detail & Related papers (2021-06-21T15:08:08Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self-cross-integration across sources (video and dense captions), and gates that pass the more relevant information forward.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
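A minimal sketch of the frame-selection gating idea from the VideoQA entry directly above: a learned gate scores each frame's fused video/dense-caption feature against the question and re-weights it accordingly. The single gate, the fusion scheme, and all dimensions are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch (not the paper's model): question-conditioned frame-selection
# gating. Each frame's fused video+caption feature is scored against the
# question embedding; the gate passes more relevant frames forward.
import torch
import torch.nn as nn

class FrameSelectionGate(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)   # fuse video and dense-caption features
        self.score = nn.Linear(2 * dim, 1)    # relevance of each frame to the question

    def forward(self, video_feats, caption_feats, question_feat):
        # video_feats, caption_feats: (B, T, dim); question_feat: (B, dim)
        fused = torch.tanh(self.fuse(torch.cat([video_feats, caption_feats], dim=-1)))
        q = question_feat.unsqueeze(1).expand_as(fused)                   # (B, T, dim)
        gate = torch.sigmoid(self.score(torch.cat([fused, q], dim=-1)))   # (B, T, 1)
        return (gate * fused).sum(dim=1)  # gated, temporally pooled representation
```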
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.