AutoShot: A Short Video Dataset and State-of-the-Art Shot Boundary
Detection
- URL: http://arxiv.org/abs/2304.06116v1
- Date: Wed, 12 Apr 2023 19:01:21 GMT
- Title: AutoShot: A Short Video Dataset and State-of-the-Art Shot Boundary
Detection
- Authors: Wentao Zhu, Yufang Huang, Xiufeng Xie, Wenxian Liu, Jincan Deng,
Debing Zhang, Zhangyang Wang, Ji Liu
- Abstract summary: We release a new public Short video sHot bOundary deTection dataset, named SHOT.
SHOT consists of 853 complete short videos and 11,606 shot annotations, with 2,716 high-quality shot boundary annotations in 200 test videos.
Our proposed approach, named AutoShot, achieves higher F1 scores than previous state-of-the-art approaches.
- Score: 70.99025467739715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Short-form videos have exploded in popularity and now dominate new
social media trends. Prevailing short-video platforms, e.g., Kuaishou
(Kwai), TikTok, Instagram Reels, and YouTube Shorts, have changed the way we
consume and create content. For video content creation and understanding, the
shot boundary detection (SBD) is one of the most essential components in
various scenarios. In this work, we release a new public Short video sHot
bOundary deTection dataset, named SHOT, consisting of 853 complete short videos
and 11,606 shot annotations, with 2,716 high-quality shot boundary annotations
in 200 test videos. Leveraging this new data wealth, we propose to optimize the
model design for video SBD, by conducting neural architecture search in a
search space encapsulating various advanced 3D ConvNets and Transformers. Our
proposed approach, named AutoShot, achieves higher F1 scores than previous
state-of-the-art approaches, e.g., outperforming TransNetV2 by 4.2%, when
derived and evaluated on our newly constructed SHOT dataset. Moreover, to
validate the generalizability of the AutoShot architecture, we directly
evaluate it on three additional public datasets, ClipShots, BBC and RAI, where
AutoShot's F1 scores exceed previous state-of-the-art approaches by 1.1%,
0.9% and 1.2%, respectively. The SHOT dataset and code can be found at
https://github.com/wentaozhu/AutoShot.git .
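The abstract reports F1 scores for shot boundary detection on the SHOT test videos and on ClipShots, BBC and RAI. For context, below is a minimal, hedged sketch of how SBD F1 is commonly computed: predicted boundary frames are matched to annotated boundaries within a small frame tolerance, and precision, recall, and F1 follow from the matches. The function name, tolerance default, and toy numbers are illustrative assumptions, not taken from the AutoShot codebase.

```python
# Minimal sketch of a common shot-boundary-detection F1 metric: a predicted
# boundary counts as a true positive if it falls within a small frame
# tolerance of an unmatched ground-truth boundary. The function name and the
# tolerance default are illustrative, not taken from the AutoShot repository.

def sbd_f1(pred_boundaries, gt_boundaries, tolerance=2):
    """Compute precision, recall, and F1 for one video.

    pred_boundaries, gt_boundaries: sorted lists of boundary frame indices.
    tolerance: max frame distance for a prediction to match an annotation.
    """
    matched_gt = set()
    true_positives = 0
    for p in pred_boundaries:
        # Greedily match each prediction to the nearest unused ground truth.
        candidates = [
            (abs(p - g), i) for i, g in enumerate(gt_boundaries)
            if i not in matched_gt and abs(p - g) <= tolerance
        ]
        if candidates:
            _, best = min(candidates)
            matched_gt.add(best)
            true_positives += 1

    precision = true_positives / len(pred_boundaries) if pred_boundaries else 0.0
    recall = true_positives / len(gt_boundaries) if gt_boundaries else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy example: two predictions match annotated boundaries, one is spurious,
    # and one annotated boundary is missed, so precision = recall = F1 = 2/3.
    print(sbd_f1([120, 301, 710], [121, 300, 540], tolerance=2))
```

In practice this per-video score is aggregated over the whole test split; the exact matching protocol and tolerance used for the reported numbers should be taken from the AutoShot repository.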
Related papers
- Building an Open-Vocabulary Video CLIP Model with Better Architectures,
Optimization and Data [102.0069667710562]
This paper presents Open-VCLIP++, a framework that adapts CLIP to a strong zero-shot video classifier.
We demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data.
Our approach is evaluated on three widely used action recognition datasets.
arXiv Detail & Related papers (2023-10-08T04:46:43Z) - Seer: Language Instructed Video Prediction with Latent Diffusion Models [43.708550061909754]
Text-conditioned video prediction (TVP) is an essential task to facilitate general robot policy learning.
We propose a sample- and computation-efficient model, named Seer, by inflating the pretrained text-to-image (T2I) stable diffusion models along the temporal axis.
With its adaptable architecture, Seer can generate high-fidelity, coherent, and instruction-aligned video frames.
arXiv Detail & Related papers (2023-03-27T03:12:24Z) - Expanding Language-Image Pretrained Models for General Video Recognition [136.0948049010682]
Contrastive language-image pretraining has shown great success in learning visual-textual joint representation from web-scale data.
We present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly.
Our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols.
arXiv Detail & Related papers (2022-08-04T17:59:54Z) - Few-Shot Video Object Detection [70.43402912344327]
We introduce Few-Shot Video Object Detection (FSVOD) with three important contributions.
FSVOD-500 comprises 500 classes with class-balanced videos in each category for few-shot learning.
Our TPN and TMN+ are jointly and end-to-end trained.
arXiv Detail & Related papers (2021-04-30T07:38:04Z) - Robust 2D/3D Vehicle Parsing in CVIS [54.825777404511605]
We present a novel approach to robustly detect and perceive vehicles in different camera views as part of a cooperative vehicle-infrastructure system (CVIS).
Our formulation is designed for arbitrary camera views and makes no assumptions about intrinsic or extrinsic parameters.
In practice, our approach outperforms SOTA methods on 2D detection, instance segmentation, and 6-DoF pose estimation.
arXiv Detail & Related papers (2021-03-11T03:35:05Z) - Understanding Road Layout from Videos as a Whole [82.30800791500869]
We formulate it as a top-view road attributes prediction problem and our goal is to predict these attributes for each frame both accurately and consistently.
We exploit three novel aspects: leveraging camera motions in videos, including context cues, and incorporating long-term video information.
arXiv Detail & Related papers (2020-07-02T00:59:15Z) - Unified Image and Video Saliency Modeling [21.701431656717112]
We ask: Can image and video saliency modeling be approached via a unified model?
We propose four novel domain adaptation techniques and an improved formulation of learned Gaussian priors.
We integrate these techniques into a simple and lightweight encoder-RNN-decoder-style network, UNISAL, and train it jointly with image and video saliency data.
We evaluate our method on the video saliency datasets DHF1K, Hollywood-2 and UCF-Sports, and the image saliency datasets SALICON and MIT300.
arXiv Detail & Related papers (2020-03-11T18:28:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.