Will You Ever Become Popular? Learning to Predict Virality of Dance
Clips
- URL: http://arxiv.org/abs/2111.03819v1
- Date: Sat, 6 Nov 2021 07:26:28 GMT
- Title: Will You Ever Become Popular? Learning to Predict Virality of Dance
Clips
- Authors: Jiahao Wang, Yunhong Wang, Nina Weng, Tianrui Chai, Annan Li, Faxi
Zhang, Sansi Yu
- Abstract summary: We propose a novel multi-modal framework which integrates skeletal, holistic appearance, facial and scenic cues.
To model body movements, we propose a pyramidal skeleton graph convolutional network (PSGCN) which hierarchically refines spatio-temporal skeleton graphs.
To validate our method, we introduce a large-scale viral dance video (VDV) dataset, which contains over 4,000 dance clips of eight viral dance challenges.
- Score: 41.2877440857042
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dance challenges are going viral in video communities like TikTok nowadays.
Once a challenge becomes popular, thousands of short-form videos will be
uploaded in merely a couple of days. Therefore, virality prediction from dance
challenges is of great commercial value and has a wide range of applications,
such as smart recommendation and popularity promotion. In this paper, a novel
multi-modal framework which integrates skeletal, holistic appearance, facial
and scenic cues is proposed for comprehensive dance virality prediction. To
model body movements, we propose a pyramidal skeleton graph convolutional
network (PSGCN) which hierarchically refines spatio-temporal skeleton graphs.
Meanwhile, we introduce a relational temporal convolutional network (RTCN) to
exploit appearance dynamics with non-local temporal relations. An attentive
fusion approach is finally proposed to adaptively aggregate predictions from
different modalities. To validate our method, we introduce a large-scale viral
dance video (VDV) dataset, which contains over 4,000 dance clips of eight viral
dance challenges. Extensive experiments on the VDV dataset demonstrate the
effectiveness of our approach. Furthermore, we show that short
video applications like multi-dimensional recommendation and action feedback
can be derived from our model.
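The abstract says that an attentive fusion approach adaptively aggregates predictions from the different modalities, but it gives no formula. The sketch below is a minimal illustration of one common way to do this, a softmax-weighted average of per-modality scores. It is not the authors' implementation: the function names, the modality keys, the example scores, and the attention logits are all hypothetical, and in a trained model the logits would be produced by a learned module rather than supplied by hand.

```python
import math


def softmax(scores):
    """Turn arbitrary real-valued scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_fusion(predictions, attention_logits):
    """Fuse per-modality predictions with softmax attention weights.

    predictions:      modality name -> predicted virality score
    attention_logits: modality name -> unnormalized attention score
                      (assumed here to be given; hypothetical stand-in
                      for the output of a learned attention module)
    Returns the fused score and the normalized weight per modality.
    """
    if predictions.keys() != attention_logits.keys():
        raise ValueError("predictions and attention_logits must share keys")
    names = sorted(predictions)
    weights = softmax([attention_logits[n] for n in names])
    fused = sum(w * predictions[n] for w, n in zip(weights, names))
    return fused, dict(zip(names, weights))


# Hypothetical scores for the four cues named in the abstract.
predictions = {"skeletal": 0.80, "appearance": 0.60, "facial": 0.40, "scenic": 0.50}
logits = {"skeletal": 2.0, "appearance": 1.0, "facial": 0.0, "scenic": 0.0}

fused, weights = attentive_fusion(predictions, logits)
print("weights:", weights)
print("fused score:", fused)
```

Because the weights are non-negative and sum to 1, the fused score always lies between the smallest and largest per-modality prediction, and equal logits reduce the fusion to a plain average.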
Related papers
- WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z)
- VITON-DiT: Learning In-the-Wild Video Try-On from Human Dance Videos via Diffusion Transformers [53.45587477621942]
We propose the first DiT-based video try-on framework for practical in-the-wild applications, named VITON-DiT.
Specifically, VITON-DiT consists of a garment extractor, a Spatial-Temporal denoising DiT, and an identity preservation ControlNet.
We also introduce random selection strategies during training and an Interpolated Auto-Regressive (IAR) technique at inference to facilitate long video generation.
arXiv Detail & Related papers (2024-05-28T16:21:03Z)
- ViViD: Video Virtual Try-on using Diffusion Models [46.710863047471264]
Video virtual try-on aims to transfer a clothing item onto the video of a target person.
Previous video-based try-on solutions produce only low-quality, blurry results.
We present ViViD, a novel framework employing powerful diffusion models to tackle the task of video virtual try-on.
arXiv Detail & Related papers (2024-05-20T05:28:22Z)
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
- MAGVIT: Masked Generative Video Transformer [129.50814875955444]
We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model.
A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains.
arXiv Detail & Related papers (2022-12-10T04:26:32Z)
- Deep Video Prior for Video Consistency and Propagation [58.250209011891904]
We present a novel and general approach for blind video temporal consistency.
Our method is trained directly on a pair of original and processed videos rather than on a large dataset.
We show that temporal consistency can be achieved by training a convolutional neural network on a video with Deep Video Prior.
arXiv Detail & Related papers (2022-01-27T16:38:52Z)
- Event and Activity Recognition in Video Surveillance for Cyber-Physical Systems [0.0]
We show that long-term motion patterns alone play a pivotal role in the task of recognizing an event.
Only the temporal features are exploited using a hybrid Convolutional Neural Network (CNN) + Recurrent Neural Network (RNN) architecture.
arXiv Detail & Related papers (2021-11-03T08:30:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.