Will You Ever Become Popular? Learning to Predict Virality of Dance
Clips
- URL: http://arxiv.org/abs/2111.03819v1
- Date: Sat, 6 Nov 2021 07:26:28 GMT
- Title: Will You Ever Become Popular? Learning to Predict Virality of Dance
Clips
- Authors: Jiahao Wang, Yunhong Wang, Nina Weng, Tianrui Chai, Annan Li, Faxi
Zhang, Sansi Yu
- Abstract summary: We propose a novel multi-modal framework which integrates skeletal, holistic appearance, facial and scenic cues.
To model body movements, we propose a pyramidal skeleton graph convolutional network (PSGCN) which hierarchically refines spatio-temporal skeleton graphs.
To validate our method, we introduce a large-scale viral dance video (VDV) dataset, which contains over 4,000 dance clips of eight viral dance challenges.
- Score: 41.2877440857042
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dance challenges are going viral in video communities like TikTok nowadays.
Once a challenge becomes popular, thousands of short-form videos will be
uploaded in merely a couple of days. Therefore, virality prediction from dance
challenges is of great commercial value and has a wide range of applications,
such as smart recommendation and popularity promotion. In this paper, a novel
multi-modal framework which integrates skeletal, holistic appearance, facial
and scenic cues is proposed for comprehensive dance virality prediction. To
model body movements, we propose a pyramidal skeleton graph convolutional
network (PSGCN) which hierarchically refines spatio-temporal skeleton graphs.
Meanwhile, we introduce a relational temporal convolutional network (RTCN) to
exploit appearance dynamics with non-local temporal relations. An attentive
fusion approach is finally proposed to adaptively aggregate predictions from
different modalities. To validate our method, we introduce a large-scale viral
dance video (VDV) dataset, which contains over 4,000 dance clips of eight viral
dance challenges. Extensive experiments on the VDV dataset demonstrate the
effectiveness of our approach. Furthermore, we show that short
video applications like multi-dimensional recommendation and action feedback
can be derived from our model.
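As a reading aid, the attentive fusion step described above (adaptively aggregating the predictions of the skeleton, appearance, face, and scene branches) can be pictured as a small gating module over per-modality scores. The following PyTorch sketch is purely illustrative: the branch count, feature dimension, linear gates, and softmax weighting are assumptions for exposition, not the authors' implementation.

```python
# Hedged sketch of attentive multi-modal fusion for virality prediction.
# Assumptions (not from the paper): each branch emits a 256-d feature and a
# scalar score; fusion weights come from a softmax over learned per-branch gates.
import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    def __init__(self, num_modalities: int = 4, feat_dim: int = 256):
        super().__init__()
        # One gating scalar per modality, computed from that branch's feature.
        self.gates = nn.ModuleList(
            [nn.Linear(feat_dim, 1) for _ in range(num_modalities)]
        )

    def forward(self, feats, scores):
        # feats:  list of (B, feat_dim) tensors, one per modality branch
        # scores: list of (B, 1) per-modality virality predictions
        gate_logits = torch.cat([g(f) for g, f in zip(self.gates, feats)], dim=1)
        weights = torch.softmax(gate_logits, dim=1)           # (B, M)
        stacked = torch.cat(scores, dim=1)                    # (B, M)
        # Adaptively weighted combination of the per-modality predictions.
        return (weights * stacked).sum(dim=1, keepdim=True)   # (B, 1)

# Usage with dummy branch outputs (skeleton, appearance, face, scene):
if __name__ == "__main__":
    feats = [torch.randn(8, 256) for _ in range(4)]
    scores = [torch.randn(8, 1) for _ in range(4)]
    print(AttentiveFusion()(feats, scores).shape)  # torch.Size([8, 1])
```

A learned softmax over per-branch gates lets the fusion down-weight modalities that are uninformative for a particular clip, which matches the adaptive aggregation described in the abstract.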
Related papers
- DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework [26.661935208583756]
Video virtual try-on (VVT) technology has garnered considerable academic interest owing to its promising applications in e-commerce advertising and entertainment.
We propose DreamVVT, which is inherently capable of leveraging diverse unpaired human-centric data to enhance adaptability in real-world scenarios.
In the first stage, we sample representative frames from the input video and utilize a multi-frame try-on model integrated with a vision-language model (VLM) to synthesize high-fidelity and semantically consistent try-on images.
In the second stage, skeleton maps together with fine-grained motion and appearance descriptions are
arXiv Detail & Related papers (2025-08-04T18:27:55Z) - MVP: Winning Solution to SMP Challenge 2025 Video Track [16.78634288864967]
We present Multimodal Video Predictor (MVP), our winning solution to the Video Track of the SMP Challenge 2025.
MVP constructs expressive post representations by integrating deep video features extracted from pretrained models with user metadata and contextual information.
Our approach ranked first in the official evaluation of the Video Track, demonstrating its effectiveness and reliability for multimodal video popularity prediction on social platforms.
arXiv Detail & Related papers (2025-07-01T16:52:20Z) - MUFM: A Mamba-Enhanced Feedback Model for Micro Video Popularity Prediction [1.7040391128945196]
We introduce a framework for capturing long-term dependencies in user feedback and dynamic event interactions.
Our experiments on the large-scale open-source multi-modal dataset show that our model significantly outperforms state-of-the-art approaches by 23.2%.
arXiv Detail & Related papers (2024-11-23T05:13:27Z) - WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z) - VITON-DiT: Learning In-the-Wild Video Try-On from Human Dance Videos via Diffusion Transformers [53.45587477621942]
We propose the first DiT-based video try-on framework for practical in-the-wild applications, named VITON-DiT.
Specifically, VITON-DiT consists of a garment extractor, a Spatial-Temporal denoising DiT, and an identity preservation ControlNet.
We also introduce random selection strategies during training and an Interpolated Auto-Regressive (IAR) technique at inference to facilitate long video generation.
arXiv Detail & Related papers (2024-05-28T16:21:03Z) - ViViD: Video Virtual Try-on using Diffusion Models [46.710863047471264]
Video virtual try-on aims to transfer a clothing item onto the video of a target person.
Previous video-based try-on solutions can only produce low-quality, blurry results.
We present ViViD, a novel framework employing powerful diffusion models to tackle the task of video virtual try-on.
arXiv Detail & Related papers (2024-05-20T05:28:22Z) - MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form
Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z) - MAGVIT: Masked Generative Video Transformer [129.50814875955444]
We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model.
A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains.
arXiv Detail & Related papers (2022-12-10T04:26:32Z) - Deep Video Prior for Video Consistency and Propagation [58.250209011891904]
We present a novel and general approach for blind video temporal consistency.
Our method is only trained on a pair of original and processed videos directly instead of a large dataset.
We show that temporal consistency can be achieved by training a convolutional neural network on a video with Deep Video Prior, as sketched below this entry.
arXiv Detail & Related papers (2022-01-27T16:38:52Z)
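The Deep Video Prior entry above rests on a simple mechanism: fit a network to a single original/processed video pair and rely on its implicit prior, with early stopping, to produce temporally consistent output. A minimal, hypothetical sketch of that per-video training loop follows; the tiny convolutional network, L1 loss, and step count are placeholders, not the published implementation.

```python
# Hedged sketch of the Deep Video Prior idea: per-video training, no dataset.
# Assumptions: frames are preloaded (T, 3, H, W) tensors; a small conv net
# stands in for the actual architecture used in the paper.
import torch
import torch.nn as nn

def train_deep_video_prior(original, processed, steps=500, lr=1e-4):
    """original, processed: (T, 3, H, W) tensors from ONE video pair."""
    net = nn.Sequential(                      # placeholder network
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 3, 3, padding=1),
    )
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.L1Loss()
    for _ in range(steps):                    # early stopping is essential
        idx = torch.randint(0, original.shape[0], (1,))
        pred = net(original[idx])             # map original -> processed frame
        loss = loss_fn(pred, processed[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return net(original)                  # temporally consistent frames
```

The key design choice is stopping early: the network fits the content shared across frames before it fits the per-frame flicker, so the early-stopped reconstruction is temporally consistent.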
This list is automatically generated from the titles and abstracts of the papers on this site.