Taming Diffusion Models for Music-driven Conducting Motion Generation
- URL: http://arxiv.org/abs/2306.10065v2
- Date: Mon, 13 Nov 2023 08:44:28 GMT
- Title: Taming Diffusion Models for Music-driven Conducting Motion Generation
- Authors: Zhuoran Zhao, Jinbin Bai, Delong Chen, Debang Wang, Yubo Pan
- Abstract summary: This paper presents Diffusion-Conductor, a novel DDIM-based approach for music-driven conducting motion generation.
We propose a random masking strategy to improve the feature robustness, and use a pair of geometric loss functions to impose additional regularizations.
We also design several novel metrics, including Frechet Gesture Distance (FGD) and Beat Consistency Score (BC) for a more comprehensive evaluation of the generated motion.
- Score: 1.0624606551524207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating the motion of orchestral conductors from a given piece of symphony
music is a challenging task since it requires a model to learn semantic music
features and capture the underlying distribution of real conducting motion.
Prior works have applied Generative Adversarial Networks (GAN) to this task,
but the promising diffusion model, which recently showed its advantages in
terms of both training stability and output quality, has not been exploited in
this context. This paper presents Diffusion-Conductor, a novel DDIM-based
approach for music-driven conducting motion generation, which integrates the
diffusion model into a two-stage learning framework. We further propose a random
masking strategy to improve the feature robustness, and use a pair of geometric
loss functions to impose additional regularizations and increase motion
diversity. We also design several novel metrics, including Frechet Gesture
Distance (FGD) and Beat Consistency Score (BC) for a more comprehensive
evaluation of the generated motion. Experimental results demonstrate the
advantages of our model.
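The abstract names two evaluation metrics, Frechet Gesture Distance (FGD) and Beat Consistency Score (BC), but does not define them here. The sketch below follows the definitions commonly used in gesture and dance generation work: FGD as an FID-style Fréchet distance between Gaussians fitted to motion features, and BC as the average Gaussian-kernel alignment between music beats and kinematic beats. The feature extractor, the kernel bandwidth `sigma`, and the function names are assumptions, not details taken from the paper.

```python
# Minimal sketch of FGD and BC as commonly defined in the gesture/dance
# generation literature; the exact choices made by Diffusion-Conductor
# (feature encoder, beat detection, bandwidth) are assumed, not confirmed.
import numpy as np
from scipy import linalg


def frechet_gesture_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to motion features.

    real_feats, gen_feats: (N, D) arrays, e.g. embeddings of real and
    generated conducting motion from a pretrained motion encoder.
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    # Matrix square root of the covariance product; drop tiny imaginary parts.
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))


def beat_consistency(music_beats: np.ndarray, motion_beats: np.ndarray,
                     sigma: float = 0.1) -> float:
    """Mean Gaussian-kernel distance from each music beat to the nearest
    kinematic beat (all beat times in seconds). Kinematic beats are
    typically taken as local minima of joint speed; that convention is
    an assumption here.
    """
    music_beats = np.asarray(music_beats, dtype=float)
    motion_beats = np.asarray(motion_beats, dtype=float)
    if music_beats.size == 0 or motion_beats.size == 0:
        return 0.0
    # Distance from every music beat to its closest kinematic beat.
    nearest = np.abs(music_beats[:, None] - motion_beats[None, :]).min(axis=1)
    return float(np.exp(-(nearest ** 2) / (2.0 * sigma ** 2)).mean())
```

Lower FGD indicates that generated motion features are distributed more like real conducting motion; a BC closer to 1 indicates that kinematic beats fall near the music beats.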
Related papers
- Energy-Based Diffusion Language Models for Text Generation [126.23425882687195]
Energy-based Diffusion Language Model (EDLM) is an energy-based model operating at the full sequence level for each diffusion step.
Our framework offers a 1.3× sampling speedup over existing diffusion models.
arXiv Detail & Related papers (2024-10-28T17:25:56Z)
- ProMotion: Prototypes As Motion Learners [46.08051377180652]
We introduce ProMotion, a unified prototypical framework engineered to model fundamental motion tasks.
ProMotion offers a range of compelling attributes that set it apart from current task-specific paradigms.
We capitalize on a dual mechanism involving the feature denoiser and the prototypical learner to decipher the intricacies of motion.
arXiv Detail & Related papers (2024-06-07T15:10:33Z)
- Music Consistency Models [31.415900049111023]
We present Music Consistency Models (MusicCM), which leverages the concept of consistency models to efficiently synthesize mel-spectrograms for music clips.
Building upon existing text-to-music diffusion models, the MusicCM model incorporates consistency distillation and adversarial discriminator training.
Experimental results reveal the effectiveness of our model in terms of computational efficiency, fidelity, and naturalness.
arXiv Detail & Related papers (2024-04-20T11:52:30Z)
- MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation [29.620451579580763]
We propose a novel motion-disentangled diffusion model for talking head generation, dubbed MoDiTalker.
We introduce two modules: audio-to-motion (AToM), which generates synchronized lip motion from audio, and motion-to-video (MToV), which produces high-quality head video following the generated motion.
Our experiments conducted on standard benchmarks demonstrate that our model achieves superior performance compared to existing models.
arXiv Detail & Related papers (2024-03-28T04:35:42Z)
- Animate Your Motion: Turning Still Images into Dynamic Videos [58.63109848837741]
We introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs.
SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions.
Our design significantly enhances video quality, motion precision, and semantic coherence.
arXiv Detail & Related papers (2024-03-15T10:36:24Z)
- DITTO: Diffusion Inference-Time T-Optimization for Music Generation [49.90109850026932]
Diffusion Inference-Time T-Optimization (DITTO) is a framework for controlling pre-trained text-to-music diffusion models at inference time.
We demonstrate a surprisingly wide range of applications for music generation, including inpainting, outpainting, and looping, as well as intensity, melody, and musical structure control.
arXiv Detail & Related papers (2024-01-22T18:10:10Z)
- MotionMix: Weakly-Supervised Diffusion for Controllable Motion Generation [19.999239668765885]
MotionMix is a weakly-supervised diffusion model that leverages both noisy and unannotated motion sequences.
Our framework consistently achieves state-of-the-art performances on text-to-motion, action-to-motion, and music-to-dance tasks.
arXiv Detail & Related papers (2024-01-20T04:58:06Z)
- DiffDance: Cascaded Human Motion Diffusion Model for Dance Generation [89.50310360658791]
We present a novel cascaded motion diffusion model, DiffDance, designed for high-resolution, long-form dance generation.
This model comprises a music-to-dance diffusion model and a sequence super-resolution diffusion model.
We demonstrate that DiffDance is capable of generating realistic dance sequences that align effectively with the input music.
arXiv Detail & Related papers (2023-08-05T16:18:57Z)
- Modiff: Action-Conditioned 3D Motion Generation with Denoising Diffusion Probabilistic Models [58.357180353368896]
We propose a conditional paradigm that benefits from the denoising diffusion probabilistic model (DDPM) to tackle the problem of realistic and diverse action-conditioned 3D skeleton-based motion generation.
Ours is a pioneering attempt to use DDPM to synthesize a variable number of motion sequences conditioned on a categorical action.
arXiv Detail & Related papers (2023-01-10T13:15:42Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine interactions among only a few selected foreground objects with a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.