Enhancing Dance-to-Music Generation via Negative Conditioning Latent Diffusion Model
- URL: http://arxiv.org/abs/2503.22138v1
- Date: Fri, 28 Mar 2025 04:23:03 GMT
- Title: Enhancing Dance-to-Music Generation via Negative Conditioning Latent Diffusion Model
- Authors: Changchang Sun, Gaowen Liu, Charles Fleming, Yan Yan
- Abstract summary: Conditional diffusion models have gained increasing attention owing to their impressive results for cross-modal synthesis. In this paper, we focus on the problem of generating music synchronized with the rhythmic visual cues of a given dance video. We propose to enhance the quality of generated music and its synchronization with dance videos by adopting both positive and negative rhythmic information as conditions.
- Score: 13.693082830029224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conditional diffusion models have gained increasing attention owing to their impressive results for cross-modal synthesis, where strong alignment between the conditioning input and the generated output can be achieved by training a time-conditioned U-Net augmented with a cross-attention mechanism. In this paper, we focus on the problem of generating music synchronized with the rhythmic visual cues of a given dance video. Considering that bi-directional guidance is more beneficial for training a diffusion model, we propose to enhance the quality of generated music and its synchronization with dance videos by adopting both positive and negative rhythmic information as conditions (PN-Diffusion), devising dual diffusion and reverse processes. Specifically, to train a sequential multi-modal U-Net structure, PN-Diffusion combines a noise prediction objective for positive conditioning with an additional noise prediction objective for negative conditioning. To accurately define and select both positive and negative conditioning, we exploit temporal correlations in dance videos, capturing positive and negative rhythmic cues by playing them forward and backward, respectively. Through subjective and objective evaluations of input-output correspondence in terms of dance-music beat alignment and the quality of generated music, experimental results on the AIST++ and TikTok dance video datasets demonstrate that our model outperforms SOTA dance-to-music generation models.
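To make the training objective concrete, below is a minimal PyTorch sketch of a dual noise-prediction loss in the spirit of PN-Diffusion. The condition encoder, module shapes, and the weight `w_neg` are illustrative assumptions (the abstract does not detail the sequential multi-modal U-Net or the exact form of the negative objective); the key mechanism shown is that reversing the video's frame order yields a distinct, order-sensitive negative conditioning code.

```python
import torch
import torch.nn as nn

class SeqCondEncoder(nn.Module):
    # Order-sensitive summary of per-frame video features (assumed design);
    # playing the video backward therefore yields a different conditioning code.
    def __init__(self, dim=64):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, feats):                 # feats: (B, T, dim)
        _, h = self.gru(feats)
        return h.squeeze(0)                   # (B, dim)

class TinyCondUNet(nn.Module):
    # Stand-in for the paper's sequential multi-modal U-Net.
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 3, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, x_t, t_emb, cond):
        return self.net(torch.cat([x_t, t_emb, cond], dim=-1))

def pn_diffusion_loss(model, enc, x0, t_emb, video_feats, alpha_bar_t, w_neg=1.0):
    # One plausible reading of the dual objective: a noise-prediction term
    # conditioned on the forward-played video (positive rhythmic cue) plus a
    # weighted term conditioned on the backward-played video (negative cue).
    noise = torch.randn_like(x0)
    # Standard DDPM forward process: noise the clean music latent x0 at step t.
    x_t = alpha_bar_t.sqrt() * x0 + (1.0 - alpha_bar_t).sqrt() * noise
    pos = enc(video_feats)                           # forward playback
    neg = enc(torch.flip(video_feats, dims=[1]))     # backward playback
    loss_pos = (model(x_t, t_emb, pos) - noise).pow(2).mean()
    loss_neg = (model(x_t, t_emb, neg) - noise).pow(2).mean()
    return loss_pos + w_neg * loss_neg

B, T, D = 4, 32, 64
model, enc = TinyCondUNet(D), SeqCondEncoder(D)
loss = pn_diffusion_loss(model, enc, torch.randn(B, D), torch.randn(B, D),
                         torch.randn(B, T, D), torch.tensor(0.5))
```

Note that an order-sensitive encoder (here a GRU) is essential: with plain temporal pooling, forward and backward playback would produce identical conditioning codes and the two loss terms would collapse into one.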
Related papers
- Expressive Music Data Processing and Generation [4.549093083765949]
Musical expressivity and coherence are indispensable in music composition and performance.
We introduce a listening-based data-processing technique that captures the expressivity in musical performance.
arXiv Detail & Related papers (2025-03-14T21:56:07Z)
- Efficient Fine-Grained Guidance for Diffusion-Based Symbolic Music Generation [14.156461396686248]
We introduce an efficient Fine-Grained Guidance (FGG) approach within diffusion models.
FGG guides the diffusion models to generate music that aligns more closely with the control and intent of expert composers.
This approach empowers diffusion models to excel in advanced applications such as improvisation and interactive music creation.
arXiv Detail & Related papers (2024-10-11T00:41:46Z)
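The entry does not spell out the FGG mechanism, but gradient-based guidance at sampling time is the standard way to steer a diffusion model toward fine-grained constraints. The sketch below shows generic classifier-style guidance with a hypothetical `control_loss` and a toy update rule; it should not be read as the paper's exact method.

```python
import torch

def guided_step(denoiser, x, t, control_loss, scale=1.0):
    # One denoising step with gradient guidance: nudge the update with the
    # gradient of a control loss (e.g., a note/beat constraint). Generic
    # classifier-style guidance, not necessarily the paper's FGG rule.
    x = x.detach().requires_grad_(True)
    eps = denoiser(x, t)                            # predicted noise (placeholder)
    g = torch.autograd.grad(control_loss(x), x)[0]
    return (x - 0.1 * eps - scale * g).detach()     # toy update schedule

# Toy usage: steer samples toward values near 0.5 (a stand-in constraint).
denoiser = lambda x, t: x
constraint = lambda x: ((x - 0.5) ** 2).mean()
x = torch.randn(1, 16)
for t in range(8):
    x = guided_step(denoiser, x, t, constraint)
```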
- Dance Any Beat: Blending Beats with Visuals in Dance Video Generation [12.018432669719742]
We introduce a novel task: generating dance videos directly from images of individuals, guided by music.
Our solution, the Dance Any Beat Diffusion model (DabFusion), utilizes a reference image and a music piece to generate dance videos.
We evaluate DabFusion's performance using the AIST++ dataset, focusing on video quality, audio-video synchronization, and motion-music alignment.
arXiv Detail & Related papers (2024-05-15T11:33:07Z)
- Bidirectional Autoregressive Diffusion Model for Dance Generation [26.449135437337034]
We propose a Bidirectional Autoregressive Diffusion Model (BADM) for music-to-dance generation.
A bidirectional encoder is built to enforce that the generated dance is harmonious in both the forward and backward directions.
To make the generated dance motion smoother, a local information decoder is built for local motion enhancement.
arXiv Detail & Related papers (2024-02-06T19:42:18Z)
- DITTO: Diffusion Inference-Time T-Optimization for Music Generation [49.90109850026932]
Diffusion Inference-Time T-Optimization (DITTO) is a framework for controlling pre-trained text-to-music diffusion models at inference time.
We demonstrate a surprisingly wide range of applications for music generation, including inpainting, outpainting, and looping, as well as intensity, melody, and musical structure control; a generic sketch of the idea follows.
arXiv Detail & Related papers (2024-01-22T18:10:10Z)
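The core idea of optimizing at inference time rather than retraining can be sketched generically: treat the initial noise as a learnable parameter and backpropagate a target loss through a differentiable sampler. The toy sampler, update rule, and loss below are placeholders, not the paper's implementation.

```python
import torch

def sample(denoiser, x, steps=4):
    # Toy deterministic sampler; stands in for a differentiable DDIM loop.
    for t in torch.linspace(1.0, 0.0, steps):
        x = x - 0.25 * denoiser(x, t)       # placeholder update rule
    return x

def inference_time_optimize(denoiser, loss_fn, shape, iters=50, lr=0.05):
    # Optimize the initial noise so the *sampled output* minimizes loss_fn,
    # backpropagating through the sampler (the DITTO-style control idea).
    x_T = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([x_T], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = loss_fn(sample(denoiser, x_T))
        loss.backward()
        opt.step()
    return x_T.detach()

# Toy usage: steer outputs toward a target mean "intensity" of 0.7.
denoiser = lambda x, t: x                    # placeholder pre-trained model
loss_fn = lambda out: (out.mean() - 0.7).pow(2)
x_T = inference_time_optimize(denoiser, loss_fn, shape=(1, 16))
```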
- DiffDance: Cascaded Human Motion Diffusion Model for Dance Generation [89.50310360658791]
We present a novel cascaded motion diffusion model, DiffDance, designed for high-resolution, long-form dance generation.
This model comprises a music-to-dance diffusion model and a sequence super-resolution diffusion model.
We demonstrate that DiffDance is capable of generating realistic dance sequences that align effectively with the input music.
arXiv Detail & Related papers (2023-08-05T16:18:57Z)
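A cascade of this kind can be sketched as two denoisers: one produces a low-rate motion sequence from music, the other refines a temporally upsampled copy. Everything below (stage modules, toy sampling loop, upsampling by repetition) is an illustrative assumption, not DiffDance's architecture.

```python
import torch
import torch.nn as nn

class StageDenoiser(nn.Module):
    # Placeholder denoiser; both cascade stages reuse this toy block.
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 2, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, x, cond):
        return self.net(torch.cat([x, cond], dim=-1))

def cascade_generate(stage1, stage2, music, low_len=16, up=4, steps=8):
    # Stage 1: denoise a low-rate motion sequence conditioned on music.
    B, D = music.shape
    x = torch.randn(B, low_len, D)
    cond = music.unsqueeze(1).expand(-1, low_len, -1)
    for _ in range(steps):                            # toy sampling loop
        x = x - 0.1 * stage1(x, cond)
    # Stage 2: super-resolve a naively upsampled copy, conditioned on stage 1.
    x_hi = x.repeat_interleave(up, dim=1)
    for _ in range(steps):
        x_hi = x_hi - 0.1 * stage2(x_hi, x.repeat_interleave(up, dim=1))
    return x_hi

s1, s2 = StageDenoiser(32), StageDenoiser(32)
motion = cascade_generate(s1, s2, torch.randn(2, 32))
```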
- Taming Diffusion Models for Music-driven Conducting Motion Generation [1.0624606551524207]
This paper presents Diffusion-Conductor, a novel DDIM-based approach for music-driven conducting motion generation.
We propose a random masking strategy to improve the feature robustness, and use a pair of geometric loss functions to impose additional regularizations.
We also design several novel metrics, including Frechet Gesture Distance (FGD) and Beat Consistency Score (BC), for a more comprehensive evaluation of the generated motion.
arXiv Detail & Related papers (2023-06-15T03:49:24Z)
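FGD follows the same recipe as FID: fit Gaussians to features of real and generated motion and take the Fréchet distance between them. A minimal NumPy/SciPy sketch, assuming some motion feature extractor (not shown) has already produced the feature matrices:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_gesture_distance(real_feats, gen_feats):
    # Frechet distance between Gaussians fitted to (N, d) feature matrices:
    # ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^{1/2}).
    mu_r, mu_g = real_feats.mean(0), gen_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Sanity check: identical distributions should score near zero.
x = np.random.randn(500, 8)
print(frechet_gesture_distance(x, x + 1e-9))
```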
- Conditional Denoising Diffusion for Sequential Recommendation [62.127862728308045]
Two prominent generative models, Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs), have well-known drawbacks: GANs suffer from unstable optimization, while VAEs are prone to posterior collapse and over-smoothed generations.
We present a conditional denoising diffusion model, which includes a sequence encoder, a cross-attentive denoising decoder, and a step-wise diffuser; a minimal sketch of the decoder follows.
arXiv Detail & Related papers (2023-04-22T15:32:59Z)
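In the sketch below, the noisy target-item embedding, combined with a timestep embedding, queries the encoded interaction history via cross-attention to predict the noise. The GRU encoder and layer sizes are placeholder assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CrossAttentiveDenoiser(nn.Module):
    # Noisy target embedding (query) attends over the encoded interaction
    # history (keys/values) to predict the injected noise.
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.encoder = nn.GRU(dim, dim, batch_first=True)  # sequence-encoder stand-in
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_target, history, t_emb):
        h, _ = self.encoder(history)                # (B, T, dim) encoded history
        q = (noisy_target + t_emb).unsqueeze(1)     # (B, 1, dim) query
        ctx, _ = self.attn(q, h, h)                 # cross-attention
        return self.out(ctx.squeeze(1))             # predicted noise

model = CrossAttentiveDenoiser()
noise_hat = model(torch.randn(2, 64), torch.randn(2, 10, 64), torch.randn(2, 64))
```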
- BRACE: The Breakdancing Competition Dataset for Dance Motion Synthesis [123.73677487809418]
We introduce a new dataset aiming to challenge common assumptions in dance motion synthesis.
We focus on breakdancing, which features acrobatic moves and tangled postures.
Our efforts produced the BRACE dataset, which contains over 3 hours and 30 minutes of densely annotated poses.
arXiv Detail & Related papers (2022-07-20T18:03:54Z)
- Quantized GAN for Complex Music Generation from Dance Videos [48.196705493763986]
We present Dance2Music-GAN (D2M-GAN), a novel adversarial multi-modal framework that generates musical samples conditioned on dance videos.
Our proposed framework takes dance video frames and human body motion as input, and learns to generate music samples that plausibly accompany the corresponding input.
arXiv Detail & Related papers (2022-04-01T17:53:39Z)
- Music-to-Dance Generation with Optimal Transport [48.92483627635586]
We propose a Music-to-Dance with Optimal Transport Network (MDOT-Net) for learning to generate 3D dance choreography from music.
We introduce an optimal transport distance for evaluating the authenticity of the generated dance distribution and a Gromov-Wasserstein distance to measure the correspondence between the dance distribution and the input music.
arXiv Detail & Related papers (2021-12-03T09:37:26Z)
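To illustrate the optimal-transport ingredient of the entry above, here is a self-contained Sinkhorn sketch computing an entropic-regularized OT cost between two empirical distributions. It demonstrates the distance concept, not MDOT-Net's actual objective, and omits the Gromov-Wasserstein term.

```python
import numpy as np

def sinkhorn_ot(cost, a, b, reg=0.1, iters=200):
    # Entropic-regularized OT via Sinkhorn iterations: returns the transport
    # cost between histograms a and b under the pairwise cost matrix.
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]   # optimal transport plan
    return float((plan * cost).sum())

# Toy usage: distance between two small point clouds (e.g., dance vs. music features).
x, y = np.random.randn(5, 3), np.random.randn(6, 3)
cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
a, b = np.full(5, 1 / 5), np.full(6, 1 / 6)
print(sinkhorn_ot(cost, a, b))
```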