MotionRAG-Diff: A Retrieval-Augmented Diffusion Framework for Long-Term Music-to-Dance Generation
- URL: http://arxiv.org/abs/2506.02661v1
- Date: Tue, 03 Jun 2025 09:12:48 GMT
- Title: MotionRAG-Diff: A Retrieval-Augmented Diffusion Framework for Long-Term Music-to-Dance Generation
- Authors: Mingyang Huang, Peng Zhang, Bang Zhang
- Abstract summary: MotionRAG-Diff is a hybrid framework that integrates Retrieval-Augmented Generation and diffusion-based refinement. Our method introduces three core innovations. It achieves state-of-the-art performance in motion quality, diversity, and music-motion synchronization accuracy.
- Score: 10.203209816178552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating long-term, coherent, and realistic music-conditioned dance sequences remains a challenging task in human motion synthesis. Existing approaches exhibit critical limitations: motion graph methods rely on fixed template libraries, restricting creative generation; diffusion models, while capable of producing novel motions, often lack temporal coherence and musical alignment. To address these challenges, we propose $\textbf{MotionRAG-Diff}$, a hybrid framework that integrates Retrieval-Augmented Generation (RAG) with diffusion-based refinement to enable high-quality, musically coherent dance generation for arbitrary long-term music inputs. Our method introduces three core innovations: (1) A cross-modal contrastive learning architecture that aligns heterogeneous music and dance representations in a shared latent space, establishing unsupervised semantic correspondence without paired data; (2) An optimized motion graph system for efficient retrieval and seamless concatenation of motion segments, ensuring realism and temporal coherence across long sequences; (3) A multi-condition diffusion model that jointly conditions on raw music signals and contrastive features to enhance motion quality and global synchronization. Extensive experiments demonstrate that MotionRAG-Diff achieves state-of-the-art performance in motion quality, diversity, and music-motion synchronization accuracy. This work establishes a new paradigm for music-driven dance generation by synergizing retrieval-based template fidelity with diffusion-based creative enhancement.
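The first innovation, cross-modal contrastive alignment of music and dance in a shared latent space, can be pictured with a minimal sketch. The code below assumes a symmetric InfoNCE objective that treats temporally co-occurring music and motion segments as positives, simple MLP projection heads in place of the real encoders, and placeholder feature dimensions and temperature; none of these choices are taken from the paper.

```python
# Minimal sketch of cross-modal contrastive alignment (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAligner(nn.Module):
    """Projects music and motion features into a shared latent space."""

    def __init__(self, music_dim=128, motion_dim=147, latent_dim=256, temperature=0.07):
        super().__init__()
        # Placeholder MLP heads standing in for the paper's actual encoders.
        self.music_proj = nn.Sequential(
            nn.Linear(music_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim))
        self.motion_proj = nn.Sequential(
            nn.Linear(motion_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim))
        self.temperature = temperature

    def forward(self, music_feats, motion_feats):
        # L2-normalized embeddings in the shared latent space.
        z_music = F.normalize(self.music_proj(music_feats), dim=-1)
        z_motion = F.normalize(self.motion_proj(motion_feats), dim=-1)
        # Cosine-similarity logits between every music clip and every motion clip.
        logits = z_music @ z_motion.t() / self.temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE: co-occurring segments (the diagonal) are positives.
        loss = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))
        return loss, z_music, z_motion

if __name__ == "__main__":
    model = CrossModalAligner()
    music = torch.randn(32, 128)   # a batch of music segment features
    motion = torch.randn(32, 147)  # the co-occurring motion segment features
    loss, _, _ = model(music, motion)
    loss.backward()
```

Once such a shared space is learned, retrieval presumably reduces to a nearest-neighbor search of motion embeddings against the query music embedding, whose results the motion graph then concatenates and the multi-condition diffusion model refines.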
Related papers
- PAMD: Plausibility-Aware Motion Diffusion Model for Long Dance Generation [51.2555550979386]
Plausibility-Aware Motion Diffusion (PAMD) is a framework for generating dances that are both musically aligned and physically realistic. To provide more effective guidance during generation, we incorporate Prior Motion Guidance (PMG). Experiments show that PAMD significantly improves musical alignment and enhances the physical plausibility of generated motions.
arXiv Detail & Related papers (2025-05-26T14:44:09Z) - AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars [65.53676584955686]
Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans. We propose AsynFusion, a novel framework that leverages diffusion transformers to achieve cohesive expression and gesture synthesis. AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations.
arXiv Detail & Related papers (2025-05-21T03:28:53Z) - ReactDance: Progressive-Granular Representation for Long-Term Coherent Reactive Dance Generation [2.1920014462753064]
Reactive dance generation (RDG) produces follower movements conditioned on a guiding dancer and music. We present ReactDance, a novel diffusion-based framework for high-fidelity RDG with long-term coherence and multi-scale controllability.
arXiv Detail & Related papers (2025-05-08T18:42:38Z) - GENMO: A GENeralist Model for Human MOtion [64.16188966024542]
We present GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework. Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals. Our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control.
arXiv Detail & Related papers (2025-05-02T17:59:55Z) - Extending Visual Dynamics for Video-to-Music Generation [51.274561293909926]
DyViM is a novel framework to enhance dynamics modeling for video-to-music generation. High-level semantics are conveyed through a cross-attention mechanism. Experiments demonstrate DyViM's superiority over state-of-the-art (SOTA) methods.
arXiv Detail & Related papers (2025-04-10T09:47:26Z) - InterDance: Reactive 3D Dance Generation with Realistic Duet Interactions [67.37790144477503]
We propose InterDance, a large-scale duet dance dataset that significantly enhances motion quality, data scale, and the variety of dance genres. We introduce a diffusion-based framework with an interaction refinement guidance strategy to optimize the realism of interactions progressively.
arXiv Detail & Related papers (2024-12-22T11:53:51Z) - LongDanceDiff: Long-term Dance Generation with Conditional Diffusion Model [3.036230795326545]
LongDanceDiff is a conditional diffusion model for sequence-to-sequence long-term dance generation.
It addresses the challenges of temporal coherence and spatial constraints.
We also address common visual quality issues in dance generation, such as foot sliding and unsmooth motion.
arXiv Detail & Related papers (2023-08-23T06:37:41Z) - DiffDance: Cascaded Human Motion Diffusion Model for Dance Generation [89.50310360658791]
We present a novel cascaded motion diffusion model, DiffDance, designed for high-resolution, long-form dance generation.
This model comprises a music-to-dance diffusion model and a sequence super-resolution diffusion model.
We demonstrate that DiffDance is capable of generating realistic dance sequences that align effectively with the input music.
arXiv Detail & Related papers (2023-08-05T16:18:57Z) - Taming Diffusion Models for Music-driven Conducting Motion Generation [1.0624606551524207]
This paper presents Diffusion-Conductor, a novel DDIM-based approach for music-driven conducting motion generation.
We propose a random masking strategy to improve feature robustness and use a pair of geometric loss functions to impose additional regularization.
We also design several novel metrics, including Frechet Gesture Distance (FGD) and Beat Consistency Score (BC), for a more comprehensive evaluation of the generated motion.
arXiv Detail & Related papers (2023-06-15T03:49:24Z)
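As an aside on the FGD metric mentioned in the last entry: a Frechet-style distance between real and generated motion feature distributions follows the standard Frechet/FID formula, ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2(S_r S_g)^(1/2)). The sketch below implements only that generic formula; the gesture feature extractor and any paper-specific details are assumptions, not the cited authors' implementation.

```python
# Generic Frechet distance between two sets of motion/gesture features (N, D).
import numpy as np
from scipy import linalg

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to real and generated features."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    diff = mu_r - mu_g
    # Matrix square root of the covariance product; drop tiny imaginary parts.
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 64))            # hypothetical real gesture features
    fake = rng.normal(0.2, 1.0, size=(500, 64))  # hypothetical generated features
    print(frechet_distance(real, fake))
```

Lower values indicate that the generated motion's feature distribution sits closer to the real data, which is how such scores are typically read.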