Listen to Rhythm, Choose Movements: Autoregressive Multimodal Dance Generation via Diffusion and Mamba with Decoupled Dance Dataset
- URL: http://arxiv.org/abs/2601.03323v1
- Date: Tue, 06 Jan 2026 14:59:22 GMT
- Title: Listen to Rhythm, Choose Movements: Autoregressive Multimodal Dance Generation via Diffusion and Mamba with Decoupled Dance Dataset
- Authors: Oran Duan, Yinghua Shen, Yingzhu Lv, Luyang Jie, Yaxin Liu, Qiong Wu,
- Abstract summary: Listen to Rhythm, Choose Movements (LRCM) is a multimodal-guided diffusion framework supporting both diverse input modalities and autoregressive dance motion generation. We will release the full dataset and pretrained models publicly upon acceptance.
- Score: 8.721362823189077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advances in generative models and sequence learning have greatly promoted research in dance motion generation, yet current methods still suffer from coarse semantic control and poor coherence in long sequences. In this work, we present Listen to Rhythm, Choose Movements (LRCM), a multimodal-guided diffusion framework supporting both diverse input modalities and autoregressive dance motion generation. We explore a feature decoupling paradigm for dance datasets and generalize it to the Motorica Dance dataset, separating motion capture data, audio rhythm, and professionally annotated global and local text descriptions. Our diffusion architecture integrates an audio-latent Conformer and a text-latent Cross-Conformer, and incorporates a Motion Temporal Mamba Module (MTMM) to enable smooth, long-duration autoregressive synthesis. Experimental results indicate that LRCM delivers strong performance in both functional capability and quantitative metrics, demonstrating notable potential in multimodal input scenarios and extended sequence generation. We will release the full codebase, dataset, and pretrained models publicly upon acceptance.
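The abstract above describes the recipe at the architectural level: a diffusion denoiser conditioned on audio and text latents, rolled out autoregressively over fixed-length motion chunks so that long sequences stay coherent. The following is a minimal, hedged sketch of that recipe, not the authors' implementation: a plain cross-attention layer stands in for the audio-latent Conformer and text-latent Cross-Conformer, a GRU stands in for the Motion Temporal Mamba Module, and all names (MotionDenoiser, generate_autoregressive), feature dimensions, and the toy denoising schedule are assumptions made purely for illustration.

```python
# Illustrative sketch only (not the LRCM code): a chunk-wise, condition-guided
# denoiser rolled out autoregressively. Module choices are stand-ins, see lead-in.
import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    """Predicts the clean motion chunk x0 from a noisy chunk plus conditions."""
    def __init__(self, motion_dim=139, latent_dim=256, audio_dim=35, text_dim=768):
        super().__init__()
        self.in_proj = nn.Linear(motion_dim, latent_dim)
        self.audio_proj = nn.Linear(audio_dim, latent_dim)   # audio-latent pathway
        self.text_proj = nn.Linear(text_dim, latent_dim)     # text-latent pathway
        self.t_embed = nn.Sequential(nn.Linear(1, latent_dim), nn.SiLU(),
                                     nn.Linear(latent_dim, latent_dim))
        # Cross-attention from motion latents to text tokens (Cross-Conformer stand-in).
        self.cross_attn = nn.MultiheadAttention(latent_dim, 4, batch_first=True)
        # Temporal module over the frame axis (Mamba stand-in, here a GRU).
        self.temporal = nn.GRU(latent_dim, latent_dim, batch_first=True)
        self.out_proj = nn.Linear(latent_dim, motion_dim)

    def forward(self, x_t, t, audio, text, history):
        h = self.in_proj(x_t) + self.audio_proj(audio) + self.t_embed(t)[:, None, :]
        txt = self.text_proj(text)
        h, _ = self.cross_attn(h, txt, txt)
        # Warm-start the recurrent state with the previously generated chunk so that
        # consecutive chunks stay coherent (the autoregressive part of the recipe).
        _, state = self.temporal(self.in_proj(history))
        h, _ = self.temporal(h, state)
        return self.out_proj(h)

@torch.no_grad()
def generate_autoregressive(model, audio, text, chunk=120, steps=8, motion_dim=139):
    """Toy denoising loop: generate one chunk at a time, conditioned on the last one."""
    B, T = audio.shape[0], audio.shape[1]
    history = torch.zeros(B, chunk, motion_dim)
    outputs = []
    for start in range(0, T - chunk + 1, chunk):
        x = torch.randn(B, chunk, motion_dim)
        for s in reversed(range(steps)):
            t = torch.full((B, 1), (s + 1) / steps)
            x0 = model(x, t, audio[:, start:start + chunk], text, history)
            x = x0 + torch.randn_like(x0) * (s / steps)   # crude re-noising schedule
        outputs.append(x)
        history = x
    return torch.cat(outputs, dim=1)

model = MotionDenoiser()
motion = generate_autoregressive(model,
                                 audio=torch.randn(1, 360, 35),   # dummy audio features: 3 chunks of 120 frames
                                 text=torch.randn(1, 16, 768))    # dummy text token embeddings
print(motion.shape)  # torch.Size([1, 360, 139])
```

Warm-starting the temporal module's state with the previously generated chunk is one simple way to realize the autoregressive coupling the abstract mentions; the paper's actual chunking and conditioning scheme may differ.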
Related papers
- Tempo as the Stable Cue: Hierarchical Mixture of Tempo and Beat Experts for Music to 3D Dance Generation [62.82943523102]
Music to 3D dance generation aims to synthesize realistic and rhythmically synchronized human dance from music.
We propose TempoMoE, a hierarchical tempo-aware Mixture-of-Experts module.
We show that TempoMoE achieves state-of-the-art results in dance quality and rhythm alignment.
arXiv Detail & Related papers (2025-12-21T16:57:08Z)
- GACA-DiT: Diffusion-based Dance-to-Music Generation with Genre-Adaptive Rhythm and Context-Aware Alignment [16.93446224499017]
Dance-to-music (D2M) generation aims to automatically compose music that is rhythmically and temporally aligned with dance movements.
We propose GACA-DiT, a diffusion transformer-based framework with two novel modules for rhythmically consistent and temporally aligned music generation.
Experiments on the AIST++ and TikTok datasets demonstrate that GACA-DiT outperforms state-of-the-art methods in both objective metrics and human evaluation.
arXiv Detail & Related papers (2025-10-28T09:26:59Z)
- OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation [52.579531290307926]
This paper introduces OmniMotion-X, a versatile framework for whole-body human motion generation.
OmniMotion-X efficiently supports diverse multimodal tasks, including text-to-motion, music-to-dance, and speech-to-gesture.
To enable high-quality multimodal training, we construct OmniMoCap-X, the largest unified multimodal motion dataset to date.
arXiv Detail & Related papers (2025-10-22T17:25:33Z)
- MotionVerse: A Unified Multimodal Framework for Motion Comprehension, Generation and Editing [53.98607267063729]
MotionVerse is a framework to comprehend, generate, and edit human motion in both single-person and multi-person scenarios.
We employ a motion tokenizer with residual quantization, which converts continuous motion sequences into multi-stream discrete tokens.
We also introduce a Delay Parallel Modeling strategy, which temporally staggers the encoding of the residual token streams (a generic sketch of this delay pattern appears after this list).
arXiv Detail & Related papers (2025-09-28T04:20:56Z)
- JointDiff: Bridging Continuous and Discrete in Multi-Agent Trajectory Generation [75.58351043849385]
Generative models often treat continuous data and discrete events as separate processes, creating a gap in modeling complex systems where the two unfold synchronously.
To bridge this gap, we introduce JointDiff, a novel diffusion framework designed to couple these two processes by simultaneously generating continuous temporal data and synchronous discrete events.
JointDiff achieves state-of-the-art performance, demonstrating that joint modeling is crucial for building realistic and controllable models for interactive systems.
arXiv Detail & Related papers (2025-09-26T16:04:00Z)
- MotionRAG-Diff: A Retrieval-Augmented Diffusion Framework for Long-Term Music-to-Dance Generation [10.203209816178552]
MotionRAG-Diff is a hybrid framework that integrates Retrieval-Augmented Generation and diffusion-based refinement.
Our method introduces three core innovations.
It achieves state-of-the-art performance in motion quality, diversity, and music-motion synchronization accuracy.
arXiv Detail & Related papers (2025-06-03T09:12:48Z)
- Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion [33.9786226622757]
We propose MoMADiff, a robust motion generation framework that generates 3D human motion from text descriptions.
Our model supports flexible user-provided specifications, enabling precise control over both spatial and temporal aspects of motion synthesis.
Our method consistently outperforms state-of-the-art models in motion quality, instruction fidelity, and adherence to the provided specifications.
arXiv Detail & Related papers (2025-05-16T09:06:15Z)
- MAG: Multi-Modal Aligned Autoregressive Co-Speech Gesture Generation without Vector Quantization [8.605691647343065]
This work focuses on full-body co-speech gesture generation.
Existing methods typically employ an autoregressive model accompanied by vector-quantized tokens for gesture generation.
We propose MAG, a novel multi-modal aligned framework for high-quality and diverse co-speech gesture synthesis without relying on discrete tokenization.
arXiv Detail & Related papers (2025-03-18T09:02:02Z)
- UniMuMo: Unified Text, Music and Motion Generation [57.72514622935806]
We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities.
By converting music, motion, and text into token-based representation, our model bridges these modalities through a unified encoder-decoder transformer architecture.
arXiv Detail & Related papers (2024-10-06T16:04:05Z)
- DiffDance: Cascaded Human Motion Diffusion Model for Dance Generation [89.50310360658791]
We present a novel cascaded motion diffusion model, DiffDance, designed for high-resolution, long-form dance generation.
This model comprises a music-to-dance diffusion model and a sequence super-resolution diffusion model.
We demonstrate that DiffDance is capable of generating realistic dance sequences that align effectively with the input music.
arXiv Detail & Related papers (2023-08-05T16:18:57Z)
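As noted in the MotionVerse entry above, residual quantization yields several parallel token streams per motion sequence, and a Delay Parallel Modeling strategy temporally staggers their encoding. The summary does not spell out the exact scheme, so the snippet below is only a generic delay-pattern sketch under that reading: residual level k is shifted k steps later, so the coarse token for a frame is produced before its finer residuals. The PAD id and function names are hypothetical.

```python
# Generic delay-pattern layout for K residual token streams (illustration only).
import torch

PAD = -1  # hypothetical padding id for positions not yet occupied by a real token

def apply_delay(streams: torch.Tensor) -> torch.Tensor:
    """streams: (K, T) integer tokens from K residual quantizer levels.
    Returns a (K, T + K - 1) tensor where level k is delayed by k steps."""
    K, T = streams.shape
    out = torch.full((K, T + K - 1), PAD, dtype=streams.dtype)
    for k in range(K):
        out[k, k:k + T] = streams[k]
    return out

def remove_delay(delayed: torch.Tensor, T: int) -> torch.Tensor:
    """Inverse of apply_delay: realign the K streams back to length T."""
    K = delayed.shape[0]
    return torch.stack([delayed[k, k:k + T] for k in range(K)])

tokens = torch.randint(0, 512, (4, 6))   # 4 residual levels, 6 motion frames
delayed = apply_delay(tokens)
assert torch.equal(remove_delay(delayed, 6), tokens)
```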