GACA-DiT: Diffusion-based Dance-to-Music Generation with Genre-Adaptive Rhythm and Context-Aware Alignment
- URL: http://arxiv.org/abs/2510.26818v1
- Date: Tue, 28 Oct 2025 09:26:59 GMT
- Title: GACA-DiT: Diffusion-based Dance-to-Music Generation with Genre-Adaptive Rhythm and Context-Aware Alignment
- Authors: Jinting Wang, Chenxing Li, Li Liu,
- Abstract summary: Dance-to-music (D2M) generation aims to automatically compose music that is rhythmically and temporally aligned with dance movements. We propose GACA-DiT, a diffusion transformer-based framework with two novel modules for rhythmically consistent and temporally aligned music generation. Experiments on the AIST++ and TikTok datasets demonstrate that GACA-DiT outperforms state-of-the-art methods in both objective metrics and human evaluation.
- Score: 16.93446224499017
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Dance-to-music (D2M) generation aims to automatically compose music that is rhythmically and temporally aligned with dance movements. Existing methods typically rely on coarse rhythm embeddings, such as global motion features or binarized joint-based rhythm values, which discard fine-grained motion cues and result in weak rhythmic alignment. Moreover, temporal mismatches introduced by feature downsampling further hinder precise synchronization between dance and music. To address these problems, we propose GACA-DiT, a diffusion transformer-based framework with two novel modules for rhythmically consistent and temporally aligned music generation. First, a genre-adaptive rhythm extraction module combines multi-scale temporal wavelet analysis and spatial phase histograms with adaptive joint weighting to capture fine-grained, genre-specific rhythm patterns. Second, a context-aware temporal alignment module resolves temporal mismatches using learnable context queries to align music latents with relevant dance rhythm features. Extensive experiments on the AIST++ and TikTok datasets demonstrate that GACA-DiT outperforms state-of-the-art methods in both objective metrics and human evaluation. Project page: https://beria-moon.github.io/GACA-DiT/.
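The two modules described in the abstract can be sketched in rough form. The PyTorch snippet below is an illustrative reading, not the authors' implementation: it assumes 3D joint positions of shape (batch, time, joints, 3), approximates the multi-scale temporal wavelet analysis with a Haar-style decomposition of per-joint motion energy, stands in for the genre-adaptive joint weighting with a learned softmax over joints, and models the context-aware temporal alignment as cross-attention through a set of learnable context queries. All class names, shapes, and hyperparameters are hypothetical.

```python
# Hedged sketch of the two ideas in the abstract; NOT the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GenreAdaptiveRhythm(nn.Module):
    """Sketch: multi-scale temporal analysis of joint motion with learned joint weights."""

    def __init__(self, num_joints: int, num_scales: int = 3, dim: int = 256):
        super().__init__()
        self.num_scales = num_scales
        # Learned per-joint importance, a stand-in for the paper's adaptive joint weighting.
        self.joint_logits = nn.Parameter(torch.zeros(num_joints))
        self.proj = nn.Linear(num_scales, dim)

    def forward(self, joints: torch.Tensor) -> torch.Tensor:
        # joints: (B, T, J, 3) 3D joint positions over time.
        motion = joints.diff(dim=1).norm(dim=-1)        # (B, T-1, J) per-joint speed
        weights = F.softmax(self.joint_logits, dim=0)   # (J,)
        x = (motion * weights).transpose(1, 2)          # (B, J, T-1)
        feats = []
        for _ in range(self.num_scales):
            # Haar-style split into detail (fast) and approximation (slow) parts,
            # a crude proxy for a temporal wavelet decomposition.
            detail = (x[..., 1::2] - x[..., 0:-1:2]) / 2.0
            x = (x[..., 1::2] + x[..., 0:-1:2]) / 2.0
            feats.append(detail.abs().mean(dim=-1))     # (B, J) rhythm energy per scale
        rhythm = torch.stack(feats, dim=-1)             # (B, J, num_scales)
        return self.proj(rhythm)                        # (B, J, dim) rhythm tokens


class ContextAwareAlignment(nn.Module):
    """Sketch: learnable context queries bridge dance rhythm tokens and music latents."""

    def __init__(self, dim: int = 256, num_queries: int = 32, num_heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.gather = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.align = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, music_latents: torch.Tensor, rhythm: torch.Tensor) -> torch.Tensor:
        # music_latents: (B, T_music, dim), rhythm: (B, N_rhythm, dim)
        queries = self.queries.unsqueeze(0).expand(music_latents.size(0), -1, -1)
        context, _ = self.gather(queries, rhythm, rhythm)          # queries summarize rhythm
        aligned, _ = self.align(music_latents, context, context)   # latents attend to context
        return music_latents + aligned                             # residual conditioning


if __name__ == "__main__":
    joints = torch.randn(2, 240, 24, 3)       # 2 clips, 240 frames, 24 joints
    rhythm = GenreAdaptiveRhythm(num_joints=24)(joints)
    latents = torch.randn(2, 200, 256)        # music latents on a different time grid
    out = ContextAwareAlignment()(latents, rhythm)
    print(rhythm.shape, out.shape)            # (2, 24, 256) (2, 200, 256)
```

In this sketch the learnable queries act as a fixed-length bottleneck that summarizes variable-length dance rhythm tokens before the music latents attend to them, which sidesteps the length mismatch between the motion and music time grids; the paper's actual modules may differ in structure and detail.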
Related papers
- Listen to Rhythm, Choose Movements: Autoregressive Multimodal Dance Generation via Diffusion and Mamba with Decoupled Dance Dataset [8.721362823189077]
Listen to Rhythm, Choose Movements (LRCM) is a multimodal-guided diffusion framework supporting both diverse input modalities and autoregressive dance motion generation. We will release the full dataset and pretrained models publicly upon acceptance.
arXiv Detail & Related papers (2026-01-06T14:59:22Z)
- Tempo as the Stable Cue: Hierarchical Mixture of Tempo and Beat Experts for Music to 3D Dance Generation [62.82943523102]
Music to 3D dance generation aims to synthesize realistic and rhythmically synchronized human dance from music. We propose TempoMoE, a hierarchical tempo-aware Mixture-of-Experts module. We show that TempoMoE achieves state-of-the-art results in dance quality and rhythm alignment.
arXiv Detail & Related papers (2025-12-21T16:57:08Z)
- Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation [26.273309051211204]
Video-to-music (V2M) generation aims to create music that aligns with visual content. We propose Diff-V2M, a general V2M framework based on a hierarchical conditional diffusion model. For rhythm modeling, we begin by evaluating several rhythmic representations, including low-resolution mel-spectrograms, tempograms, and onset detection functions (ODF). (A brief librosa sketch of these representations appears after this list.)
arXiv Detail & Related papers (2025-11-12T08:02:06Z)
- DuetGen: Music Driven Two-Person Dance Generation via Hierarchical Masked Modeling [70.79846001735547]
We present DuetGen, a framework for generating interactive two-person dances from music. Inspired by the recent advances in motion synthesis, we propose a two-stage solution. We represent both dancers' motions as a unified whole to learn the necessary motion tokens.
arXiv Detail & Related papers (2025-06-23T14:22:50Z)
- MotionRAG-Diff: A Retrieval-Augmented Diffusion Framework for Long-Term Music-to-Dance Generation [10.203209816178552]
MotionRAG-Diff is a hybrid framework that integrates Retrieval-Augmented Generation and diffusion-based refinement. Our method introduces three core innovations. It achieves state-of-the-art performance in motion quality, diversity, and music-motion synchronization accuracy.
arXiv Detail & Related papers (2025-06-03T09:12:48Z)
- FTMoMamba: Motion Generation with Frequency and Text State Space Models [53.60865359814126]
We propose a novel diffusion-based FTMoMamba framework equipped with a Frequency State Space Model (FreqSSM) and a Text State Space Model (TextSSM).
To learn fine-grained representation, FreqSSM decomposes sequences into low-frequency and high-frequency components.
To ensure the consistency between text and motion, TextSSM encodes text features at the sentence level.
arXiv Detail & Related papers (2024-11-26T15:48:12Z)
- Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z)
- AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment [67.10208647482109]
The speech-to-singing (STS) voice conversion task aims to generate singing samples corresponding to speech recordings.
This paper proposes AlignSTS, an STS model based on explicit cross-modal alignment.
Experiments show that AlignSTS achieves superior performance in terms of both objective and subjective metrics.
arXiv Detail & Related papers (2023-05-08T06:02:10Z)
- TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration [75.37311932218773]
We propose a novel task for generating 3D dance movements that simultaneously incorporate both text and music modalities.
Our approach can generate realistic and coherent dance movements conditioned on both text and music while maintaining comparable performance with the two single modalities.
arXiv Detail & Related papers (2023-04-05T12:58:33Z)
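As noted in the Diff-V2M entry above, the rhythmic representations it evaluates (low-resolution mel-spectrograms, tempograms, and onset detection functions) can be computed with librosa. The snippet below is a minimal, self-contained sketch on a synthetic click track rather than code from any of the listed papers; the sampling rate, hop length, and mel resolution are arbitrary choices.

```python
# Hedged sketch: rhythmic audio representations with librosa on a synthetic click track.
import numpy as np
import librosa

sr, hop = 22050, 512
# Synthesize 8 seconds of clicks at 120 BPM as a stand-in for real music audio.
beat_times = np.arange(0, 8, 60.0 / 120.0)
y = librosa.clicks(times=beat_times, sr=sr, length=8 * sr)

# 1) Low-resolution mel-spectrogram (coarse spectral envelope over time).
mel = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop, n_mels=64)
)

# 2) Onset detection function (ODF): frame-wise onset strength envelope.
odf = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop)

# 3) Tempogram: local autocorrelation of the ODF, exposing tempo structure.
tempogram = librosa.feature.tempogram(onset_envelope=odf, sr=sr, hop_length=hop)

print(mel.shape, odf.shape, tempogram.shape)  # e.g. (64, T), (T,), (384, T)
```

Any of these frame-level features could serve as a rhythm conditioning signal for a dance- or video-to-music model.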