TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation
Abstract Overview
This paper introduces TMD-Bench, a benchmark for text-driven music-dance co-generation that evaluates systems along three axes: unimodal generation quality, instruction adherence, and cross-modal rhythmic alignment. The framework combines low-level computable metrics with high-level MLLM-based judgments for audio, video, and audio-visual synchronization. To support this evaluation, the authors curate a 10k-scale rhythm-aligned music-dance dataset and build a Music Captioner for structured music semantics. The paper also presents RhyJAM, a unified text-to-music-and-dance diffusion model based on flow matching, used as an open-source baseline within the benchmark.
Novelty
The main novelty is a benchmark specifically designed for music-dance co-generation, where fine-grained rhythmic coupling matters more than generic audio-video consistency. The work distinguishes itself by combining beat-centric physical alignment metrics (VBCS and ABHS) with perceptual MLLM judging, and by pairing this evaluation protocol with a rhythm-aligned dataset, a structured Music Captioner, and a unified baseline model.
Results
TMD-Bench shows that commercial audio-video generators achieve strong unimodal audio and video quality, but rhythmic synchronization between music and dance remains inconsistent across systems (e.g., Sora 2 achieves VBCS 0.50 but only ABHS 0.16). RhyJAM attains the strongest reported beat coverage (ABHS 0.27) while matching the top VBCS value of 0.50, and achieves a perceptual alignment score of 0.79 that exceeds all open-source and cascaded baselines. The Music Captioner shows high agreement in semantic labeling, particularly for tempo (0.91) and functional scenes (0.93).
Key Points
- TMD-Bench evaluates music-dance co-generation through a triadic framework covering unimodal quality (audio and video), instruction adherence, and cross-modal rhythmic alignment, using both computable metrics and MLLM-based perceptual judgments.
- The benchmark introduces MDAlign with beat-centric measures (VBCS for beat proximity, ABHS for beat coverage) complemented by MLLM-based perceptual alignment scoring to capture rhythmic coherence beyond pointwise event matching.
- Experiments reveal a persistent gap between strong unimodal generation and reliable rhythm alignment across all tested systems. RhyJAM achieves the best averaged alignment score (0.59) among all methods, including closed-source systems.
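
The difference between beat proximity and beat coverage noted above can be illustrated with a small sketch. This is not the paper's definition of VBCS or ABHS, which this summary does not give. The function names, the Gaussian closeness kernel, and the `sigma` and `tol` parameters are all assumptions chosen for illustration. The sketch only shows why a system can score well on proximity (the dance beats it produces land near music beats) and still score poorly on coverage (many music beats have no matching dance beat).

```python
import math


def nearest_gap(t, beats):
    """Absolute time gap (seconds) from t to the closest beat in `beats`."""
    return min(abs(t - b) for b in beats)


def beat_proximity(visual_beats, audio_beats, sigma=0.1):
    """Hypothetical proximity score: mean Gaussian closeness of each
    visual beat to its nearest audio beat (1.0 means perfectly on-beat)."""
    if not visual_beats or not audio_beats:
        return 0.0
    scores = [
        math.exp(-nearest_gap(t, audio_beats) ** 2 / (2 * sigma ** 2))
        for t in visual_beats
    ]
    return sum(scores) / len(scores)


def beat_coverage(visual_beats, audio_beats, tol=0.1):
    """Hypothetical coverage score: fraction of audio beats that have at
    least one visual beat within `tol` seconds."""
    if not visual_beats or not audio_beats:
        return 0.0
    hits = sum(1 for t in audio_beats if nearest_gap(t, visual_beats) <= tol)
    return hits / len(audio_beats)


# Toy example (made-up timestamps in seconds, not data from the paper):
# the dancer moves exactly on beat, but only on every other beat.
audio = [0.5, 1.0, 1.5, 2.0]
visual = [0.5, 1.5]

print(beat_proximity(visual, audio))  # 1.0
print(beat_coverage(visual, audio))   # 0.5
```

In this toy case every detected dance beat is perfectly timed, so proximity is maximal, yet half of the music beats go unmatched, so coverage is only 0.5. Reporting both kinds of measure side by side is what makes such a gap visible.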