FuguReport

TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation

Authors Xiaoda Yang, Majun Zhang, Changhao Pan, Nick Huang, Yang Yuguang, Fan Zhuo, Pengfei Zhou, Jin Zhou, Sizhe Shan, Shan Yang, Miles Yang, Yang You, Zhou Zhao
Affiliations Zhejiang University / Tencent / National University of Singapore
Categories Evaluation / Multimodal Benchmarking / Music and dance co-generation quality, Evaluation / Generation Quality / Monotony and instruction compliance, Application / Generative Multimedia / Text-driven music-dance synthesis
License CC0 1.0

Abstract Overview

This paper introduces TMD-Bench, a benchmark for text-driven music-dance co-generation that evaluates systems along three axes: unimodal generation quality, instruction adherence, and cross-modal rhythmic alignment. The framework combines low-level computable metrics with high-level MLLM-based judgments for audio, video, and audio-visual synchronization. To support this evaluation, the authors curate a 10k-scale rhythm-aligned music-dance dataset and build a Music Captioner for structured music semantics. The paper also presents RhyJAM, a unified text-to-music-and-dance diffusion model based on flow matching, used as an open-source baseline within the benchmark.

Novelty

The main novelty is a benchmark specifically designed for music-dance co-generation, where fine-grained rhythmic coupling matters more than generic audio-video consistency. The work distinguishes itself by combining beat-centric physical alignment metrics (VBCS and ABHS) with perceptual MLLM judging, and by pairing this evaluation protocol with a rhythm-aligned dataset, a structured Music Captioner, and a unified baseline model.

Results

TMD-Bench shows that commercial audio-video generators achieve strong unimodal audio and video quality, but rhythmic synchronization between music and dance remains inconsistent across systems (e.g., Sora 2 achieves VBCS 0.50 but only ABHS 0.16). RhyJAM attains the strongest reported beat coverage (ABHS 0.27) while matching the top VBCS value of 0.50, and achieves a perceptual alignment score of 0.79 that exceeds all open-source and cascaded baselines. The Music Captioner shows high agreement in semantic labeling, particularly for tempo (0.91) and functional scenes (0.93).

Key Points

  1. TMD-Bench evaluates music-dance co-generation through a triadic framework covering unimodal (audio and video) generation quality, instruction adherence, and cross-modal rhythmic alignment, using both computable metrics and MLLM-based perceptual judgments.
  2. The benchmark introduces MDAlign with beat-centric measures (VBCS for beat proximity, ABHS for beat coverage) complemented by MLLM-based perceptual alignment scoring to capture rhythmic coherence beyond pointwise event matching.
  3. Experiments reveal a persistent gap between strong unimodal generation and reliable rhythm alignment across all tested systems. RhyJAM achieves the best averaged alignment score (0.59) among all methods, including closed-source systems.
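
To make the difference between beat proximity and beat coverage concrete, the sketch below computes two simple stand-in scores from lists of beat timestamps. It is an illustration only: this summary does not give the paper's actual VBCS and ABHS formulas, so the function names, the Gaussian weighting, and the `sigma` and `tolerance` values are all assumptions rather than the benchmark's definitions.

```python
import math
from bisect import bisect_left


def nearest_distance(t, sorted_times):
    """Distance from time t to the closest entry in a non-empty sorted list."""
    i = bisect_left(sorted_times, t)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_times[max(i - 1, 0):i + 1]
    return min(abs(t - c) for c in candidates)


def beat_proximity(visual_beats, audio_beats, sigma=0.1):
    """Hypothetical proximity score: how close each visual beat lands to an audio beat.

    Each visual beat contributes a Gaussian-weighted closeness in (0, 1];
    the score is the mean over visual beats. sigma (seconds) is an assumed value.
    """
    if not visual_beats or not audio_beats:
        return 0.0
    audio = sorted(audio_beats)
    scores = [
        math.exp(-nearest_distance(v, audio) ** 2 / (2 * sigma ** 2))
        for v in visual_beats
    ]
    return sum(scores) / len(scores)


def beat_coverage(visual_beats, audio_beats, tolerance=0.1):
    """Hypothetical coverage score: fraction of audio beats matched by a visual beat.

    An audio beat counts as covered when some visual beat lies within
    tolerance seconds of it. tolerance is an assumed value.
    """
    if not visual_beats or not audio_beats:
        return 0.0
    visual = sorted(visual_beats)
    hits = sum(1 for a in audio_beats if nearest_distance(a, visual) <= tolerance)
    return hits / len(audio_beats)


# A dancer who hits only every other musical beat, but hits those exactly.
audio = [0.5, 1.0, 1.5, 2.0]
visual = [0.5, 1.5]
proximity = beat_proximity(visual, audio)
coverage = beat_coverage(visual, audio)
```

In this toy case every visual beat sits exactly on an audio beat, so the proximity score is at its maximum, yet only half of the audio beats are covered. This mirrors the pattern reported in the results, where a system can score well on the proximity-style metric while scoring much lower on the coverage-style one.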
