MotionBeat: Motion-Aligned Music Representation via Embodied Contrastive Learning and Bar-Equivariant Contact-Aware Encoding
- URL: http://arxiv.org/abs/2510.13244v1
- Date: Wed, 15 Oct 2025 07:44:32 GMT
- Title: MotionBeat: Motion-Aligned Music Representation via Embodied Contrastive Learning and Bar-Equivariant Contact-Aware Encoding
- Authors: Xuanchen Wang, Heng Wang, Weidong Cai
- Abstract summary: MotionBeat is a framework for motion-aligned music representation learning. We show that MotionBeat outperforms state-of-the-art audio encoders in music-to-dance generation.
- Score: 13.25040795516169
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Music is both an auditory and an embodied phenomenon, closely linked to human motion and naturally expressed through dance. However, most existing audio representations neglect this embodied dimension, limiting their ability to capture rhythmic and structural cues that drive movement. We propose MotionBeat, a framework for motion-aligned music representation learning. MotionBeat is trained with two newly proposed objectives: the Embodied Contrastive Loss (ECL), an enhanced InfoNCE formulation with tempo-aware and beat-jitter negatives to achieve fine-grained rhythmic discrimination, and the Structural Rhythm Alignment Loss (SRAL), which ensures rhythm consistency by aligning music accents with corresponding motion events. Architecturally, MotionBeat introduces bar-equivariant phase rotations to capture cyclic rhythmic patterns and contact-guided attention to emphasize motion events synchronized with musical accents. Experiments show that MotionBeat outperforms state-of-the-art audio encoders in music-to-dance generation and transfers effectively to beat tracking, music tagging, genre and instrument classification, emotion recognition, and audio-visual retrieval. Our project demo page: https://motionbeat2025.github.io/.
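To make the Embodied Contrastive Loss (ECL) concrete, below is a minimal, hedged sketch of an InfoNCE-style objective augmented with tempo-aware and beat-jitter hard negatives. All tensor names, shapes, and the single temperature are illustrative assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def embodied_contrastive_loss(music_emb, motion_emb,
                              tempo_neg_emb, jitter_neg_emb, tau=0.07):
    """Sketch of InfoNCE with extra hard negatives (assumed shapes).

    music_emb, motion_emb: (B, D) paired music/motion embeddings.
    tempo_neg_emb:  (B, K, D) embeddings of tempo-perturbed music clips.
    jitter_neg_emb: (B, K, D) embeddings of beat-jittered music clips.
    """
    music = F.normalize(music_emb, dim=-1)
    motion = F.normalize(motion_emb, dim=-1)

    # Positives on the diagonal; other motions in the batch are easy negatives.
    logits = music @ motion.t() / tau                                # (B, B)

    # Hard negatives: tempo-scaled and beat-jittered variants of each clip.
    hard = F.normalize(torch.cat([tempo_neg_emb, jitter_neg_emb], 1), dim=-1)
    hard_logits = torch.einsum('bd,bkd->bk', music, hard) / tau      # (B, 2K)

    logits = torch.cat([logits, hard_logits], dim=1)                 # (B, B+2K)
    targets = torch.arange(music.size(0), device=music.device)
    return F.cross_entropy(logits, targets)
```

In this sketch the in-batch negatives give coarse cross-modal discrimination, while the tempo- and jitter-perturbed variants of the same clip act as hard negatives that force the fine-grained rhythmic discrimination the abstract describes.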
Related papers
- Tempo as the Stable Cue: Hierarchical Mixture of Tempo and Beat Experts for Music to 3D Dance Generation [62.82943523102]
Music to 3D dance generation aims to synthesize realistic and rhythmically synchronized human dance from music. We propose TempoMoE, a hierarchical tempo-aware Mixture-of-Experts module, and show that it achieves state-of-the-art results in dance quality and rhythm alignment (a minimal gating sketch follows this entry).
arXiv Detail & Related papers (2025-12-21T16:57:08Z)
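As a rough illustration of tempo-conditioned expert routing, here is a minimal PyTorch sketch. The module name, the scalar-BPM gate, and all sizes are assumptions for illustration; TempoMoE's actual hierarchical tempo/beat design is more elaborate.

```python
import torch
import torch.nn as nn

class TempoGatedMoE(nn.Module):
    """Hypothetical tempo-aware mixture-of-experts layer (not the paper's)."""
    def __init__(self, dim=256, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(1, num_experts)  # gate on a scalar tempo (BPM)

    def forward(self, x, tempo_bpm):
        # x: (B, T, dim) music features; tempo_bpm: (B, 1) estimated tempo.
        weights = torch.softmax(self.gate(tempo_bpm), dim=-1)       # (B, E)
        expert_out = torch.stack([e(x) for e in self.experts], 1)   # (B, E, T, dim)
        return torch.einsum('be,betd->btd', weights, expert_out)
```

The gate softly routes each sequence toward tempo-appropriate experts; a hierarchical variant would additionally condition on beat-level features.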
- MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation [20.517753182293095]
MACE-Dance is a music-driven dance video generation framework built on cascaded Mixture-of-Experts (MoE). The Motion Expert performs music-to-3D motion generation while enforcing kinematic plausibility and artistic expressiveness. The Appearance Expert carries out motion- and reference-conditioned video synthesis, preserving visual identity with temporal coherence.
arXiv Detail & Related papers (2025-12-20T02:34:34Z)
- ChoreoMuse: Robust Music-to-Dance Video Generation with Style Transfer and Beat-Adherent Motion [10.21851621470535]
We introduce ChoreoMuse, a diffusion-based framework that uses SMPL-format parameters and their variants as intermediaries between music and video generation. ChoreoMuse supports style-controllable, high-fidelity dance video generation across diverse musical genres and individual dancer characteristics. Our method employs a novel music encoder, MotionTune, to capture motion cues from audio, ensuring that the generated choreography closely follows the beat and expressive qualities of the input music.
arXiv Detail & Related papers (2025-07-26T07:17:50Z)
- Align Your Rhythm: Generating Highly Aligned Dance Poses with Gating-Enhanced Rhythm-Aware Feature Representation [22.729568599120846]
We propose Danceba, a novel framework that leverages a gating mechanism to enhance rhythm-aware feature representation. It introduces Phase-Based Rhythm Extraction (PRE) to precisely extract rhythmic information from musical phase data, Temporal-Gated Causal Attention (TGCA) to focus on global rhythmic features, and a Parallel Mamba Motion Modeling (PMMM) architecture to separately model upper- and lower-body motions.
arXiv Detail & Related papers (2025-03-21T17:42:50Z)
- MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization [52.498942604622165]
This paper presents MuVi, a framework to generate music that aligns with video content.
MuVi analyzes video content through a specially designed visual adaptor to extract contextually and temporally relevant features.
We show that MuVi demonstrates superior performance in both audio quality and temporal synchronization.
arXiv Detail & Related papers (2024-10-16T18:44:56Z)
- TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration [75.37311932218773]
We propose a novel task for generating 3D dance movements that simultaneously incorporate both text and music modalities.
Our approach can generate realistic and coherent dance movements conditioned on both text and music, while maintaining performance comparable to using either single modality alone.
arXiv Detail & Related papers (2023-04-05T12:58:33Z)
- BRACE: The Breakdancing Competition Dataset for Dance Motion Synthesis [123.73677487809418]
We introduce a new dataset aiming to challenge common assumptions in dance motion synthesis.
We focus on breakdancing which features acrobatic moves and tangled postures.
Our efforts produced the BRACE dataset, which contains over 3 hours and 30 minutes of densely annotated poses.
arXiv Detail & Related papers (2022-07-20T18:03:54Z)
- Learning Music-Dance Representations through Explicit-Implicit Rhythm Synchronization [22.279424952432677]
The learned music-dance representation can be applied to three downstream tasks: (a) dance classification, (b) music-dance retrieval, and (c) music-dance retargeting.
We derive the dance rhythms from visual appearance and motion cues, inspired by music rhythm analysis. The visual rhythms are then temporally aligned with their music counterparts, which are extracted from the amplitude of the sound intensity (a minimal extraction sketch follows this entry).
arXiv Detail & Related papers (2022-07-07T09:44:44Z)
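The summary above describes deriving music rhythm from the amplitude of the sound intensity. Below is a minimal, hedged sketch of one such amplitude-envelope pipeline using librosa; the function name, hop length, and peak-picking parameters are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
import librosa

def music_rhythm_from_amplitude(path, hop=512):
    """Sketch: rhythm event times from an amplitude envelope (assumed pipeline)."""
    y, sr = librosa.load(path, sr=None, mono=True)
    # Amplitude envelope: RMS energy per frame.
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    # Rhythm events: peaks in the positive first difference of the envelope.
    flux = np.maximum(0.0, np.diff(rms, prepend=rms[0]))
    peaks = librosa.util.peak_pick(flux, pre_max=3, post_max=3,
                                   pre_avg=10, post_avg=10,
                                   delta=0.02, wait=5)
    # Event times (seconds) that visual rhythms could be aligned against.
    return librosa.frames_to_time(peaks, sr=sr, hop_length=hop)
```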
- Music-to-Dance Generation with Optimal Transport [48.92483627635586]
We propose a Music-to-Dance with Optimal Transport Network (MDOT-Net) for learning to generate 3D dance choreographies from music.
We introduce an optimal transport distance for evaluating the authenticity of the generated dance distribution, and a Gromov-Wasserstein distance to measure the correspondence between the dance distribution and the input music (a minimal sketch of both distances follows this entry).
arXiv Detail & Related papers (2021-12-03T09:37:26Z)
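The two distances named above can be sketched with the POT (Python Optimal Transport) library. This is a hedged illustration under assumed feature shapes and uniform sample weights, not MDOT-Net's actual training objective.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def dance_music_ot_scores(dance_feats, ref_feats, music_feats):
    """Sketch of the two evaluation distances (assumed feature choices).

    dance_feats: (n, d) generated dance features
    ref_feats:   (m, d) reference (real) dance features
    music_feats: (n, k) music features paired with the generated frames
    """
    n, m = len(dance_feats), len(ref_feats)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)  # uniform weights

    # Optimal transport distance: authenticity of the generated distribution.
    M = ot.dist(dance_feats, ref_feats)              # pairwise sq. Euclidean
    w2 = ot.emd2(a, b, M)

    # Gromov-Wasserstein distance: cross-domain music/dance correspondence,
    # compared through each domain's internal distance structure.
    C1 = ot.dist(dance_feats, dance_feats)
    C2 = ot.dist(music_feats, music_feats)
    gw = ot.gromov.gromov_wasserstein2(C1, C2, a, np.full(n, 1.0 / n),
                                       loss_fun='square_loss')
    return w2, gw
```

Gromov-Wasserstein is the natural choice for the music/dance pairing because the two modalities live in different feature spaces, so only their internal distance structures can be compared directly.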
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)