TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration
- URL: http://arxiv.org/abs/2304.02419v2
- Date: Sun, 1 Oct 2023 15:23:02 GMT
- Title: TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration
- Authors: Kehong Gong, Dongze Lian, Heng Chang, Chuan Guo, Zihang Jiang, Xinxin
Zuo, Michael Bi Mi, Xinchao Wang
- Abstract summary: We propose a novel task for generating 3D dance movements that simultaneously incorporate both text and music modalities.
Our approach can generate realistic and coherent dance movements conditioned on both text and music while maintaining performance comparable to that of the single-modality counterparts.
- Score: 75.37311932218773
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel task for generating 3D dance movements that simultaneously
incorporate both text and music modalities. Unlike existing works that generate
dance movements using a single modality such as music, our goal is to produce
richer dance movements guided by the instructive information provided by the
text. However, the lack of paired motion data with both music and text
modalities limits the ability to generate dance movements that integrate both.
To alleviate this challenge, we propose to utilize a 3D human motion VQ-VAE to
project the motions of the two datasets into a latent space consisting of
quantized vectors, which effectively mixes the motion tokens from the two
differently distributed datasets for training.
cross-modal transformer to integrate text instructions into the motion generation
architecture, so that 3D dance movements can be generated without degrading the
performance of music-conditioned dance generation. To better evaluate the
quality of the generated motion, we introduce two novel metrics, namely Motion
Prediction Distance (MPD) and Freezing Score (FS), to measure the coherence and
freezing percentage of the generated motion. Extensive experiments show that
our approach can generate realistic and coherent dance movements conditioned on
both text and music while maintaining performance comparable to that of the
single-modality counterparts. Code is available at https://garfield-kh.github.io/TM2D/.
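The abstract's two technical ingredients lend themselves to a short illustration: motion tokenization with a VQ-VAE codebook, so that music-paired and text-paired motion data share one discrete vocabulary, and motion-quality metrics such as the Freezing Score. The snippet below is a minimal numpy sketch, not the authors' implementation; the codebook size, feature dimension, and the exact Freezing Score definition (taken here as the fraction of frames with near-zero joint velocity) are assumptions made for illustration.

```python
import numpy as np

def quantize_motion(features, codebook):
    """Map per-frame motion features to nearest-codebook token indices.

    features: (T, D) encoded motion frames; codebook: (K, D) learned VQ-VAE codes.
    Returns a length-T integer array of motion tokens shared by both datasets.
    """
    # Squared Euclidean distance from every frame to every code vector.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

def freezing_score(joints, vel_thresh=1e-3):
    """Assumed FS definition: fraction of frames whose mean joint velocity falls
    below a small threshold, i.e. frames where the generated motion 'freezes'."""
    vel = np.linalg.norm(np.diff(joints, axis=0), axis=-1).mean(axis=1)  # (T-1,)
    return float((vel < vel_thresh).mean())

# Toy usage with random data: 8 frames of 64-D features, a 512-entry codebook,
# and a deliberately static 10-frame clip whose Freezing Score is 1.0.
rng = np.random.default_rng(0)
tokens = quantize_motion(rng.normal(size=(8, 64)), rng.normal(size=(512, 64)))
fs = freezing_score(np.repeat(rng.normal(size=(1, 22, 3)), 10, axis=0))
print(tokens, fs)
```

In this picture, the cross-modal transformer only ever sees sequences of such token indices, which is why motions from two differently distributed datasets can be mixed during training.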
Related papers
- UniMuMo: Unified Text, Music and Motion Generation [57.72514622935806]
We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities.
By converting music, motion, and text into token-based representations, our model bridges these modalities through a unified encoder-decoder transformer architecture.
arXiv Detail & Related papers (2024-10-06T16:04:05Z)
- MIDGET: Music Conditioned 3D Dance Generation [13.067687949642641]
We introduce a MusIc conditioned 3D Dance GEneraTion model, named MIDGET, to generate vibrant and high-quality dances that match the music rhythm.
To tackle challenges in the field, we introduce three new components: 1) a pre-trained memory codebook based on the Motion VQ-VAE model to store different human pose codes, 2) a Motion GPT model to generate pose codes from music and motion embeddings, and 3) a simple framework for music feature extraction.
arXiv Detail & Related papers (2024-04-18T10:20:37Z)
- Duolando: Follower GPT with Off-Policy Reinforcement Learning for Dance Accompaniment [87.20240797625648]
We introduce a novel task within the field of 3D dance generation, termed dance accompaniment.
It requires the generation of responsive movements from a dance partner, the "follower", synchronized with the lead dancer's movements and the underlying musical rhythm.
We propose a GPT-based model, Duolando, which autoregressively predicts the subsequent tokenized motion conditioned on the coordinated information of the music, the leader's, and the follower's movements (a pattern sketched in the code after this list).
arXiv Detail & Related papers (2024-03-27T17:57:02Z)
- LM2D: Lyrics- and Music-Driven Dance Synthesis [28.884929875333846]
LM2D is designed to create dance conditioned on both music and lyrics in one diffusion generation step.
We introduce the first 3D dance-motion dataset that encompasses both music and lyrics, obtained with pose estimation technologies.
The results demonstrate that LM2D is able to produce realistic and diverse dances matching both lyrics and music.
arXiv Detail & Related papers (2024-03-14T13:59:04Z)
- Bidirectional Autoregressive Diffusion Model for Dance Generation [26.449135437337034]
We propose a Bidirectional Autoregressive Diffusion Model (BADM) for music-to-dance generation.
A bidirectional encoder is built to enforce that the generated dance is harmonious in both the forward and backward directions.
To make the generated dance motion smoother, a local information decoder is built for local motion enhancement.
arXiv Detail & Related papers (2024-02-06T19:42:18Z)
- BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics [50.88842027976421]
We propose BOTH57M, a novel multi-modal dataset for two-hand motion generation.
Our dataset includes accurate motion tracking for the human body and hands.
We also provide a strong baseline method, BOTH2Hands, for the novel task.
arXiv Detail & Related papers (2023-12-13T07:30:19Z)
- Music-to-Dance Generation with Optimal Transport [48.92483627635586]
We propose a Music-to-Dance with Optimal Transport Network (MDOT-Net) for learning to generate 3D dance choreographs from music.
We introduce an optimal transport distance for evaluating the authenticity of the generated dance distribution and a Gromov-Wasserstein distance to measure the correspondence between the dance distribution and the input music.
arXiv Detail & Related papers (2021-12-03T09:37:26Z)
- Transflower: probabilistic autoregressive dance generation with multimodal attention [31.308435764603658]
We present a novel probabilistic autoregressive architecture that models the distribution over future poses with a normalizing flow conditioned on previous poses as well as music context.
We also introduce the currently largest 3D dance-motion dataset, obtained with a variety of motion-capture technologies and including both professional and casual dancers.
arXiv Detail & Related papers (2021-06-25T20:14:28Z)
- DanceFormer: Music Conditioned 3D Dance Generation with Parametric Motion Transformer [23.51701359698245]
In this paper, we reformulate music-conditioned dance generation as a two-stage process, i.e., key pose generation followed by in-between parametric motion curve prediction.
We propose a large-scale music-conditioned 3D dance dataset, called PhantomDance, that is accurately labeled by experienced animators.
Experiments demonstrate that the proposed method, even trained by existing datasets, can generate fluent, performative, and music-matched 3D dances.
arXiv Detail & Related papers (2021-03-18T12:17:38Z)
- Learning to Generate Diverse Dance Motions with Transformer [67.43270523386185]
We introduce a complete system for dance motion synthesis.
A massive dance motion data set is created from YouTube videos.
A novel two-stream motion transformer generative model can generate motion sequences with high flexibility.
arXiv Detail & Related papers (2020-08-18T22:29:40Z)
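Several of the entries above, like the main paper, follow the same generation recipe: encode the conditioning signals (music, text, or a partner's motion) into tokens, then autoregressively predict discrete motion tokens that a VQ-VAE decoder maps back to poses. The loop below is a generic, hedged sketch of that pattern with a random placeholder standing in for a trained GPT-style transformer; the function names and the 512-entry codebook are illustrative assumptions, not any listed paper's actual API.

```python
import numpy as np

def sample_next_token(logits, rng):
    """Sample one motion token from a categorical distribution over the codebook."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def generate_motion_tokens(condition_tokens, n_frames, next_token_logits, rng):
    """Generic autoregressive loop: each new motion token is predicted from the
    conditioning tokens plus all motion tokens generated so far."""
    motion_tokens = []
    for _ in range(n_frames):
        logits = next_token_logits(condition_tokens, motion_tokens)  # placeholder model call
        motion_tokens.append(sample_next_token(logits, rng))
    return motion_tokens

# Toy run: a random "model" over an assumed 512-entry codebook.
rng = np.random.default_rng(0)
dummy_model = lambda cond, history: rng.normal(size=512)
tokens = generate_motion_tokens(condition_tokens=[3, 7, 7, 1], n_frames=8,
                                next_token_logits=dummy_model, rng=rng)
print(tokens)  # decoded back to poses by a VQ-VAE decoder in the real systems
```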
This list is automatically generated from the titles and abstracts of the papers on this site.