MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls
- URL: http://arxiv.org/abs/2407.21136v3
- Date: Sun, 25 Aug 2024 07:35:04 GMT
- Title: MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls
- Authors: Yuxuan Bian, Ailing Zeng, Xuan Ju, Xian Liu, Zhaoyang Zhang, Wei Liu, Qiang Xu,
- Abstract summary: We propose MotionCraft, a unified diffusion transformer that crafts whole-body motion with plug-and-play multimodal control.
Our framework employs a coarse-to-fine training strategy, starting with the first stage of text-to-motion semantic pre-training.
We introduce MC-Bench, the first available multimodal whole-body motion generation benchmark based on the unified SMPL-X format.
- Score: 30.487510829107908
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Whole-body multimodal motion generation, controlled by text, speech, or music, has numerous applications including video generation and character animation. However, employing a unified model to achieve various generation tasks with different condition modalities presents two main challenges: motion distribution drifts across different tasks (e.g., co-speech gestures and text-driven daily actions) and the complex optimization of mixed conditions with varying granularities (e.g., text and audio). Additionally, inconsistent motion formats across different tasks and datasets hinder effective training toward multimodal motion generation. In this paper, we propose MotionCraft, a unified diffusion transformer that crafts whole-body motion with plug-and-play multimodal control. Our framework employs a coarse-to-fine training strategy, starting with the first stage of text-to-motion semantic pre-training, followed by the second stage of multimodal low-level control adaptation to handle conditions of varying granularities. To effectively learn and transfer motion knowledge across different distributions, we design MC-Attn for parallel modeling of static and dynamic human topology graphs. To overcome the motion format inconsistency of existing benchmarks, we introduce MC-Bench, the first available multimodal whole-body motion generation benchmark based on the unified SMPL-X format. Extensive experiments show that MotionCraft achieves state-of-the-art performance on various standard motion generation tasks.
Related papers
- MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding [76.30210465222218]
MotionGPT-2 is a unified Large Motion-Language Model (LMLMLM)
It supports multimodal control conditions through pre-trained Large Language Models (LLMs)
It is highly adaptable to the challenging 3D holistic motion generation task.
arXiv Detail & Related papers (2024-10-29T05:25:34Z) - Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs [67.59291068131438]
Motion-Agent is a conversational framework designed for general human motion generation, editing, and understanding.
Motion-Agent employs an open-source pre-trained language model to develop a generative agent, MotionLLM, that bridges the gap between motion and text.
arXiv Detail & Related papers (2024-05-27T09:57:51Z) - M$^3$GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation [78.77004913030285]
M$3$GPT is an advanced $textbfM$ultimodal, $textbfM$ultitask framework for comprehension and generation.
We employ discrete vector quantization for multimodal conditional signals, such as text, music and motion/dance, enabling seamless integration into a large language model.
M$3$GPT learns to model the connections and synergies among various motion-relevant tasks.
arXiv Detail & Related papers (2024-05-25T15:21:59Z) - Large Motion Model for Unified Multi-Modal Motion Generation [50.56268006354396]
Large Motion Model (LMM) is a motion-centric, multi-modal framework that unifies mainstream motion generation tasks into a generalist model.
LMM tackles these challenges from three principled aspects.
arXiv Detail & Related papers (2024-04-01T17:55:11Z) - Animate Your Motion: Turning Still Images into Dynamic Videos [58.63109848837741]
We introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs.
SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions.
Our design significantly enhances video quality, motion precision, and semantic coherence.
arXiv Detail & Related papers (2024-03-15T10:36:24Z) - MoConVQ: Unified Physics-Based Motion Control via Scalable Discrete
Representations [25.630268570049708]
MoConVQ is a novel unified framework for physics-based motion control leveraging scalable discrete representations.
Our approach effectively learns motion embeddings from a large, unstructured dataset spanning tens of hours of motion examples.
arXiv Detail & Related papers (2023-10-16T09:09:02Z) - DiverseMotion: Towards Diverse Human Motion Generation via Discrete
Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions.
We show that our DiverseMotion achieves the state-of-the-art motion quality and competitive motion diversity.
arXiv Detail & Related papers (2023-09-04T05:43:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.