MMM: Generative Masked Motion Model
- URL: http://arxiv.org/abs/2312.03596v2
- Date: Thu, 28 Mar 2024 03:26:51 GMT
- Title: MMM: Generative Masked Motion Model
- Authors: Ekkasit Pinyoanuntapong, Pu Wang, Minwoo Lee, Chen Chen
- Abstract summary: MMM is a novel yet simple motion generation paradigm based on Masked Motion Model.
By attending to motion and text tokens in all directions, MMM captures inherent dependency among motion tokens and semantic mapping between motion and text tokens.
MMM is two orders of magnitude faster on a single mid-range GPU than editable motion diffusion models.
- Score: 10.215003912084944
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in text-to-motion generation using diffusion and autoregressive models have shown promising results. However, these models often suffer from a trade-off between real-time performance, high fidelity, and motion editability. To address this gap, we introduce MMM, a novel yet simple motion generation paradigm based on Masked Motion Model. MMM consists of two key components: (1) a motion tokenizer that transforms 3D human motion into a sequence of discrete tokens in latent space, and (2) a conditional masked motion transformer that learns to predict randomly masked motion tokens, conditioned on the pre-computed text tokens. By attending to motion and text tokens in all directions, MMM explicitly captures inherent dependency among motion tokens and semantic mapping between motion and text tokens. During inference, this allows parallel and iterative decoding of multiple motion tokens that are highly consistent with fine-grained text descriptions, therefore simultaneously achieving high-fidelity and high-speed motion generation. In addition, MMM has innate motion editability. By simply placing mask tokens in the place that needs editing, MMM automatically fills the gaps while guaranteeing smooth transitions between editing and non-editing parts. Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that MMM surpasses current leading methods in generating high-quality motion (evidenced by superior FID scores of 0.08 and 0.429), while offering advanced editing features such as body-part modification, motion in-betweening, and the synthesis of long motion sequences. In addition, MMM is two orders of magnitude faster on a single mid-range GPU than editable motion diffusion models. Our project page is available at https://exitudio.github.io/MMM-page.
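To make the parallel, iterative decoding concrete, here is a minimal PyTorch sketch of confidence-based masked decoding as the abstract describes it: start from an all-masked token sequence, predict every position at once, commit the most confident predictions, and re-mask the rest on a cosine schedule. The function name, the `MASK_ID` convention, the greedy argmax choice, and the cosine schedule are illustrative assumptions, not the authors' released code.

```python
import math
import torch

MASK_ID = 1024  # hypothetical [MASK] id; assume motion codebook ids are 0..1023

@torch.no_grad()
def parallel_masked_decode(transformer, text_tokens, seq_len, steps=10):
    """Iteratively fill an all-masked motion sequence: each step predicts all
    positions in parallel, commits the most confident tokens, and re-masks
    the rest, masking fewer positions as the cosine schedule decays to 0."""
    motion = torch.full((1, seq_len), MASK_ID, dtype=torch.long)
    for step in range(steps):
        logits = transformer(motion, text_tokens)   # assumed: (1, seq_len, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        committed = motion != MASK_ID               # keep already-committed tokens
        conf = torch.where(committed, torch.full_like(conf, float("inf")), conf)
        pred = torch.where(committed, motion, pred)
        # Cosine schedule: how many positions remain masked after this step.
        n_masked = math.floor(seq_len * math.cos(math.pi / 2 * (step + 1) / steps))
        motion = pred.clone()
        if n_masked > 0:
            lowest = conf[0].topk(n_masked, largest=False).indices
            motion[0, lowest] = MASK_ID             # re-mask least confident spots
    return motion
```

The editing behavior described in the abstract falls out of the same loop: instead of starting from an all-masked sequence, keep the tokens to be preserved and place `MASK_ID` only over the span that needs editing.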
Related papers
- Text-driven Human Motion Generation with Motion Masked Diffusion Model [23.637853270123045]
Text-driven human motion generation is a task that synthesizes human motion sequences conditioned on natural language.
Current diffusion model-based approaches have outstanding performance in the diversity and multimodality of generation.
We propose the Motion Masked Diffusion Model (MMDM), a novel motion masking mechanism for diffusion-based human motion generation.
arXiv Detail & Related papers (2024-09-29T12:26:24Z)
- Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer [62.29951737214263]
Existing algorithms directly generate the full sequence which is expensive and prone to errors.
We propose KeyMotion, which generates plausible human motion sequences corresponding to input text.
We use a Variational Autoencoder (VAE) with Kullback-Leibler regularization to project the motion sequence into a latent space (see the sketch after this entry).
For the reverse diffusion, we propose a novel Parallel Skip Transformer that performs cross-modal attention between the latents and the text condition.
arXiv Detail & Related papers (2024-05-24T11:12:37Z)
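The Kullback-Leibler regularization mentioned in the KeyMotion summary is the standard VAE objective; a minimal sketch, where the MSE reconstruction term and the weight `beta` are illustrative assumptions rather than the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, log_var, beta=1e-4):
    """Reconstruction + Kullback-Leibler regularization toward N(0, I),
    which keeps the latent space smooth enough for a diffusion prior."""
    recon = F.mse_loss(x_hat, x)
    # KL(N(mu, sigma^2) || N(0, 1)) in closed form, averaged over the batch.
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kl
```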
- BAMM: Bidirectional Autoregressive Motion Model [14.668729995275807]
Bidirectional Autoregressive Motion Model (BAMM) is a novel text-to-motion generation framework.
BAMM consists of two key components: a motion tokenizer that transforms 3D human motion into discrete tokens in latent space, and a masked self-attention transformer that autoregressively predicts randomly masked tokens.
This design enables BAMM to simultaneously achieve high-quality motion generation, enhanced usability, and built-in motion editability.
arXiv Detail & Related papers (2024-03-28T14:04:17Z)
- FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing [56.29102849106382]
FineMoGen is a diffusion-based motion generation and editing framework.
It can synthesize fine-grained motions with spatio-temporal composition that follows user instructions.
FineMoGen further enables zero-shot motion editing capabilities with the aid of modern large language models.
arXiv Detail & Related papers (2023-12-22T16:56:02Z)
- OMG: Towards Open-vocabulary Motion Generation via Mixture of Controllers [45.808597624491156]
We present OMG, a novel framework, which enables compelling motion generation from zero-shot open-vocabulary text prompts.
At the pre-training stage, our model improves its generation ability by learning rich inherent motion traits from out-of-domain data.
At the fine-tuning stage, we introduce motion ControlNet, which incorporates text prompts as conditioning information (see the sketch after this entry).
arXiv Detail & Related papers (2023-12-14T14:31:40Z)
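The "motion ControlNet" in the OMG summary presumably follows the ControlNet recipe from image generation: a frozen pre-trained branch plus a trainable copy whose output enters through a zero-initialized projection, so conditioning starts as a no-op. A generic sketch under that assumption; the block layout and dimensions are illustrative, not OMG's actual mixture-of-controllers architecture:

```python
import copy
import torch.nn as nn

class ZeroLinear(nn.Linear):
    """Linear layer initialized to zero, so the conditioning branch starts
    as an exact no-op and its influence is learned gradually."""
    def __init__(self, dim):
        super().__init__(dim, dim)
        nn.init.zeros_(self.weight)
        nn.init.zeros_(self.bias)

class ControlledBlock(nn.Module):
    def __init__(self, pretrained_block, dim):
        super().__init__()
        self.frozen = pretrained_block                    # weights kept frozen
        for p in self.frozen.parameters():
            p.requires_grad_(False)
        self.trainable = copy.deepcopy(pretrained_block)  # copy sees the condition
        self.zero_out = ZeroLinear(dim)

    def forward(self, x, cond):
        # At initialization this reduces to the frozen block alone.
        return self.frozen(x) + self.zero_out(self.trainable(x + cond))
```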
- Motion Flow Matching for Human Motion Synthesis and Editing [75.13665467944314]
We propose Motion Flow Matching, a novel generative model for human motion generation featuring efficient sampling and effectiveness in motion editing applications.
Our method reduces the sampling complexity from a thousand steps in previous diffusion models to just ten, while achieving comparable performance on text-to-motion and action-to-motion generation benchmarks (see the sketch after this entry).
arXiv Detail & Related papers (2023-12-14T12:57:35Z)
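Flow matching replaces the long denoising chain with a learned velocity field that is integrated as an ODE, which is where the ten-step sampling comes from. A minimal Euler-integration sketch; the `velocity_net(x, t, text_emb)` signature is an assumption:

```python
import torch

@torch.no_grad()
def flow_matching_sample(velocity_net, text_emb, shape, steps=10):
    """Integrate the learned velocity field from noise (t=0) to data (t=1)
    with a few Euler steps, instead of a thousand-step diffusion chain."""
    x = torch.randn(shape)          # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        x = x + dt * velocity_net(x, t, text_emb)   # Euler step along dx/dt = v(x, t)
    return x
```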
- DiffusionPhase: Motion Diffusion in Frequency Domain [69.811762407278]
We introduce a learning-based method for generating high-quality human motion sequences from text descriptions.
Existing techniques struggle with motion diversity and smooth transitions in generating arbitrary-length motion sequences.
We develop a network encoder that converts the motion space into a compact yet expressive parameterized phase space (see the sketch after this entry).
arXiv Detail & Related papers (2023-12-07T04:39:22Z)
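The "parameterized phase space" is not spelled out in this summary; purely as a speculative illustration, a DeepPhase-style periodic parameterization would decode a few learned sinusoid components (amplitude, frequency, phase, offset) back into a smooth motion curve. The actual DiffusionPhase encoder may differ substantially:

```python
import torch

def phase_to_motion(amp, freq, phase, offset, t):
    """Decode a compact phase parameterization into motion curves.
    amp, freq, phase: (channels, K) for K sinusoid components per channel;
    offset: (channels,); t: (frames,) timestamps in seconds."""
    waves = amp[..., None] * torch.sin(
        2 * torch.pi * (freq[..., None] * t + phase[..., None]))
    return waves.sum(dim=1) + offset[:, None]   # (channels, frames)
```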
- MoMask: Generative Masked Modeling of 3D Human Motions [25.168781728071046]
MoMask is a novel framework for text-driven 3D human motion generation.
A hierarchical quantization scheme is employed to represent human motion as discrete motion tokens (see the residual quantization sketch after this entry).
MoMask outperforms state-of-the-art methods on the text-to-motion generation task.
arXiv Detail & Related papers (2023-11-29T19:04:10Z)
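MoMask's hierarchical quantization is a residual scheme: the first codebook captures the coarse motion and each further codebook quantizes the leftover error. A minimal sketch, where the layer count and codebook sizes are assumptions:

```python
import torch

def residual_quantize(z, codebooks):
    """Hierarchical (residual) quantization: the first codebook approximates
    the latent, and each subsequent codebook quantizes what is left over, so
    coarse layers carry the overall motion and fine layers the detail.
    z: (n, d); codebooks: list of (K, d) tensors. Returns token ids per layer."""
    tokens, residual = [], z
    for cb in codebooks:
        ids = torch.cdist(residual, cb).argmin(dim=-1)   # nearest entry, (n,)
        residual = residual - cb[ids]                    # quantize, keep remainder
        tokens.append(ids)
    return tokens
```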
- Synthesizing Long-Term Human Motions with Diffusion Models via Coherent Sampling [74.62570964142063]
Text-to-motion generation has gained increasing attention, but most existing methods are limited to generating short-term motions.
We propose a novel approach that utilizes a past-conditioned diffusion model with two optional coherent sampling methods.
Our proposed method is capable of generating compositional and coherent long-term 3D human motions controlled by a user-instructed long text stream.
arXiv Detail & Related papers (2023-08-03T16:18:32Z)
- TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts [20.336481832461168]
Inspired by the strong ties between vision and language, our paper aims to explore the generation of 3D human full-body motions from texts.
We propose the use of motion tokens, a discrete and compact motion representation (see the sketch after this entry).
Our approach is flexible and can be used for both text2motion and motion2text tasks.
arXiv Detail & Related papers (2022-07-04T19:52:18Z)
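The motion-token idea underlying TM2T (and several papers above) is vector quantization: continuous motion features are snapped to their nearest codebook entry, and the index becomes the token. A minimal sketch with a straight-through gradient estimator; the codebook size and feature layout are illustrative assumptions, not TM2T's exact tokenizer:

```python
import torch

def vector_quantize(z, codebook):
    """Map continuous motion features to discrete motion tokens by nearest-
    neighbor lookup, with a straight-through estimator so gradients flow
    back to the encoder during training.
    z: (n, d); codebook: (K, d)."""
    ids = torch.cdist(z, codebook).argmin(dim=-1)
    z_q = codebook[ids]
    # Straight-through: forward uses z_q, backward behaves like identity on z.
    z_q = z + (z_q - z).detach()
    return ids, z_q
```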