MMM: Generative Masked Motion Model
- URL: http://arxiv.org/abs/2312.03596v2
- Date: Thu, 28 Mar 2024 03:26:51 GMT
- Title: MMM: Generative Masked Motion Model
- Authors: Ekkasit Pinyoanuntapong, Pu Wang, Minwoo Lee, Chen Chen
- Abstract summary: MMM is a novel yet simple motion generation paradigm based on Masked Motion Model.
By attending to motion and text tokens in all directions, MMM captures inherent dependency among motion tokens and semantic mapping between motion and text tokens.
MMM is two orders of magnitude faster on a single mid-range GPU than editable motion diffusion models.
- Score: 10.215003912084944
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in text-to-motion generation using diffusion and autoregressive models have shown promising results. However, these models often suffer from a trade-off between real-time performance, high fidelity, and motion editability. To address this gap, we introduce MMM, a novel yet simple motion generation paradigm based on Masked Motion Model. MMM consists of two key components: (1) a motion tokenizer that transforms 3D human motion into a sequence of discrete tokens in latent space, and (2) a conditional masked motion transformer that learns to predict randomly masked motion tokens, conditioned on the pre-computed text tokens. By attending to motion and text tokens in all directions, MMM explicitly captures inherent dependency among motion tokens and semantic mapping between motion and text tokens. During inference, this allows parallel and iterative decoding of multiple motion tokens that are highly consistent with fine-grained text descriptions, therefore simultaneously achieving high-fidelity and high-speed motion generation. In addition, MMM has innate motion editability. By simply placing mask tokens in the place that needs editing, MMM automatically fills the gaps while guaranteeing smooth transitions between editing and non-editing parts. Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that MMM surpasses current leading methods in generating high-quality motion (evidenced by superior FID scores of 0.08 and 0.429), while offering advanced editing features such as body-part modification, motion in-betweening, and the synthesis of long motion sequences. In addition, MMM is two orders of magnitude faster on a single mid-range GPU than editable motion diffusion models. Our project page is available at https://exitudio.github.io/MMM-page.
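
The parallel, iterative decoding and mask-based editing described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `MASK` id, the `toy_predictor` stand-in for the conditional masked transformer, the cosine unmasking schedule, and the confidence-based commit rule are all assumptions chosen for illustration, and the predictor returns random guesses instead of attending to motion and text tokens.

```python
import math
import random

MASK = -1  # hypothetical id for the [MASK] token


def toy_predictor(tokens, rng, codebook_size=8):
    """Stand-in for the conditional masked transformer (hypothetical).

    Returns one (token_id, confidence) guess per position. A real model
    would attend to all motion and text tokens; these guesses are random.
    """
    return [(rng.randrange(codebook_size), rng.random()) for _ in tokens]


def iterative_decode(tokens, steps=4, seed=0):
    """Fill every MASK position over a fixed number of parallel steps."""
    rng = random.Random(seed)
    tokens = list(tokens)
    total_masked = sum(t == MASK for t in tokens)
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        guesses = toy_predictor(tokens, rng)
        # Cosine schedule (an assumption): how many tokens stay masked
        # after this step. It reaches zero at the final step.
        keep_masked = math.floor(
            total_masked * math.cos(math.pi / 2 * step / steps)
        )
        n_commit = max(1, len(masked) - keep_masked)
        # Commit the most confident guesses in parallel; the rest stay
        # masked and are re-predicted in the next step.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = guesses[i][0]
    return tokens


# Generation from scratch: every position starts masked.
generated = iterative_decode([MASK] * 10)

# In-betweening style edit: only the middle span is masked, so the
# surrounding tokens are left untouched.
edited = iterative_decode([3, 3, MASK, MASK, MASK, 5, 5])
```

In this sketch, generation and editing share the same loop. The only difference is which positions start as `MASK`, which mirrors the abstract's point that editing amounts to placing mask tokens where changes are needed.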
 
      
Related papers
- ReMoMask: Retrieval-Augmented Masked Motion Generation [8.471755159366221]
 Text-to-Motion (T2M) generation aims to synthesize realistic and semantically aligned human motion sequences from natural language descriptions.
We propose ReMoMask, a unified framework integrating three key innovations.
A Bidirectional Momentum Text-Motion Model decouples negative sample scale from batch size via momentum queues, substantially improving cross-modal retrieval precision.
A Semantic Spatio-temporal Attention mechanism enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts.
 arXiv  Detail & Related papers  (2025-08-04T16:56:35Z)
- M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation [65.08520614570288]
 We reformulate talking head generation into a unified framework comprising video preprocessing, motion representation, and rendering reconstruction.
M2DAO-Talker achieves state-of-the-art performance, with a 2.43 dB PSNR improvement in generation quality and a 0.64 gain in user-evaluated video realness.
 arXiv  Detail & Related papers  (2025-07-11T04:48:12Z)
- Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion [33.9786226622757]
 We propose a robust motion generation framework MoMADiff to generate 3D human motion from text descriptions.
Our model supports flexible user-provided specification, enabling precise control over both spatial and temporal aspects of motion synthesis.
Our method consistently outperforms state-of-the-art models in motion quality, instruction fidelity, and adherence.
 arXiv  Detail & Related papers  (2025-05-16T09:06:15Z)
- MAG: Multi-Modal Aligned Autoregressive Co-Speech Gesture Generation without Vector Quantization [8.605691647343065]
 This work focuses on full-body co-speech gesture generation. Existing methods typically employ an autoregressive model accompanied by vector-quantized tokens for gesture generation.
We propose MAG, a novel multi-modal aligned framework for high-quality and diverse co-speech gesture synthesis without relying on discrete tokenization.
 arXiv  Detail & Related papers  (2025-03-18T09:02:02Z)
- Mimir: Improving Video Diffusion Models for Precise Text Understanding [53.72393225042688]
 Text serves as the key control signal in video generation due to its narrative nature.
The recent success of large language models (LLMs) showcases the power of decoder-only transformers.
This work addresses this challenge with Mimir, an end-to-end training framework featuring a carefully tailored token fuser.
 arXiv  Detail & Related papers  (2024-12-04T07:26:44Z)
- FTMoMamba: Motion Generation with Frequency and Text State Space Models [53.60865359814126]
 We propose a novel diffusion-based FTMoMamba framework equipped with a Frequency State Space Model and a Text State Space Model.
To learn fine-grained representation, FreqSSM decomposes sequences into low-frequency and high-frequency components.
To ensure the consistency between text and motion, TextSSM encodes text features at the sentence level.
 arXiv  Detail & Related papers  (2024-11-26T15:48:12Z)
- Text-driven Human Motion Generation with Motion Masked Diffusion Model [23.637853270123045]
 Text human motion generation is a task that synthesizes human motion sequences conditioned on natural language.
Current diffusion model-based approaches have outstanding performance in the diversity and multimodality of generation.
We propose Motion Masked Diffusion Model (MMDM), a novel human motion mechanism for diffusion model.
 arXiv  Detail & Related papers  (2024-09-29T12:26:24Z)
- Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer [62.29951737214263]
 Existing algorithms directly generate the full sequence which is expensive and prone to errors.
We propose KeyMotion, which generates plausible human motion sequences corresponding to input text.
We use a Variational Autoencoder (VAE) with Kullback-Leibler regularization to project the keyframes into a latent space.
For the reverse diffusion, we propose a novel Parallel Skip Transformer that performs cross-modal attention between the design latents and text condition.
 arXiv  Detail & Related papers  (2024-05-24T11:12:37Z)
- BAMM: Bidirectional Autoregressive Motion Model [14.668729995275807]
 Bidirectional Autoregressive Motion Model (BAMM) is a novel text-to-motion generation framework.
BAMM consists of two key components: a motion tokenizer that transforms 3D human motion into discrete tokens in latent space, and a masked self-attention transformer that autoregressively predicts randomly masked tokens.
This feature enables BAMM to simultaneously achieve high-quality motion generation with enhanced usability and built-in motion editability.
 arXiv  Detail & Related papers  (2024-03-28T14:04:17Z)
- FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing [56.29102849106382]
 FineMoGen is a diffusion-based motion generation and editing framework.
It can synthesize fine-grained motions, with spatial-temporal composition to the user instructions.
FineMoGen further enables zero-shot motion editing capabilities with the aid of modern large language models.
 arXiv  Detail & Related papers  (2023-12-22T16:56:02Z)
- OMG: Towards Open-vocabulary Motion Generation via Mixture of Controllers [45.808597624491156]
 We present OMG, a novel framework, which enables compelling motion generation from zero-shot open-vocabulary text prompts.
At the pre-training stage, our model improves the generation ability by learning the rich out-of-domain inherent motion traits.
At the fine-tuning stage, we introduce motion ControlNet, which incorporates text prompts as conditioning information.
 arXiv  Detail & Related papers  (2023-12-14T14:31:40Z)
- Motion Flow Matching for Human Motion Synthesis and Editing [75.13665467944314]
 We propose Motion Flow Matching, a novel generative model for human motion generation featuring efficient sampling and effectiveness in motion editing applications.
Our method reduces the sampling complexity from a thousand steps in previous diffusion models to just ten steps, while achieving comparable performance in text-to-motion and action-to-motion generation benchmarks.
 arXiv  Detail & Related papers  (2023-12-14T12:57:35Z)
- DiffusionPhase: Motion Diffusion in Frequency Domain [69.811762407278]
 We introduce a learning-based method for generating high-quality human motion sequences from text descriptions.
Existing techniques struggle with motion diversity and smooth transitions in generating arbitrary-length motion sequences.
We develop a network encoder that converts the motion space into a compact yet expressive parameterized phase space.
 arXiv  Detail & Related papers  (2023-12-07T04:39:22Z)
- MoMask: Generative Masked Modeling of 3D Human Motions [25.168781728071046]
 MoMask is a novel framework for text-driven 3D human motion generation.
A hierarchical quantization scheme is employed to represent human motion as discrete motion tokens.
MoMask outperforms state-of-the-art methods on the text-to-motion generation task.
 arXiv  Detail & Related papers  (2023-11-29T19:04:10Z)
- Synthesizing Long-Term Human Motions with Diffusion Models via Coherent Sampling [74.62570964142063]
 Text-to-motion generation has gained increasing attention, but most existing methods are limited to generating short-term motions.
We propose a novel approach that utilizes a past-conditioned diffusion model with two optional coherent sampling methods.
Our proposed method is capable of generating compositional and coherent long-term 3D human motions controlled by a user-instructed long text stream.
 arXiv  Detail & Related papers  (2023-08-03T16:18:32Z)
- TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts [20.336481832461168]
 Inspired by the strong ties between vision and language, our paper aims to explore the generation of 3D human full-body motions from texts.
We propose the use of motion token, a discrete and compact motion representation.
Our approach is flexible and can be used for both text2motion and motion2text tasks.
 arXiv  Detail & Related papers  (2022-07-04T19:52:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     