Related papers: MoMask: Generative Masked Modeling of 3D Human Motions

URL: http://arxiv.org/abs/2312.00063v1
Date: Wed, 29 Nov 2023 19:04:10 GMT
Title: MoMask: Generative Masked Modeling of 3D Human Motions
Authors: Chuan Guo and Yuxuan Mu and Muhammad Gohar Javed and Sen Wang and Li Cheng
Abstract summary: MoMask is a novel framework for text-driven 3D human motion generation. A hierarchical quantization scheme is employed to represent human motion as discrete motion tokens. MoMask outperforms state-of-the-art methods on the text-to-motion generation task.
Score: 25.168781728071046
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce MoMask, a novel masked modeling framework for text-driven 3D human motion generation. In MoMask, a hierarchical quantization scheme is employed to represent human motion as multi-layer discrete motion tokens with high-fidelity details. Starting at the base layer, with a sequence of motion tokens obtained by vector quantization, the residual tokens of increasing orders are derived and stored at the subsequent layers of the hierarchy. This is consequently followed by two distinct bidirectional transformers. For the base-layer motion tokens, a Masked Transformer is designated to predict randomly masked motion tokens conditioned on text input at the training stage. During the generation (i.e., inference) stage, starting from an empty sequence, our Masked Transformer iteratively fills up the missing tokens. Subsequently, a Residual Transformer learns to progressively predict the next-layer tokens based on the results from the current layer. Extensive experiments demonstrate that MoMask outperforms the state-of-the-art methods on the text-to-motion generation task, with an FID of 0.045 (vs. e.g. 0.141 of T2M-GPT) on the HumanML3D dataset, and 0.228 (vs. 0.514) on KIT-ML, respectively. MoMask can also be seamlessly applied in related tasks without further model fine-tuning, such as text-guided temporal inpainting.
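
The hierarchical quantization described in the abstract can be illustrated with a minimal residual vector quantization sketch. This is not the authors' implementation: the codebooks here are random rather than learned, no encoder or decoder is involved, and all names (`nearest`, `residual_quantize`, `reconstruct`, `make_codebook`) are hypothetical. Each codebook also contains an all-zero entry, an assumption made only so that adding a layer can never increase the reconstruction error in this toy setting.

```python
import random

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

def residual_quantize(frames, codebooks):
    """Encode each frame as one token per layer; returns tokens[layer][t]."""
    tokens = [[] for _ in codebooks]
    for frame in frames:
        residual = list(frame)
        for layer, codebook in enumerate(codebooks):
            idx = nearest(codebook, residual)
            tokens[layer].append(idx)
            # The next layer quantizes whatever this layer failed to capture.
            residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens

def reconstruct(tokens, codebooks, num_layers):
    """Sum the selected codes of the first num_layers layers for each frame."""
    dim = len(codebooks[0][0])
    frames = [[0.0] * dim for _ in tokens[0]]
    for layer in range(num_layers):
        for t, idx in enumerate(tokens[layer]):
            code = codebooks[layer][idx]
            frames[t] = [f + c for f, c in zip(frames[t], code)]
    return frames

def mse(a, b):
    """Mean squared error between two equally shaped lists of frames."""
    count = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / count

def make_codebook(rng, size, dim, scale):
    """Random codebook (assumption: a real system would learn these entries)."""
    codes = [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(size - 1)]
    codes.append([0.0] * dim)  # zero entry: a layer may leave the residual as-is
    return codes

rng = random.Random(0)
DIM, LAYERS, CODEBOOK_SIZE = 4, 3, 16
# Later layers use smaller-scale codes because residuals shrink layer by layer.
codebooks = [make_codebook(rng, CODEBOOK_SIZE, DIM, 1.0 / (2 ** k)) for k in range(LAYERS)]
frames = [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(8)]

tokens = residual_quantize(frames, codebooks)
errors = [mse(frames, reconstruct(tokens, codebooks, k)) for k in range(1, LAYERS + 1)]
print("base-layer tokens:", tokens[0])
print("reconstruction error by number of layers:", errors)
```

In this sketch, `tokens[0]` plays the role of the base-layer sequence that the Masked Transformer would model, and `tokens[1:]` correspond to the residual layers that the Residual Transformer would predict one layer at a time.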

Related papers

InterMask: 3D Human Interaction Generation via Collaborative Masked Modelling [27.544827331337178]
We introduce InterMask, a novel framework for generating human interactions using masked modeling in discrete space. InterMask utilizes a generative masked modeling framework to collaboratively model the tokens of two interacting individuals. With its enhanced motion representation, dedicated architecture, and effective learning strategy, InterMask achieves high-fidelity and diverse human interactions.
arXiv Detail & Related papers (2024-10-13T21:11:04Z)
Text-driven Human Motion Generation with Motion Masked Diffusion Model [23.637853270123045]
Text-driven human motion generation is a task that synthesizes human motion sequences conditioned on natural language. Current diffusion model-based approaches have outstanding performance in the diversity and multimodality of generation. We propose the Motion Masked Diffusion Model (MMDM), a novel human motion mechanism for diffusion models.
arXiv Detail & Related papers (2024-09-29T12:26:24Z)
Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer [62.29951737214263]
Existing algorithms directly generate the full sequence, which is expensive and prone to errors. We propose KeyMotion, which generates plausible human motion sequences corresponding to input text. We use a Variational Autoencoder (VAE) with Kullback-Leibler regularization to project the keyframes into a latent space. For the reverse diffusion, we propose a novel Parallel Skip Transformer that performs cross-modal attention between the keyframe latents and text condition.
arXiv Detail & Related papers (2024-05-24T11:12:37Z)
Efficient 3D Instance Mapping and Localization with Neural Fields [39.73128916618561]
We tackle the problem of learning an implicit scene representation for 3D instance segmentation from a sequence of posed RGB images. We introduce 3DIML, a novel framework that efficiently learns a neural label field which can render 3D instance segmentation masks from novel viewpoints.
arXiv Detail & Related papers (2024-03-28T19:25:25Z)
BAMM: Bidirectional Autoregressive Motion Model [14.668729995275807]
Bidirectional Autoregressive Motion Model (BAMM) is a novel text-to-motion generation framework. BAMM consists of two key components: a motion tokenizer that transforms 3D human motion into discrete tokens in latent space, and a masked self-attention transformer that autoregressively predicts randomly masked tokens. This feature enables BAMM to simultaneously achieve high-quality motion generation with enhanced usability and built-in motion editability.
arXiv Detail & Related papers (2024-03-28T14:04:17Z)
MMM: Generative Masked Motion Model [10.215003912084944]
MMM is a novel yet simple motion generation paradigm based on Masked Motion Model. By attending to motion and text tokens in all directions, MMM captures inherent dependency among motion tokens and semantic mapping between motion and text tokens. MMM is two orders of magnitude faster on a single mid-range GPU than editable motion diffusion models.
arXiv Detail & Related papers (2023-12-06T16:35:59Z)
STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition [50.064502884594376]
We study the problem of human action recognition using motion capture (MoCap) sequences. We propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences. The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models.
arXiv Detail & Related papers (2023-03-31T16:19:27Z)
Masked Autoencoding for Scalable and Generalizable Decision Making [93.84855114717062]
MaskDP is a simple and scalable self-supervised pretraining method for reinforcement learning and behavioral cloning. We find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching.
arXiv Detail & Related papers (2022-11-23T07:04:41Z)
Unsupervised Motion Representation Learning with Capsule Autoencoders [54.81628825371412]
Motion Capsule Autoencoder (MCAE) models motion in a two-level hierarchy. MCAE is evaluated on a novel Trajectory20 motion dataset and various real-world skeleton-based human action datasets.
arXiv Detail & Related papers (2021-10-01T16:52:03Z)
Mask Attention Networks: Rethinking and Strengthen Transformer [70.95528238937861]
Transformer is an attention-based neural network that consists of two sublayers: the Self-Attention Network (SAN) and the Feed-Forward Network (FFN).
arXiv Detail & Related papers (2021-03-25T04:07:44Z)
UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training [152.63467944568094]
We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks. Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks.
arXiv Detail & Related papers (2020-02-28T15:28:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.