T2M-X: Learning Expressive Text-to-Motion Generation from Partially Annotated Data
- URL: http://arxiv.org/abs/2409.13251v1
- Date: Fri, 20 Sep 2024 06:20:00 GMT
- Title: T2M-X: Learning Expressive Text-to-Motion Generation from Partially Annotated Data
- Authors: Mingdian Liu, Yilin Liu, Gurunandan Krishnan, Karl S Bayer, Bing Zhou
- Abstract summary: Existing methods only generate body motion data, excluding facial expressions and hand movements.
Recent attempts to create such a dataset have resulted in either motion inconsistency among different body parts or lower quality in data extracted from RGB videos.
We propose T2M-X, a two-stage method that learns expressive text-to-motion generation from partially annotated data.
- Score: 6.6240820702899565
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The generation of humanoid animation from text prompts can profoundly impact animation production and AR/VR experiences. However, existing methods only generate body motion data, excluding facial expressions and hand movements. This limitation, primarily due to a lack of a comprehensive whole-body motion dataset, inhibits their readiness for production use. Recent attempts to create such a dataset have resulted in either motion inconsistency among different body parts in the artificially augmented data or lower quality in the data extracted from RGB videos. In this work, we propose T2M-X, a two-stage method that learns expressive text-to-motion generation from partially annotated data. T2M-X trains three separate Vector Quantized Variational AutoEncoders (VQ-VAEs) for body, hand, and face on respective high-quality data sources to ensure high-quality motion outputs, and a Multi-indexing Generative Pretrained Transformer (GPT) model with motion consistency loss for motion generation and coordination among different body parts. Our results show significant improvements over the baselines both quantitatively and qualitatively, demonstrating its robustness against the dataset limitations.
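For a concrete picture of the two-stage design described in the abstract, the sketch below shows one plausible way the pieces could fit together: a per-part VQ-VAE that turns body, hand, or face motion into discrete tokens, a GPT-style prior with one output head per part (approximated here with a GRU backbone), and a placeholder consistency term that nudges the parts' token distributions to evolve in a coordinated way. All class names, dimensions, and the exact form of the consistency loss are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PartVQVAE(nn.Module):
    """Toy VQ-VAE for one body part (body, hands, or face).
    A real model would use temporal conv or transformer encoders/decoders."""

    def __init__(self, in_dim, code_dim=128, num_codes=512):
        super().__init__()
        self.encoder = nn.Linear(in_dim, code_dim)
        self.decoder = nn.Linear(code_dim, in_dim)
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, x):                        # x: (T, in_dim) motion features
        z = self.encoder(x)                      # (T, code_dim)
        dist = torch.cdist(z, self.codebook.weight)
        idx = dist.argmin(dim=-1)                # discrete motion tokens, (T,)
        recon = self.decoder(self.codebook(idx))
        return recon, idx


class MultiIndexPrior(nn.Module):
    """Stand-in for the multi-indexing GPT: one shared backbone, one token
    head per part, so each step predicts a (body, hand, face) index triple."""

    def __init__(self, text_dim=512, hidden=512, num_codes=512):
        super().__init__()
        # GRU as a small stand-in for a transformer backbone.
        self.backbone = nn.GRU(text_dim, hidden, batch_first=True)
        self.heads = nn.ModuleDict(
            {part: nn.Linear(hidden, num_codes) for part in ("body", "hand", "face")}
        )

    def forward(self, text_emb):                 # text_emb: (B, T, text_dim)
        h, _ = self.backbone(text_emb)
        return {part: head(h) for part, head in self.heads.items()}


def motion_consistency_loss(logits):
    """Placeholder for the motion consistency loss (exact form is an assumption):
    encourage the per-part token distributions to change by similar amounts at
    each step, as a rough proxy for cross-part coordination."""
    deltas = []
    for part_logits in logits.values():
        p = F.softmax(part_logits, dim=-1)                    # (B, T, num_codes)
        deltas.append((p[:, 1:] - p[:, :-1]).abs().sum(-1))   # (B, T-1)
    deltas = torch.stack(deltas)                              # (parts, B, T-1)
    return ((deltas - deltas.mean(dim=0, keepdim=True)) ** 2).mean()


if __name__ == "__main__":
    prior = MultiIndexPrior()
    text_emb = torch.randn(2, 16, 512)           # e.g. frozen text-encoder features
    logits = prior(text_emb)
    loss = motion_consistency_loss(logits)
    print({k: v.shape for k, v in logits.items()}, loss.item())
```

Under these assumptions, each part keeps its own codebook trained on its own high-quality data source, while a single prior ties the three token streams together, which is consistent with the abstract's motivation for learning from partially annotated data.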
Related papers
- MotionFix: Text-Driven 3D Human Motion Editing [52.11745508960547]
Key challenges include the scarcity of training data and the need to design a model that accurately edits the source motion.
We propose a methodology to semi-automatically collect a dataset of triplets comprising (i) a source motion, (ii) a target motion, and (iii) an edit text.
Access to this data allows us to train a conditional diffusion model, TMED, that takes both the source motion and the edit text as input.
arXiv Detail & Related papers (2024-08-01T16:58:50Z)
- Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches [12.221087476416056]
We introduce "motion patches", a new representation of motion sequences, and propose using Vision Transformers (ViT) as motion encoders via transfer learning.
These motion patches, created by dividing and sorting skeleton joints based on motion sequences, are robust to varying skeleton structures.
We find that transfer learning with ViT weights pre-trained on 2D image data can boost the performance of motion analysis.
arXiv Detail & Related papers (2024-05-08T02:42:27Z)
- BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics [50.88842027976421]
We propose BOTH57M, a novel multi-modal dataset for two-hand motion generation.
Our dataset includes accurate motion tracking for the human body and hands.
We also provide a strong baseline method, BOTH2Hands, for the novel task.
arXiv Detail & Related papers (2023-12-13T07:30:19Z)
- OmniMotionGPT: Animal Motion Generation with Limited Data [70.35662376853163]
We introduce AnimalML3D, the first text-animal motion dataset with 1240 animation sequences spanning 36 different animal identities.
We are able to generate animal motions with high diversity and fidelity, quantitatively and qualitatively outperforming the results of training human motion generation baselines on animal data.
arXiv Detail & Related papers (2023-11-30T07:14:00Z)
- DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions.
We show that our DiverseMotion achieves the state-of-the-art motion quality and competitive motion diversity.
arXiv Detail & Related papers (2023-09-04T05:43:48Z)
- Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation [47.272177594990104]
We introduce Make-An-Animation, a text-conditioned human motion generation model.
It learns more diverse poses and prompts from large-scale image-text datasets.
It reaches state-of-the-art performance on text-to-motion generation.
arXiv Detail & Related papers (2023-05-16T17:58:43Z)
- TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts [20.336481832461168]
Inspired by the strong ties between vision and language, our paper aims to explore the generation of 3D human full-body motions from texts.
We propose the use of motion token, a discrete and compact motion representation.
Our approach is flexible and can be used for both text2motion and motion2text tasks.
arXiv Detail & Related papers (2022-07-04T19:52:18Z)
- TEMOS: Generating diverse human motions from textual descriptions [53.85978336198444]
We address the problem of generating diverse 3D human motions from textual descriptions.
We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data.
We show that the TEMOS framework can produce both skeleton-based animations, as in prior work, as well as more expressive SMPL body motions.
arXiv Detail & Related papers (2022-04-25T14:53:06Z)