T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete
Representations
- URL: http://arxiv.org/abs/2301.06052v4
- Date: Sun, 24 Sep 2023 17:00:32 GMT
- Title: T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete
Representations
- Authors: Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong
Zhang, Hongwei Zhao, Hongtao Lu and Xi Shen
- Abstract summary: We show that a simple CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) allows us to obtain high-quality discrete representations.
Despite its simplicity, our T2M-GPT shows better performance than competitive approaches.
- Score: 34.61255243742796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we investigate a simple and well-established conditional generative
framework based on Vector Quantised-Variational AutoEncoder (VQ-VAE) and
Generative Pre-trained Transformer (GPT) for human motion generation from
textual descriptions. We show that a simple CNN-based VQ-VAE with commonly
used training recipes (EMA and Code Reset) allows us to obtain high-quality
discrete representations. For GPT, we incorporate a simple corruption strategy
during training to alleviate the training-testing discrepancy. Despite its
simplicity, our T2M-GPT shows better performance than competitive approaches,
including recent diffusion-based approaches. For example, on HumanML3D, which
is currently the largest dataset, we achieve comparable performance on the
consistency between text and generated motion (R-Precision), while our FID of
0.116 largely outperforms the 0.630 of MotionDiffuse. Additionally, we conduct analyses
on HumanML3D and observe that the dataset size is a limitation of our approach.
Our work suggests that VQ-VAE remains a competitive approach for human
motion generation.
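The corruption strategy mentioned in the abstract can be pictured with a short sketch. The snippet below is an assumed PyTorch-style illustration, not the authors' released code: during teacher-forced GPT training, a fraction of the ground-truth VQ-VAE code indices in the conditioning sequence is replaced with random codebook entries, so the model sees imperfect contexts similar to those it produces at inference time. The function name, the corruption rate `p_corrupt`, and the codebook size are illustrative assumptions.

```python
# Assumed sketch of the corruption idea described in the abstract;
# names and the corruption rate are illustrative, not taken from the paper's code.
import torch

def corrupt_codes(codes: torch.Tensor, codebook_size: int, p_corrupt: float = 0.5) -> torch.Tensor:
    """Replace each ground-truth code index with a random one with probability p_corrupt.

    codes: (batch, seq_len) integer indices produced by the VQ-VAE encoder.
    """
    mask = torch.rand(codes.shape, device=codes.device) < p_corrupt
    random_codes = torch.randint_like(codes, codebook_size)  # uniform over the codebook
    return torch.where(mask, random_codes, codes)

# During a teacher-forced GPT training step (sketch):
#   corrupted = corrupt_codes(gt_codes, codebook_size=512)
#   logits = gpt(text_embedding, corrupted[:, :-1])                  # condition on corrupted context
#   loss = F.cross_entropy(logits.transpose(1, 2), gt_codes[:, 1:])  # still predict the clean codes
```

The key point of this design is that only the conditioning context is perturbed; the training target remains the clean code sequence.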
Related papers
- Multi-Transmotion: Pre-trained Model for Human Motion Prediction [68.87010221355223]
Multi-Transmotion is an innovative transformer-based model designed for cross-modality pre-training.
Our methodology demonstrates competitive performance across various datasets on several downstream tasks.
arXiv Detail & Related papers (2024-11-04T23:15:21Z) - S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) for surgical procedures is crucial for enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes such as pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z) - T2M-HiFiGPT: Generating High Quality Human Motion from Textual
Descriptions with Residual Discrete Representations [0.7614628596146602]
T2M-HiFiGPT is a novel conditional generative framework for synthesizing human motion from textual descriptions.
We demonstrate that our CNN-based RVQ-VAE is capable of producing highly accurate 2D temporal-residual discrete motion representations.
Our findings reveal that RVQ-VAE is more adept at capturing precise 3D human motion than its VQ-VAE counterparts, at comparable computational cost.
arXiv Detail & Related papers (2023-12-17T06:58:31Z) - Contrastive Transformer Learning with Proximity Data Generation for
Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z) - TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion
Synthesis [59.465092047829835]
We present TMR, a simple yet effective approach for text to 3D human motion retrieval.
Our method extends the state-of-the-art text-to-motion synthesis model TEMOS.
We show that maintaining the motion generation loss, along with the contrastive training, is crucial to obtain good performance.
arXiv Detail & Related papers (2023-05-02T17:52:41Z) - GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language
Pre-training [47.95914618851596]
We show that two routinely applied steps during pre-training have a crucial impact on the performance of the pre-trained model.
We propose a new vision and language pre-training method, which adaptively samples mini-batches for more effective mining of hard negative samples for ITM.
Our method achieves a new state-of-the-art performance on various downstream tasks with much less computational cost.
arXiv Detail & Related papers (2022-08-08T11:15:45Z) - Back to MLP: A Simple Baseline for Human Motion Prediction [59.18776744541904]
This paper tackles the problem of human motion prediction, i.e., forecasting future body poses from previously observed sequences.
We show that the performance of these approaches can be surpassed by a lightweight, purely MLP-based architecture with only 0.14M parameters.
An exhaustive evaluation on Human3.6M, AMASS and 3DPW datasets shows that our method, which we dub siMLPe, consistently outperforms all other approaches.
arXiv Detail & Related papers (2022-07-04T16:35:58Z) - ProFormer: Learning Data-efficient Representations of Body Movement with
Prototype-based Feature Augmentation and Visual Transformers [31.908276711898548]
Methods for data-efficient recognition from body poses increasingly leverage skeleton sequences structured as image-like arrays.
We look at this paradigm from the perspective of transformer networks, for the first time exploring visual transformers as data-efficient encoders of skeleton movement.
In our pipeline, body pose sequences cast as image-like representations are converted into patch embeddings and then passed to a visual transformer backbone optimized with deep metric learning.
arXiv Detail & Related papers (2022-02-23T11:11:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.