T2M-HiFiGPT: Generating High Quality Human Motion from Textual
Descriptions with Residual Discrete Representations
- URL: http://arxiv.org/abs/2312.10628v2
- Date: Sun, 24 Dec 2023 01:31:26 GMT
- Title: T2M-HiFiGPT: Generating High Quality Human Motion from Textual
Descriptions with Residual Discrete Representations
- Authors: Congyi Wang
- Abstract summary: T2M-HiFiGPT is a novel conditional generative framework for synthesizing human motion from textual descriptions.
We demonstrate that our CNN-based RVQ-VAE is capable of producing highly accurate 2D temporal-residual discrete motion representations.
Our findings reveal that RVQ-VAE captures precise 3D human motion more faithfully than its VQ-VAE counterparts at comparable computational cost.
- Score: 0.7614628596146602
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this study, we introduce T2M-HiFiGPT, a novel conditional generative
framework for synthesizing human motion from textual descriptions. This
framework is underpinned by a Residual Vector Quantized Variational AutoEncoder
(RVQ-VAE) and a double-tier Generative Pretrained Transformer (GPT)
architecture. We demonstrate that our CNN-based RVQ-VAE is capable of producing
highly accurate 2D temporal-residual discrete motion representations. Our
proposed double-tier GPT structure comprises a temporal GPT and a residual GPT.
The temporal GPT efficiently condenses information from previous frames and
textual descriptions into a 1D context vector. This vector then serves as a
context prompt for the residual GPT, which generates the final residual
discrete indices. These indices are subsequently transformed back into motion
data by the RVQ-VAE decoder. To mitigate the exposure bias issue, we employ
straightforward code corruption techniques for RVQ and a conditional dropout
strategy, resulting in enhanced synthesis performance. Remarkably, T2M-HiFiGPT
not only simplifies the generative process but also surpasses existing methods
in both performance and parameter efficiency, including the latest
diffusion-based and GPT-based models. On the HumanML3D and KIT-ML datasets, our
framework achieves exceptional results across nearly all primary metrics. We
further validate the efficacy of our framework through comprehensive ablation
studies on the HumanML3D dataset, examining the contribution of each component.
Our findings reveal that RVQ-VAE captures precise 3D human motion more
faithfully than its VQ-VAE counterparts at comparable computational cost.
As a result, T2M-HiFiGPT enables the generation of human motion
with significantly increased accuracy, outperforming recent state-of-the-art
approaches such as T2M-GPT and Att-T2M.
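For readers who want a concrete picture of the pipeline the abstract describes, below is a minimal NumPy sketch written from the abstract alone. It is a hedged illustration, not the paper's implementation: the sizes T, D, Q, K, the random codebooks, the corrupt_codes helper, and the two GPT stubs (temporal_gpt_stub, residual_gpt_stub) are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (illustrative stand-ins, not the paper's settings):
# T motion frames, D-dim latent features, Q residual depths, K codes per depth.
T, D, Q, K = 16, 8, 4, 64

codebooks = rng.normal(size=(Q, K, D))   # one codebook per residual depth
latents = rng.normal(size=(T, D))        # stand-in for the CNN encoder output


def rvq_encode(z, codebooks):
    """Residual quantization: at each depth pick the nearest code for the
    current residual and subtract it, yielding a T x Q index grid
    (the 2D temporal-residual discrete representation)."""
    residual = z.copy()
    indices = np.zeros((z.shape[0], codebooks.shape[0]), dtype=np.int64)
    for q, cb in enumerate(codebooks):
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        indices[:, q] = idx
        residual = residual - cb[idx]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected codes over all depths to recover the latent features."""
    out = np.zeros((indices.shape[0], codebooks.shape[-1]))
    for q, cb in enumerate(codebooks):
        out += cb[indices[:, q]]
    return out


def corrupt_codes(indices, num_codes, p=0.1):
    """Toy code corruption (exposure-bias mitigation): with probability p,
    replace an index with a random codebook entry."""
    mask = rng.random(indices.shape) < p
    noise = rng.integers(0, num_codes, size=indices.shape)
    return np.where(mask, noise, indices)


# Double-tier generation loop; the two stubs stand in for the actual GPTs.
def temporal_gpt_stub(text_embedding, past_frames):
    """Stand-in for the temporal GPT: condense the text and previously
    generated frames into a 1D context vector."""
    return rng.normal(size=(D,))


def residual_gpt_stub(context):
    """Stand-in for the residual GPT: emit Q residual indices for one frame,
    prompted by the context vector."""
    return rng.integers(0, K, size=Q)


text_embedding = rng.normal(size=(D,))   # stand-in for a text-encoder output
frames = []
for _ in range(T):
    context = temporal_gpt_stub(text_embedding, frames)
    frames.append(residual_gpt_stub(context))
generated = rvq_decode(np.stack(frames), codebooks)   # (T, D) latent motion

# Round-trip check of the quantizer, with corrupted codes fed to the decoder.
codes = rvq_encode(latents, codebooks)
recon = rvq_decode(corrupt_codes(codes, K), codebooks)
print(codes.shape, generated.shape, float(((latents - recon) ** 2).mean()))
```

The sketch also shows why summing codes across depths suffices for decoding: each residual level quantizes what the previous levels missed, so deeper levels add progressively finer corrections to the reconstructed motion latent.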
Related papers
- Visual Fourier Prompt Tuning [63.66866445034855]
We propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models.
Our approach incorporates the Fast Fourier Transform into prompt embeddings and harmoniously considers both spatial and frequency domain information.
Our results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2024-11-02T18:18:35Z)
- Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation [36.93661496405653]
We take a global approach to exploiting spatio-temporal information with a concise Graph and Skipped Transformer architecture.
Specifically, in the 3D pose stage, coarse-grained body parts are deployed to construct a fully data-driven adaptive model.
Experiments are conducted on Human3.6M, MPI-INF-3DHP and Human-Eva benchmarks.
arXiv Detail & Related papers (2024-07-03T10:42:09Z)
- S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial for enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes such as pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z)
- F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis [94.10861578387443]
We explore the inference process of two mainstream T2V models using transformers and diffusion models.
We propose a training-free and generalized pruning strategy called F3-Pruning to prune redundant temporal attention weights.
Extensive experiments on three datasets using a classic transformer-based model CogVideo and a typical diffusion-based model Tune-A-Video verify the effectiveness of F3-Pruning.
arXiv Detail & Related papers (2023-12-06T12:34:47Z)
- VST++: Efficient and Stronger Visual Saliency Transformer [74.26078624363274]
We develop an efficient and stronger VST++ model to explore global long-range dependencies.
We evaluate our model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets.
arXiv Detail & Related papers (2023-10-18T05:44:49Z)
- Efficient GPT Model Pre-training using Tensor Train Matrix Representation [65.96485282393361]
Large-scale transformer models feature billions of parameters, leading to difficulties in their deployment and prohibitive training costs from scratch.
To reduce the number of parameters in the GPT-2 architecture, we replace the matrices of the fully-connected layers with the corresponding Tensor Train Matrix (TTM) structure.
The resulting GPT-based model stores up to 40% fewer parameters while showing perplexity comparable to the original model.
arXiv Detail & Related papers (2023-06-05T08:38:25Z)
- TSI-GAN: Unsupervised Time Series Anomaly Detection using Convolutional Cycle-Consistent Generative Adversarial Networks [2.4469484645516837]
Anomaly detection is widely used in network intrusion detection, autonomous driving, medical diagnosis, credit card fraud detection, etc.
This paper proposes TSI-GAN, an unsupervised anomaly detection model for time-series that can learn complex temporal patterns automatically.
We evaluate TSI-GAN using 250 well-curated and harder-than-usual datasets and compare with 8 state-of-the-art baseline methods.
arXiv Detail & Related papers (2023-03-22T23:24:47Z)
- T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations [34.61255243742796]
We show that a simple CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) allows us to obtain high-quality discrete representations.
Despite its simplicity, our T2M-GPT shows better performance than competitive approaches.
arXiv Detail & Related papers (2023-01-15T09:34:42Z)
- Leveraging Pre-trained Models for Failure Analysis Triplets Generation [0.0]
We leverage the attention mechanism of pre-trained causal (Transformer-based) language models for the downstream task of generating Failure Analysis Triplets (FATs).
We observe that Generative Pre-trained Transformer 2 (GPT2) outperformed other transformer models on the failure analysis triplet generation (FATG) task.
In particular, we observe that GPT2 (with 1.5B parameters) outperforms pre-trained BERT, BART and GPT3 by a large margin on ROUGE.
arXiv Detail & Related papers (2022-10-31T17:21:15Z)
- Back to MLP: A Simple Baseline for Human Motion Prediction [59.18776744541904]
This paper tackles the problem of human motion prediction, i.e., forecasting future body poses from historically observed sequences.
We show that the performance of prior approaches can be surpassed by a lightweight, purely MLP-based architecture with only 0.14M parameters.
An exhaustive evaluation on Human3.6M, AMASS and 3DPW datasets shows that our method, which we dub siMLPe, consistently outperforms all other approaches.
arXiv Detail & Related papers (2022-07-04T16:35:58Z)