TransFusion: A Practical and Effective Transformer-based Diffusion Model
for 3D Human Motion Prediction
- URL: http://arxiv.org/abs/2307.16106v1
- Date: Sun, 30 Jul 2023 01:52:07 GMT
- Title: TransFusion: A Practical and Effective Transformer-based Diffusion Model
for 3D Human Motion Prediction
- Authors: Sibo Tian, Minghui Zheng, and Xiao Liang
- Abstract summary: We propose TransFusion, an innovative and practical diffusion-based model for 3D human motion prediction.
Our model leverages Transformer as the backbone with long skip connections between shallow and deep layers.
In contrast to prior diffusion-based models that utilize extra modules like cross-attention and adaptive layer normalization, we treat all inputs, including conditions, as tokens to create a more lightweight model.
- Score: 1.8923948104852863
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Predicting human motion plays a crucial role in ensuring a safe and effective
human-robot close collaboration in intelligent remanufacturing systems of the
future. Existing works can be categorized into two groups: those focusing on
accuracy, predicting a single future motion, and those generating diverse
predictions based on observations. The former group fails to address the
uncertainty and multi-modal nature of human motion, while the latter group
often produces motion sequences that deviate too far from the ground truth or
become unrealistic within historical contexts. To tackle these issues, we
propose TransFusion, an innovative and practical diffusion-based model for 3D
human motion prediction which can generate samples that are more likely to
happen while maintaining a certain level of diversity. Our model leverages
Transformer as the backbone with long skip connections between shallow and deep
layers. Additionally, we employ the discrete cosine transform to model motion
sequences in the frequency space, thereby improving performance. In contrast to
prior diffusion-based models that utilize extra modules like cross-attention
and adaptive layer normalization to condition the prediction on past observed
motion, we treat all inputs, including conditions, as tokens to create a more
lightweight model compared to existing approaches. Extensive experimental
studies are conducted on benchmark datasets to validate the effectiveness of
our human motion prediction model.
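To make the architecture described above concrete, the sketch below (a minimal illustration, not the authors' implementation) shows the two ingredients the abstract emphasizes: a DCT/IDCT pair for representing motion sequences in frequency space, and a Transformer denoiser that receives the diffusion step and the observed motion purely as additional tokens, with long skip connections between shallow and deep layers. All module names, layer counts, and dimensions are illustrative assumptions.
```python
# Minimal sketch of the ideas in the abstract; NOT the paper's released code.
# Shapes, layer counts, and names are assumptions for illustration only.
import numpy as np
import torch
import torch.nn as nn
from scipy.fft import dct, idct


def motion_to_dct(motion, n_coeff=20):
    """DCT along the time axis; keeping the first n_coeff coefficients gives a
    compact frequency-space representation of a (T, J) motion sequence."""
    return dct(motion, type=2, norm="ortho", axis=0)[:n_coeff]


def dct_to_motion(coeff, seq_len):
    """Inverse transform back to the time domain (zero-padding dropped bands)."""
    full = np.zeros((seq_len, coeff.shape[1]))
    full[: coeff.shape[0]] = coeff
    return idct(full, type=2, norm="ortho", axis=0)


class SkipTransformerDenoiser(nn.Module):
    """Denoiser in which the diffusion step and the observed motion are ordinary
    tokens (no cross-attention or adaptive layer norm), and shallow layers are
    linked to deep layers by long skip connections."""

    def __init__(self, feat_dim=48, d_model=256, n_heads=4, n_layers=8, n_steps=1000):
        super().__init__()
        assert n_layers % 2 == 0
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.embed_motion = nn.Linear(feat_dim, d_model)   # DCT rows -> tokens
        self.embed_step = nn.Embedding(n_steps, d_model)   # diffusion-step token
        self.shallow = nn.ModuleList(make_layer() for _ in range(n_layers // 2))
        self.deep = nn.ModuleList(make_layer() for _ in range(n_layers // 2))
        self.skip = nn.ModuleList(
            nn.Linear(2 * d_model, d_model) for _ in range(n_layers // 2))
        self.head = nn.Linear(d_model, feat_dim)

    def forward(self, noisy_future, observed, t):
        # noisy_future: (B, Nf, feat_dim)  observed: (B, No, feat_dim)  t: (B,)
        x = torch.cat([
            self.embed_step(t)[:, None, :],    # condition token: diffusion step
            self.embed_motion(observed),        # condition tokens: past motion
            self.embed_motion(noisy_future),    # tokens to be denoised
        ], dim=1)
        saved = []
        for layer in self.shallow:              # first half: remember activations
            x = layer(x)
            saved.append(x)
        for layer, proj in zip(self.deep, self.skip):
            x = proj(torch.cat([x, saved.pop()], dim=-1))  # long skip connection
            x = layer(x)
        n_future = noisy_future.shape[1]
        return self.head(x[:, -n_future:, :])   # prediction for the future tokens
```
A standard diffusion training loop would then add noise to the DCT coefficients of the future motion, call the denoiser with the clean observation tokens and the sampled step, and regress the noise (or the clean coefficients); that loop is omitted here for brevity.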
Related papers
- Multi-Transmotion: Pre-trained Model for Human Motion Prediction [68.87010221355223]
Multi-Transmotion is an innovative transformer-based model designed for cross-modality pre-training.
Our methodology demonstrates competitive performance across various datasets on several downstream tasks.
arXiv Detail & Related papers (2024-11-04T23:15:21Z) - SPOTR: Spatio-temporal Pose Transformers for Human Motion Prediction [12.248428883804763]
3D human motion prediction is a research area of high significance and a challenge in computer vision.
Traditionally, autoregressive models have been used to predict human motion.
We present a non-autoregressive model for human motion prediction.
arXiv Detail & Related papers (2023-03-11T01:44:29Z) - Executing your Commands via Motion Diffusion in Latent Space [51.64652463205012]
We propose a Motion Latent-based Diffusion model (MLD) to produce vivid motion sequences conforming to the given conditional inputs.
Our MLD achieves significant improvements over the state-of-the-art methods across extensive human motion generation tasks.
arXiv Detail & Related papers (2022-12-08T03:07:00Z) - Human Joint Kinematics Diffusion-Refinement for Stochastic Motion
Prediction [22.354538952573158]
MotionDiff is a diffusion probabilistic model that treats the kinematics of human joints as heated particles.
MotionDiff consists of two parts: a spatial-temporal transformer-based diffusion network to generate diverse yet plausible motions, and a graph convolutional network to further refine the outputs.
arXiv Detail & Related papers (2022-10-12T07:38:33Z) - A generic diffusion-based approach for 3D human pose prediction in the
wild [68.00961210467479]
3D human pose forecasting, i.e., predicting a sequence of future human 3D poses given a sequence of past observed ones, is a challenging spatio-temporal task.
We provide a unified formulation in which incomplete elements (whether in the prediction or the observation) are treated as noise, and propose a conditional diffusion model that denoises them and forecasts plausible poses.
We investigate our findings on four standard datasets and obtain significant improvements over the state-of-the-art.
arXiv Detail & Related papers (2022-10-11T17:59:54Z) - Weakly-supervised Action Transition Learning for Stochastic Human Motion
Prediction [81.94175022575966]
We introduce the task of action-driven human motion prediction.
It aims to predict multiple plausible future motions given a sequence of action labels and a short motion history.
arXiv Detail & Related papers (2022-05-31T08:38:07Z) - HiT-DVAE: Human Motion Generation via Hierarchical Transformer Dynamical
VAE [37.23381308240617]
We propose Hierarchical Transformer Dynamical Variational Autoencoder, HiT-DVAE, which implements auto-regressive generation with transformer-like attention mechanisms.
We evaluate the proposed method on HumanEva-I and Human3.6M with various evaluation methods, and outperform the state-of-the-art methods on most of the metrics.
arXiv Detail & Related papers (2022-04-04T15:12:34Z) - Stochastic Trajectory Prediction via Motion Indeterminacy Diffusion [88.45326906116165]
We present a new framework to formulate the trajectory prediction task as a reverse process of motion indeterminacy diffusion (MID).
We encode the history behavior information and the social interactions as a state embedding and devise a Transformer-based diffusion model to capture the temporal dependencies of trajectories.
Experiments on the human trajectory prediction benchmarks including the Stanford Drone and ETH/UCY datasets demonstrate the superiority of our method.
arXiv Detail & Related papers (2022-03-25T16:59:08Z) - Learning to Predict Diverse Human Motions from a Single Image via
Mixture Density Networks [9.06677862854201]
We propose a novel approach to predicting future human motions from a single image, using mixture density network (MDN) modeling.
Contrary to most existing deep human motion prediction approaches, the multimodal nature of MDN enables the generation of diverse future motion hypotheses.
Our trained model directly takes an image as input and generates multiple plausible motions that satisfy the given condition.
arXiv Detail & Related papers (2021-09-13T08:49:33Z) - Generating Smooth Pose Sequences for Diverse Human Motion Prediction [90.45823619796674]
We introduce a unified deep generative network for both diverse and controllable motion prediction.
Our experiments on two standard benchmark datasets, Human3.6M and HumanEva-I, demonstrate that our approach outperforms the state-of-the-art baselines in terms of both sample diversity and accuracy.
arXiv Detail & Related papers (2021-08-19T00:58:00Z) - FloMo: Tractable Motion Prediction with Normalizing Flows [0.0]
We model motion prediction as a density estimation problem with a normalizing flow between a noise sample and the future motion distribution.
Our model, named FloMo, allows likelihoods to be computed in a single network pass and can be trained directly with maximum likelihood estimation.
Our method achieves state-of-the-art performance on three popular prediction datasets, with a significant gap to most competing models.
arXiv Detail & Related papers (2021-03-05T11:35:27Z)
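The FloMo entry above hinges on a property that one sentence can obscure: because a normalizing flow is an invertible map with a tractable Jacobian, the exact likelihood of a future motion given the observed history is available from a single forward pass and can be maximized directly. FloMo's actual architecture is not described here, so the following sketch uses a generic conditional affine-coupling (RealNVP-style) flow purely as an illustration; every name, size, and layer count is an assumption.
```python
# Generic conditional coupling flow, included only to illustrate single-pass
# likelihood computation and MLE training; it is NOT FloMo's architecture.
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    """One coupling block: the first half of the features passes through
    unchanged and, together with the condition, parameterizes an affine map
    of the second half."""

    def __init__(self, dim, cond_dim, hidden=128):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x, cond):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(torch.cat([x1, cond], dim=-1)).chunk(2, dim=-1)
        s = torch.tanh(s)                      # bounded scales for stability
        z2 = x2 * torch.exp(s) + t
        return torch.cat([x1, z2], dim=-1), s.sum(dim=-1)   # output, log|det J|


class ConditionalFlow(nn.Module):
    def __init__(self, motion_dim, cond_dim, n_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            AffineCoupling(motion_dim, cond_dim) for _ in range(n_blocks))
        self.base = torch.distributions.Normal(0.0, 1.0)

    def log_prob(self, future, cond):
        # Exact log-likelihood in one pass via the change-of-variables formula.
        z, log_det = future, 0.0
        for i, block in enumerate(self.blocks):
            if i % 2 == 1:
                z = z.flip(-1)   # cheap permutation so both halves get transformed
            z, ld = block(z, cond)
            log_det = log_det + ld
        return self.base.log_prob(z).sum(dim=-1) + log_det


# Maximum-likelihood training then reduces to:
#   loss = -flow.log_prob(flattened_future_motion, history_encoding).mean()
```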
This list is automatically generated from the titles and abstracts of the papers on this site.