3D human motion generation from the text via gesture action
classification and the autoregressive model
- URL: http://arxiv.org/abs/2211.10003v1
- Date: Fri, 18 Nov 2022 03:05:49 GMT
- Title: 3D human motion generation from the text via gesture action
classification and the autoregressive model
- Authors: Gwantae Kim, Youngsuk Ryu, Junyeop Lee, David K. Han, Jeongmin Bae and
Hanseok Ko
- Abstract summary: The model focuses on generating special gestures that express human thinking, such as waving and nodding.
With several experiments, the proposed method successfully generates perceptually natural and realistic 3D human motion from the text.
- Score: 28.76063248241159
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, a deep learning-based model for 3D human motion generation
from text is proposed via gesture action classification and an
autoregressive model. The model focuses on generating special gestures that
express human thinking, such as waving and nodding. To achieve this goal, the
proposed method predicts the expression from sentences using a text
classification model based on a pretrained language model and generates
gestures using a gated recurrent unit (GRU)-based autoregressive model. In particular,
we propose a loss on the embedding space for restoring raw motions and
generating intermediate motions well. Moreover, a novel data augmentation
method and a stop token are proposed to generate variable-length motions. To
evaluate the text classification model and the 3D human motion generation model, a
gesture action classification dataset and an action-based gesture dataset are
collected. In several experiments, the proposed method successfully generates
perceptually natural and realistic 3D human motion from the text. Moreover, we
verify the effectiveness of the proposed method using a publicly available
action recognition dataset to evaluate cross-dataset generalization
performance.
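The abstract's pipeline (predict a gesture class from the sentence, then generate frames autoregressively until a stop token fires) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: `generate_motion`, `toy_step`, the 0.5 stop threshold, and the per-class durations are hypothetical, and `toy_step` is a hand-written stand-in for a trained GRU cell with pose and stop-token output heads.

```python
import math
from typing import Callable, List, Tuple

Pose = List[float]   # one frame: flattened joint coordinates (toy size)
State = List[float]  # recurrent hidden state (toy: a single frame counter)
StepFn = Callable[[Pose, State, int], Tuple[Pose, State, float]]


def generate_motion(step: StepFn, class_id: int, start_pose: Pose,
                    init_state: State, max_frames: int = 200,
                    stop_threshold: float = 0.5) -> List[Pose]:
    """Roll out frames one at a time, feeding each output back as input.

    Generation ends when the stop probability reaches the threshold,
    or when max_frames is hit, so clip length is not fixed in advance.
    """
    frames = [start_pose]
    state = init_state
    for _ in range(max_frames - 1):
        next_pose, state, stop_prob = step(frames[-1], state, class_id)
        frames.append(next_pose)
        if stop_prob >= stop_threshold:
            break
    return frames


def toy_step(prev_pose: Pose, state: State, class_id: int):
    """Hypothetical stand-in for a trained GRU cell plus output heads."""
    t = state[0] + 1.0
    # Assumption: each gesture class has its own typical duration.
    target_len = 5.0 * (class_id + 1)
    next_pose = [math.sin(0.3 * t + j) for j in range(len(prev_pose))]
    # Sigmoid stop signal that rises as the clip approaches its duration.
    stop_prob = 1.0 / (1.0 + math.exp(-(t - target_len)))
    return next_pose, [t], stop_prob


# class_id would come from the text classifier; here it is given directly.
wave = generate_motion(toy_step, class_id=0, start_pose=[0.0] * 6, init_state=[0.0])
nod = generate_motion(toy_step, class_id=1, start_pose=[0.0] * 6, init_state=[0.0])
print(len(wave), len(nod))
```

In this toy setup the two classes produce clips of different lengths from the same loop, which is the role the abstract assigns to the stop token; the `max_frames` cap is an added safeguard in case the stop signal never fires.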
Related papers
- MotionFix: Text-Driven 3D Human Motion Editing [52.11745508960547]
Key challenges include the scarcity of training data and the need to design a model that accurately edits the source motion.
We propose a methodology to semi-automatically collect a dataset of triplets comprising (i) a source motion, (ii) a target motion, and (iii) an edit text.
Access to this data allows us to train a conditional diffusion model, TMED, that takes both the source motion and the edit text as input.
arXiv Detail & Related papers (2024-08-01T16:58:50Z)
- Semantics-aware Motion Retargeting with Vision-Language Models [19.53696208117539]
We present a novel Semantics-aware Motion reTargeting (SMT) method that leverages vision-language models to extract and maintain meaningful motion semantics.
We use a differentiable module to render 3D motions; the high-level motion semantics are then incorporated into the retargeting process by feeding the vision-language model and aligning the extracted semantic embeddings.
To preserve both fine-grained motion details and high-level semantics, we adopt a two-stage pipeline consisting of skeleton-aware pre-training and fine-tuning with semantics and geometry constraints.
arXiv Detail & Related papers (2023-12-04T15:23:49Z)
- Unsupervised 3D Pose Estimation with Non-Rigid Structure-from-Motion Modeling [83.76377808476039]
We propose a new modeling method for human pose deformations and design an accompanying diffusion-based motion prior.
Inspired by the field of non-rigid structure-from-motion, we divide the task of reconstructing 3D human skeletons in motion into the estimation of a 3D reference skeleton and of a per-frame skeleton deformation.
A mixed spatial-temporal NRSfMformer is used to simultaneously estimate the 3D reference skeleton and the skeleton deformation of each frame from a sequence of 2D observations.
arXiv Detail & Related papers (2023-08-18T16:41:57Z)
- Diffusion Action Segmentation [63.061058214427085]
We propose a novel framework based on denoising diffusion models, which shares the spirit of iterative refinement.
In this framework, action predictions are iteratively generated from random noise with input video features as conditions.
arXiv Detail & Related papers (2023-03-31T10:53:24Z)
- Diffusion Motion: Generate Text-Guided 3D Human Motion by Diffusion Model [7.381316531478522]
We propose a simple and novel method for generating 3D human motion from complex natural language sentences.
We use the Denoising Diffusion Probabilistic Model to generate diverse motion results under the guidance of texts.
Our experiments demonstrate that our model achieves quantitatively competitive results on the HumanML3D test set and can generate more visually natural and diverse examples.
arXiv Detail & Related papers (2022-10-22T00:41:17Z)
- Multi-level Motion Attention for Human Motion Prediction [132.29963836262394]
We study the use of different types of attention, computed at joint, body part, and full pose levels.
Our experiments on Human3.6M, AMASS and 3DPW validate the benefits of our approach for both periodical and non-periodical actions.
arXiv Detail & Related papers (2021-06-17T08:08:11Z)
- HuMoR: 3D Human Motion Model for Robust Pose Estimation [100.55369985297797]
HuMoR is a 3D Human Motion Model for Robust Estimation of temporal pose and shape.
We introduce a conditional variational autoencoder, which learns a distribution of the change in pose at each step of a motion sequence.
We demonstrate that our model generalizes to diverse motions and body shapes after training on a large motion capture dataset.
arXiv Detail & Related papers (2021-05-10T21:04:55Z)
- Graph-based Normalizing Flow for Human Motion Generation and Reconstruction [20.454140530081183]
We propose a probabilistic generative model to synthesize and reconstruct long-horizon motion sequences conditioned on past information and control signals.
We evaluate the models on a mixture of motion capture datasets of human locomotion with foot-step and bone-length analysis.
arXiv Detail & Related papers (2021-04-07T09:51:15Z)
- History Repeats Itself: Human Motion Prediction via Motion Attention [81.94175022575966]
We introduce an attention-based feed-forward network that explicitly leverages the observation that human motion tends to repeat itself.
In particular, we propose to extract motion attention to capture the similarity between the current motion context and the historical motion sub-sequences.
Our experiments on Human3.6M, AMASS and 3DPW evidence the benefits of our approach for both periodical and non-periodical actions.
arXiv Detail & Related papers (2020-07-23T02:12:27Z)
- Moving fast and slow: Analysis of representations and post-processing in speech-driven automatic gesture generation [7.6857153840014165]
We extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning.
Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates.
We conclude that it is important to take both motion representation and post-processing into account when designing an automatic gesture-production method.
arXiv Detail & Related papers (2020-07-16T07:32:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.