3D Human Motion Generation from the Text via Gesture Action
Classification and the Autoregressive Model
- URL: http://arxiv.org/abs/2211.10003v1
- Date: Fri, 18 Nov 2022 03:05:49 GMT
- Title: 3D Human Motion Generation from the Text via Gesture Action
Classification and the Autoregressive Model
- Authors: Gwantae Kim, Youngsuk Ryu, Junyeop Lee, David K. Han, Jeongmin Bae and
Hanseok Ko
- Abstract summary: The model focuses on generating special gestures that express human thinking, such as waving and nodding.
In several experiments, the proposed method successfully generates perceptually natural and realistic 3D human motion from text.
- Score: 28.76063248241159
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, a deep learning-based model for 3D human motion
generation from text is proposed via gesture action classification and an
autoregressive model. The model focuses on generating special gestures that
express human thinking, such as waving and nodding. To achieve this goal, the
proposed method predicts the expression class from a sentence using a text
classification model built on a pretrained language model, and generates
gestures using a gated recurrent unit (GRU)-based autoregressive model. In
particular, we propose a loss on the embedding space that helps the model
restore raw motions and generate intermediate motions well. Moreover, a novel
data augmentation method and a stop token are proposed to generate
variable-length motions. To evaluate the text classification model and the 3D
human motion generation model, a gesture action classification dataset and an
action-based gesture dataset are collected. In several experiments, the
proposed method successfully generates perceptually natural and realistic 3D
human motion from text. Moreover, we verify the effectiveness of the proposed
method on a publicly available action recognition dataset to evaluate
cross-dataset generalization performance.
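As a rough illustration of the pipeline the abstract describes, the sketch below shows a GRU-based autoregressive generator that emits one pose frame at a time and halts on a learned stop token to produce variable-length motions. It is a minimal sketch under stated assumptions, not the authors' released code: the layer sizes, class names, and pose dimensionality are all placeholders.

```python
# Minimal sketch (assumed architecture, not the authors' code) of a
# GRU-based autoregressive motion generator with a learned stop token.
import torch
import torch.nn as nn

class AutoregressiveMotionGenerator(nn.Module):
    def __init__(self, num_classes, pose_dim=72, hidden_dim=512):
        super().__init__()
        self.pose_dim = pose_dim
        # The gesture class (from the text classifier) conditions generation.
        self.label_embed = nn.Embedding(num_classes, hidden_dim)
        self.gru = nn.GRU(pose_dim, hidden_dim, batch_first=True)
        self.pose_head = nn.Linear(hidden_dim, pose_dim)  # next-frame pose
        self.stop_head = nn.Linear(hidden_dim, 1)         # stop-token logit

    @torch.no_grad()
    def generate(self, gesture_class, max_frames=120, stop_threshold=0.5):
        # Initial hidden state from the class embedding: (1, batch, hidden).
        h = self.label_embed(gesture_class).unsqueeze(0)
        frame = torch.zeros(gesture_class.size(0), 1, self.pose_dim)
        poses = []
        for _ in range(max_frames):
            out, h = self.gru(frame, h)
            frame = self.pose_head(out)  # prediction is fed back next step
            poses.append(frame)
            # The learned stop token makes the output length variable.
            if torch.sigmoid(self.stop_head(out)).mean() > stop_threshold:
                break
        return torch.cat(poses, dim=1)   # (batch, frames, pose_dim)
```

At inference, a class index predicted by the text classifier would drive generation, e.g. `gen.generate(torch.tensor([3]))` for a hypothetical "nodding" class.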
Related papers
- Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance [48.986552871497]
We introduce a novel two-stage framework that employs scene affordance as an intermediate representation.
By leveraging scene affordance maps, our method overcomes the difficulty in generating human motion under multimodal condition signals.
Our approach consistently outperforms all baselines on established benchmarks, including HumanML3D and HUMANISE.
arXiv Detail & Related papers (2024-03-26T18:41:07Z) - Semantics-aware Motion Retargeting with Vision-Language Models [19.53696208117539]
We present a novel Semantics-aware Motion reTargeting (SMT) method that leverages vision-language models to extract and maintain meaningful motion semantics.
We utilize a differentiable module to render 3D motions, and high-level motion semantics are incorporated into the retargeting process by feeding the rendered motions to the vision-language model and aligning the extracted semantic embeddings.
To preserve fine-grained motion details and high-level semantics, we adopt a two-stage pipeline consisting of skeleton-aware pre-training and fine-tuning with semantics and geometry constraints.
arXiv Detail & Related papers (2023-12-04T15:23:49Z) - Unsupervised 3D Pose Estimation with Non-Rigid Structure-from-Motion
Modeling [83.76377808476039]
We propose a new modeling method for human pose deformations and design an accompanying diffusion-based motion prior.
Inspired by the field of non-rigid structure-from-motion, we divide the task of reconstructing 3D human skeletons in motion into the estimation of a 3D reference skeleton and a per-frame skeleton deformation.
A mixed spatial-temporal NRSfMformer is used to simultaneously estimate the 3D reference skeleton and the skeleton deformation of each frame from a sequence of 2D observations.
arXiv Detail & Related papers (2023-08-18T16:41:57Z) - Diffusion Action Segmentation [63.061058214427085]
We propose a novel framework built on denoising diffusion models, which share the same inherent spirit of iterative refinement.
In this framework, action predictions are iteratively generated from random noise with input video features as conditions.
arXiv Detail & Related papers (2023-03-31T10:53:24Z) - Diffusion Motion: Generate Text-Guided 3D Human Motion by Diffusion
Model [7.381316531478522]
We propose a simple and novel method for generating 3D human motion from complex natural language sentences.
We use the Denoising Diffusion Probabilistic Model to generate diverse motion results under the guidance of texts.
Our experiments demonstrate that our model achieves competitive quantitative results on the HumanML3D test set and can generate more visually natural and diverse examples.
arXiv Detail & Related papers (2022-10-22T00:41:17Z) - Multi-level Motion Attention for Human Motion Prediction [132.29963836262394]
We study the use of different types of attention, computed at joint, body part, and full pose levels.
Our experiments on Human3.6M, AMASS and 3DPW validate the benefits of our approach for both periodic and non-periodic actions.
arXiv Detail & Related papers (2021-06-17T08:08:11Z) - HuMoR: 3D Human Motion Model for Robust Pose Estimation [100.55369985297797]
HuMoR is a 3D Human Motion Model for Robust Estimation of temporal pose and shape.
We introduce a conditional variational autoencoder that learns a distribution over the change in pose at each step of a motion sequence; a toy sketch of such a transition model appears after this list.
We demonstrate that our model generalizes to diverse motions and body shapes after training on a large motion capture dataset.
arXiv Detail & Related papers (2021-05-10T21:04:55Z) - Graph-based Normalizing Flow for Human Motion Generation and
Reconstruction [20.454140530081183]
We propose a probabilistic generative model to synthesize and reconstruct long-horizon motion sequences conditioned on past information and control signals.
We evaluate the models on a mixture of motion capture datasets of human locomotion with foot-step and bone-length analysis.
arXiv Detail & Related papers (2021-04-07T09:51:15Z) - History Repeats Itself: Human Motion Prediction via Motion Attention [81.94175022575966]
We introduce an attention-based feed-forward network that explicitly leverages the observation that human motion tends to repeat itself.
In particular, we propose to extract motion attention to capture the similarity between the current motion context and the historical motion sub-sequences.
Our experiments on Human3.6M, AMASS and 3DPW demonstrate the benefits of our approach for both periodic and non-periodic actions.
arXiv Detail & Related papers (2020-07-23T02:12:27Z) - Moving fast and slow: Analysis of representations and post-processing in
speech-driven automatic gesture generation [7.6857153840014165]
We extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning.
Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates.
We conclude that it is important to take both motion representation and post-processing into account when designing an automatic gesture-production method.
arXiv Detail & Related papers (2020-07-16T07:32:00Z)