UDE: A Unified Driving Engine for Human Motion Generation
- URL: http://arxiv.org/abs/2211.16016v1
- Date: Tue, 29 Nov 2022 08:30:52 GMT
- Title: UDE: A Unified Driving Engine for Human Motion Generation
- Authors: Zixiang Zhou, Baoyuan Wang
- Abstract summary: UDE is the first unified driving engine that enables generating human motion sequences from natural language or audio sequences.
We evaluate our method on HumanML3D\cite{Guo_2022_CVPR} and AIST++\cite{li2021learn} benchmarks.
- Score: 16.32286289924454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating controllable and editable human motion sequences is a key
challenge in 3D Avatar generation. It has been labor-intensive to generate and
animate human motion for a long time until learning-based approaches have been
developed and applied recently. However, these approaches are still
task-specific or modality-specific\cite{ahuja2019language2pose}\cite{ghosh2021synthesis}\cite{ferreira2021learning}\cite{li2021ai}.
In this paper, we propose ``UDE'', the first unified driving engine that enables
generating human motion sequences from natural language or audio sequences (see
Fig.~\ref{fig:teaser}). Specifically, UDE consists of the following key
components: 1) a motion quantization module based on VQVAE that represents a
continuous motion sequence as discrete latent codes\cite{van2017neural}, 2) a
modality-agnostic transformer encoder\cite{vaswani2017attention} that learns to
map modality-aware driving signals to a joint space, 3) a unified token
transformer (GPT-like\cite{radford2019language}) network to predict the
quantized latent code index in an auto-regressive manner, and 4) a diffusion motion
decoder that takes as input the motion tokens and decodes them into motion
sequences with high diversity. We evaluate our method on
HumanML3D\cite{Guo_2022_CVPR} and AIST++\cite{li2021learn} benchmarks, and the
experimental results demonstrate that our method achieves state-of-the-art
performance. Project website: \url{https://github.com/zixiangzhou916/UDE/}
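The abstract describes a four-stage pipeline: VQ-VAE quantization of motion, a modality-agnostic encoder for the driving signal, GPT-like auto-regressive prediction of code indices, and a diffusion motion decoder. The PyTorch sketch below shows how the first three stages fit together. The module names (MotionVQVAE, TokenGPT), layer sizes, and the nearest-neighbour quantization step are illustrative assumptions, not the authors' released implementation (see the project repository for that).

```python
# Minimal sketch of a UDE-style pipeline (illustrative assumptions, not the paper's code):
# 1) a VQ-VAE maps a continuous motion sequence to discrete codebook indices,
# 2) a modality-agnostic encoder produces a conditioning sequence,
# 3) a GPT-like decoder predicts the next code index auto-regressively.
import torch
import torch.nn as nn

class MotionVQVAE(nn.Module):
    """Quantizes pose frames into indices of a learned codebook (sizes are assumed)."""
    def __init__(self, pose_dim=72, latent_dim=256, codebook_size=1024):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(pose_dim, latent_dim), nn.ReLU(),
                                     nn.Linear(latent_dim, latent_dim))
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.ReLU(),
                                     nn.Linear(latent_dim, pose_dim))

    def quantize(self, motion):                          # motion: (B, T, pose_dim)
        z = self.encoder(motion)                          # (B, T, latent_dim)
        flat = z.flatten(0, 1)                            # (B*T, latent_dim)
        dist = torch.cdist(flat, self.codebook.weight)    # distance to every code
        return dist.argmin(dim=-1).view(z.shape[:2])      # discrete code indices (B, T)

    def decode_indices(self, indices):                    # (B, T) -> (B, T, pose_dim)
        return self.decoder(self.codebook(indices))

class TokenGPT(nn.Module):
    """GPT-like decoder that predicts the next motion token given the driving condition."""
    def __init__(self, codebook_size=1024, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.token_emb = nn.Embedding(codebook_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, codebook_size)

    def forward(self, tokens, condition):                 # tokens: (B, T), condition: (B, S, d_model)
        x = self.token_emb(tokens)
        T = tokens.size(1)
        mask = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)  # causal mask
        h = self.decoder(x, condition, tgt_mask=mask)
        return self.head(h)                               # logits over the codebook (B, T, K)

# Toy usage: 2 motion clips of 16 frames, with an 8-step conditioning sequence that stands
# in for the output of the modality-agnostic (text/audio) encoder.
vqvae, gpt = MotionVQVAE(), TokenGPT()
motion = torch.randn(2, 16, 72)
condition = torch.randn(2, 8, 256)
tokens = vqvae.quantize(motion)                           # discrete latent codes
logits = gpt(tokens, condition)                           # next-token prediction logits
recon = vqvae.decode_indices(logits.argmax(-1))           # UDE instead decodes tokens with a diffusion decoder
print(tokens.shape, logits.shape, recon.shape)
```

The last line uses the plain VQ decoder only to keep the sketch self-contained; stage 4 of UDE replaces it with a diffusion motion decoder to increase the diversity of the decoded motion.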
Related papers
- Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer [62.29951737214263]
Existing algorithms directly generate the full sequence, which is expensive and prone to errors.
We propose KeyMotion, which generates plausible human motion sequences corresponding to input text.
We use a Variational Autoencoder (VAE) with Kullback-Leibler regularization to project the motion into a latent space.
For the reverse diffusion, we propose a novel Parallel Skip Transformer that performs cross-modal attention between the design latents and text condition.
arXiv Detail & Related papers (2024-05-24T11:12:37Z) - SpeechAct: Towards Generating Whole-body Motion from Speech [33.10601371020488]
This paper addresses the problem of generating whole-body motion from speech.
We present a novel hybrid point representation to achieve accurate and continuous motion generation.
We also propose a contrastive motion learning method to encourage the model to produce more distinctive representations.
arXiv Detail & Related papers (2023-11-29T07:57:30Z) - Hierarchical Generation of Human-Object Interactions with Diffusion Probabilistic Models [71.64318025625833]
This paper presents a novel approach to generating the 3D motion of a human interacting with a target object.
Our framework first generates a set of milestones and then synthesizes the motion along them.
The experiments on the NSM, COUCH, and SAMP datasets show that our approach outperforms previous methods by a large margin in both quality and diversity.
arXiv Detail & Related papers (2023-10-03T17:50:23Z) - Co-Speech Gesture Synthesis using Discrete Gesture Token Learning [1.1694169299062596]
Synthesizing realistic co-speech gestures is an important and yet unsolved problem for creating believable motions.
One challenge in learning the co-speech gesture model is that there may be multiple viable gesture motions for the same speech utterance.
We propose a two-stage model to address this uncertainty issue in gesture synthesis by modeling the gesture segments as discrete latent codes.
arXiv Detail & Related papers (2023-03-04T01:42:09Z) - Executing your Commands via Motion Diffusion in Latent Space [51.64652463205012]
We propose a Motion Latent-based Diffusion model (MLD) to produce vivid motion sequences conforming to the given conditional inputs.
Our MLD achieves significant improvements over state-of-the-art methods across a wide range of human motion generation tasks; a minimal sketch of diffusion in a latent space appears after this list.
arXiv Detail & Related papers (2022-12-08T03:07:00Z) - Audio-Driven Co-Speech Gesture Video Generation [92.15661971086746]
We define and study this challenging problem of audio-driven co-speech gesture video generation.
Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
We propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns.
arXiv Detail & Related papers (2022-12-05T15:28:22Z) - MotionBERT: A Unified Perspective on Learning Human Motion Representations [46.67364057245364]
We present a unified perspective on tackling various human-centric video tasks by learning human motion representations from large-scale and heterogeneous data resources.
We propose a pretraining stage in which a motion encoder is trained to recover the underlying 3D motion from noisy partial 2D observations.
We implement the motion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer) neural network.
arXiv Detail & Related papers (2022-10-12T19:46:25Z) - Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z) - Action-Conditioned 3D Human Motion Synthesis with Transformer VAE [44.523477804533364]
We tackle the problem of action-conditioned generation of realistic and diverse human motion sequences.
In contrast to methods that complete, or extend, motion sequences, this task does not require an initial pose or sequence.
We learn an action-aware latent representation for human motions by training a generative variational autoencoder.
arXiv Detail & Related papers (2021-04-12T17:40:27Z)
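Both UDE's fourth component (the diffusion motion decoder) and the MLD entry above apply denoising diffusion to motion latent vectors. The sketch below illustrates DDPM-style training and sampling over such latents; the linear noise schedule, the toy MLP denoiser, and the 256-dimensional latent are assumptions made for the example and do not come from either paper.

```python
# Minimal DDPM-style sketch over motion latent vectors (illustrative assumptions only).
import torch
import torch.nn as nn

T_STEPS, LATENT_DIM = 100, 256
betas = torch.linspace(1e-4, 0.02, T_STEPS)           # assumed linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

# A toy denoiser that predicts the injected noise from (noisy latent, timestep).
denoiser = nn.Sequential(nn.Linear(LATENT_DIM + 1, 512), nn.SiLU(),
                         nn.Linear(512, LATENT_DIM))

def training_loss(clean_latent):
    """Sample a timestep, corrupt the latent, and regress the injected noise."""
    b = clean_latent.size(0)
    t = torch.randint(0, T_STEPS, (b,))
    noise = torch.randn_like(clean_latent)
    a_bar = alphas_cumprod[t].unsqueeze(-1)
    noisy = a_bar.sqrt() * clean_latent + (1 - a_bar).sqrt() * noise
    t_feat = t.float().unsqueeze(-1) / T_STEPS         # crude timestep conditioning
    pred = denoiser(torch.cat([noisy, t_feat], dim=-1))
    return nn.functional.mse_loss(pred, noise)

@torch.no_grad()
def sample(batch_size=2):
    """Iteratively denoise Gaussian noise back to a motion latent."""
    x = torch.randn(batch_size, LATENT_DIM)
    for t in reversed(range(T_STEPS)):
        t_feat = torch.full((batch_size, 1), t / T_STEPS)
        eps = denoiser(torch.cat([x, t_feat], dim=-1))
        alpha, a_bar = 1.0 - betas[t], alphas_cumprod[t]
        x = (x - (1 - alpha) / (1 - a_bar).sqrt() * eps) / alpha.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x                                            # a VAE/VQ decoder would map this back to poses

loss = training_loss(torch.randn(4, LATENT_DIM))
latent = sample()
print(loss.item(), latent.shape)
```

In MLD the latent comes from a motion VAE and the denoiser is conditioned on text or action inputs; in UDE the diffusion decoder is conditioned on the predicted motion tokens. The toy example omits that conditioning to stay self-contained.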
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.