Generating Holistic 3D Human Motion from Speech
- URL: http://arxiv.org/abs/2212.04420v2
- Date: Sat, 17 Jun 2023 22:23:13 GMT
- Title: Generating Holistic 3D Human Motion from Speech
- Authors: Hongwei Yi, Hualin Liang, Yifei Liu, Qiong Cao, Yandong Wen, Timo
Bolkart, Dacheng Tao, Michael J. Black
- Abstract summary: We build a high-quality dataset of 3D holistic body meshes with synchronous speech.
We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately.
- Score: 97.11392166257791
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work addresses the problem of generating 3D holistic body motions from
human speech. Given a speech recording, we synthesize sequences of 3D body
poses, hand gestures, and facial expressions that are realistic and diverse. To
achieve this, we first build a high-quality dataset of 3D holistic body meshes
with synchronous speech. We then define a novel speech-to-motion generation
framework in which the face, body, and hands are modeled separately. The
separated modeling stems from the fact that face articulation strongly
correlates with human speech, while body poses and hand gestures are less
correlated. Specifically, we employ an autoencoder for face motions, and a
compositional vector-quantized variational autoencoder (VQ-VAE) for the body
and hand motions. The compositional VQ-VAE is key to generating diverse
results. Additionally, we propose a cross-conditional autoregressive model that
generates body poses and hand gestures, leading to coherent and realistic
motions. Extensive experiments and user studies demonstrate that our proposed
approach achieves state-of-the-art performance both qualitatively and
quantitatively. Our novel dataset and code will be released for research
purposes at https://talkshow.is.tue.mpg.de.
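To make the separated modeling concrete, below is a minimal PyTorch sketch of the idea, not the authors' released code (that will be available at https://talkshow.is.tue.mpg.de). It pairs one vector-quantized branch per body part (body, hands), each with its own codebook, with a cross-conditional autoregressive prior that predicts the next body and hand tokens from the past tokens of both streams plus speech features; the face branch is omitted because, per the abstract, it is a plain autoencoder driven by speech. All module names, layer choices (MLP encoders, a GRU prior), and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""
    def __init__(self, num_codes=256, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                               # z: (batch, time, dim)
        flat = z.reshape(-1, z.shape[-1])
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        idx = idx.reshape(z.shape[:-1])                  # discrete token ids
        z_q = self.codebook(idx)
        return z + (z_q - z).detach(), idx               # straight-through estimator


class PartVQVAE(nn.Module):
    """One VQ-VAE branch (e.g. body or hands); the branches compose the pose."""
    def __init__(self, pose_dim, dim=64, num_codes=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(pose_dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))
        self.quant = VectorQuantizer(num_codes, dim)
        self.dec = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, pose_dim))

    def forward(self, pose):                             # (batch, time, pose_dim)
        z_q, idx = self.quant(self.enc(pose))
        return self.dec(z_q), idx


class CrossConditionalPrior(nn.Module):
    """Predicts the next body and hand tokens; each stream is conditioned on
    the past tokens of *both* streams plus per-frame speech features."""
    def __init__(self, num_codes=256, dim=64, audio_dim=128):
        super().__init__()
        self.body_emb = nn.Embedding(num_codes, dim)
        self.hand_emb = nn.Embedding(num_codes, dim)
        self.audio_proj = nn.Linear(audio_dim, dim)
        self.gru = nn.GRU(3 * dim, dim, batch_first=True)
        self.body_head = nn.Linear(dim, num_codes)
        self.hand_head = nn.Linear(dim, num_codes)

    def forward(self, body_idx, hand_idx, audio):        # (B,T), (B,T), (B,T,A)
        x = torch.cat([self.body_emb(body_idx),
                       self.hand_emb(hand_idx),
                       self.audio_proj(audio)], dim=-1)
        h, _ = self.gru(x)                               # causal summary of the past
        return self.body_head(h), self.hand_head(h)      # next-token logits


if __name__ == "__main__":
    B, T = 2, 30
    body_vq, hand_vq = PartVQVAE(pose_dim=63), PartVQVAE(pose_dim=90)  # dims assumed
    _, body_idx = body_vq(torch.randn(B, T, 63))
    _, hand_idx = hand_vq(torch.randn(B, T, 90))
    logits_body, logits_hand = CrossConditionalPrior()(body_idx, hand_idx,
                                                       torch.randn(B, T, 128))
    print(logits_body.shape, logits_hand.shape)          # torch.Size([2, 30, 256]) x2
```

In a setup like this, diversity at test time would come from sampling the discrete body and hand tokens from the predicted logits rather than taking the argmax, which is consistent with the abstract's claim that the compositional VQ-VAE is key to generating diverse results.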
Related papers
- EmoVOCA: Speech-Driven Emotional 3D Talking Heads [12.161006152509653]
We propose an innovative data-driven technique for creating a synthetic dataset, called EmoVOCA.
We then designed and trained an emotional 3D talking head generator that accepts a 3D face, an audio file, an emotion label, and an intensity value as inputs, and learns to animate audio-synchronized lip movements with expressive facial traits.
arXiv Detail & Related papers (2024-03-19T16:33:26Z) - DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with
Diffusion [68.85904927374165]
We propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis method.
It captures the complex one-to-many relationship between speech and 3D faces through diffusion.
It also achieves more realistic facial animation than state-of-the-art methods.
arXiv Detail & Related papers (2023-08-23T04:14:55Z) - HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes [54.61610144668777]
We present a novel scene-and-language conditioned generative model that can produce 3D human motions in 3D scenes.
Our experiments demonstrate that our model generates diverse and semantically consistent human motions in 3D scenes.
arXiv Detail & Related papers (2022-10-18T10:14:11Z) - A Novel Speech-Driven Lip-Sync Model with CNN and LSTM [12.747541089354538]
We present a deep neural network combining one-dimensional convolutions and an LSTM to generate displacements of a 3D template face model from variable-length speech input.
To enhance the robustness of the network to different sound signals, we adapt a trained speech recognition model to extract speech features.
We show that our model is able to generate smooth and natural lip movements synchronized with speech.
arXiv Detail & Related papers (2022-05-02T13:57:50Z) - TEMOS: Generating diverse human motions from textual descriptions [53.85978336198444]
We address the problem of generating diverse 3D human motions from textual descriptions.
We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data.
We show that the TEMOS framework can produce both skeleton-based animations, as in prior work, and more expressive SMPL body motions.
arXiv Detail & Related papers (2022-04-25T14:53:06Z) - Learning Speech-driven 3D Conversational Gestures from Video [106.15628979352738]
We propose the first approach to automatically and jointly synthesize synchronous 3D conversational body and hand gestures.
Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures.
We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people.
arXiv Detail & Related papers (2021-02-13T01:05:39Z) - Body2Hands: Learning to Infer 3D Hands from Conversational Gesture Body
Dynamics [87.17505994436308]
We build upon the insight that body motion and hand gestures are strongly correlated in non-verbal communication settings.
We formulate the learning of this prior as a prediction task of 3D hand shape over time given body motion input alone.
Our hand prediction model produces convincing 3D hand gestures given only the 3D motion of the speaker's arms as input.
arXiv Detail & Related papers (2020-07-23T22:58:15Z) - Speech2Video Synthesis with 3D Skeleton Regularization and Expressive
Body Poses [36.00309828380724]
We propose a novel approach to convert given speech audio to a photo-realistic speaking video of a specific person.
We achieve this by first generating 3D skeleton movements from the audio sequence using a recurrent neural network (RNN).
To make the skeleton movement realistic and expressive, we embed the knowledge of an articulated 3D human skeleton and a learned dictionary of personal speech iconic gestures into the generation process.
arXiv Detail & Related papers (2020-07-17T19:30:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.