Learning Speech-driven 3D Conversational Gestures from Video
- URL: http://arxiv.org/abs/2102.06837v1
- Date: Sat, 13 Feb 2021 01:05:39 GMT
- Title: Learning Speech-driven 3D Conversational Gestures from Video
- Authors: Ikhsanul Habibie, Weipeng Xu, Dushyant Mehta, Lingjie Liu, Hans-Peter
Seidel, Gerard Pons-Moll, Mohamed Elgharib, Christian Theobalt
- Abstract summary: We propose the first approach to automatically and jointly synthesize both the synchronous 3D conversational body and hand gestures.
Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures.
We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people.
- Score: 106.15628979352738
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose the first approach to automatically and jointly synthesize both
the synchronous 3D conversational body and hand gestures, as well as 3D face
and head animations, of a virtual character from speech input. Our algorithm
uses a CNN architecture that leverages the inherent correlation between facial
expression and hand gestures. Synthesis of conversational body gestures is a
multi-modal problem since many similar gestures can plausibly accompany the
same input speech. To synthesize plausible body gestures in this setting, we
train a Generative Adversarial Network (GAN) based model that measures the
plausibility of the generated sequences of 3D body motion when paired with the
input audio features. We also contribute a new way to create a large corpus of
more than 33 hours of annotated body, hand, and face data from in-the-wild
videos of talking people. To this end, we apply state-of-the-art monocular
approaches for 3D body and hand pose estimation as well as dense 3D face
performance capture to the video corpus. In this way, we can train on orders of
magnitude more data than previous algorithms that resort to complex in-studio
motion capture solutions, and thereby train more expressive synthesis
algorithms. Our experiments and user study show the state-of-the-art quality of
our speech-synthesized full 3D character animations.
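The abstract outlines the core architecture at a high level: a convolutional generator driven by audio features, and a GAN-based discriminator that scores how plausible a generated sequence of 3D body motion is when paired with the input audio features. Below is a minimal, hypothetical PyTorch sketch of that audio-conditioned GAN pairing; the feature dimensions, layer choices, and names (AUDIO_DIM, POSE_DIM, Generator, Discriminator) are illustrative assumptions, not the authors' actual network.
```python
# Minimal, hypothetical sketch of the audio-conditioned GAN idea described in the
# abstract: a generator maps per-frame audio features to a 3D body-pose sequence,
# and a discriminator scores the plausibility of (audio, motion) pairs.
# Sizes and layer choices are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn

AUDIO_DIM = 64    # assumed per-frame audio feature size (e.g. MFCC + energy)
POSE_DIM = 63     # assumed 21 body joints x 3 coordinates per frame

class Generator(nn.Module):
    """1D CNN over time: audio features -> 3D pose sequence."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(AUDIO_DIM, 256, kernel_size=5, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(256, 256, kernel_size=5, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(256, POSE_DIM, kernel_size=5, padding=2),
        )

    def forward(self, audio):               # audio: (batch, T, AUDIO_DIM)
        x = audio.transpose(1, 2)           # -> (batch, AUDIO_DIM, T)
        return self.net(x).transpose(1, 2)  # -> (batch, T, POSE_DIM)

class Discriminator(nn.Module):
    """Scores how plausible a motion sequence is when paired with the audio."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(AUDIO_DIM + POSE_DIM, 256, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(256, 256, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(256, 1),              # one plausibility score per sequence
        )

    def forward(self, audio, pose):         # both: (batch, T, feature_dim)
        x = torch.cat([audio, pose], dim=-1).transpose(1, 2)
        return self.net(x)

if __name__ == "__main__":
    audio = torch.randn(2, 120, AUDIO_DIM)  # 2 clips, 120 frames each
    fake_pose = Generator()(audio)
    score = Discriminator()(audio, fake_pose)
    print(fake_pose.shape, score.shape)     # (2, 120, 63), (2, 1)
```
In the paper's setting the synthesized output also covers hand gestures and 3D face and head animation; this sketch shows only a single body-pose branch to illustrate how the discriminator couples generated motion with the input audio features.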
Related papers
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z) - EmoVOCA: Speech-Driven Emotional 3D Talking Heads [12.161006152509653]
We propose an innovative data-driven technique for creating a synthetic dataset, called EmoVOCA.
We then designed and trained an emotional 3D talking head generator that accepts a 3D face, an audio file, an emotion label, and an intensity value as inputs, and learns to animate audio-synchronized lip movements together with the expressive traits of the face.
arXiv Detail & Related papers (2024-03-19T16:33:26Z) - BodyFormer: Semantics-guided 3D Body Gesture Synthesis with Transformer [42.87095473590205]
We propose a novel framework for automatic 3D body gesture synthesis from speech.
Our system is trained with either the Trinity speech-gesture dataset or the Talking With Hands 16.2M dataset.
The results show that our system can produce more realistic, appropriate, and diverse body gestures compared to existing state-of-the-art approaches.
arXiv Detail & Related papers (2023-09-07T01:11:11Z) - DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with
Diffusion [68.85904927374165]
We propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis.
It captures the complex one-to-many relationship between speech and 3D faces through diffusion, while simultaneously achieving more realistic facial animation than state-of-the-art methods.
arXiv Detail & Related papers (2023-08-23T04:14:55Z) - Generating Holistic 3D Human Motion from Speech [97.11392166257791]
We build a high-quality dataset of 3D holistic body meshes with synchronous speech.
We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately.
arXiv Detail & Related papers (2022-12-08T17:25:19Z) - Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z) - Body2Hands: Learning to Infer 3D Hands from Conversational Gesture Body
Dynamics [87.17505994436308]
We build upon the insight that body motion and hand gestures are strongly correlated in non-verbal communication settings.
We formulate the learning of this prior as a prediction task of 3D hand shape over time given body motion input alone.
Our hand prediction model produces convincing 3D hand gestures given only the 3D motion of the speaker's arms as input.
arXiv Detail & Related papers (2020-07-23T22:58:15Z) - Speech2Video Synthesis with 3D Skeleton Regularization and Expressive
Body Poses [36.00309828380724]
We propose a novel approach to convert given speech audio to a photo-realistic speaking video of a specific person.
We achieve this by first generating 3D skeleton movements from the audio sequence using a recurrent neural network (RNN); a minimal sketch of this audio-to-skeleton step appears after this list.
To make the skeleton movement realistic and expressive, we embed the knowledge of an articulated 3D human skeleton and a learned dictionary of personal speech iconic gestures into the generation process.
arXiv Detail & Related papers (2020-07-17T19:30:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.