3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head
- URL: http://arxiv.org/abs/2104.12051v1
- Date: Sun, 25 Apr 2021 02:48:19 GMT
- Title: 3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head
- Authors: Qianyun Wang, Zhenfeng Fan, Shihong Xia
- Abstract summary: We introduce 3D-TalkEmo, a deep neural network that generates 3D talking head animation with various emotions.
We also create a large 3D dataset with synchronized audio and video, a rich corpus, and various emotion states of different persons.
- Score: 13.305263646852087
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Impressive progress has been made in audio-driven 3D facial animation
recently, but synthesizing a 3D talking head with rich emotion remains unsolved.
This is due to the lack of 3D generative models and of 3D emotional datasets
with synchronized audio. To address this, we introduce 3D-TalkEmo, a deep
neural network that generates 3D talking-head animation with various emotions.
We also create a large 3D dataset with synchronized audio and video, a rich
corpus, and various emotion states of different persons, built with
sophisticated 3D face reconstruction methods. In the emotion generation
network, we propose a novel 3D face representation, the geometry map, obtained
by classical multi-dimensional scaling (MDS) analysis (a minimal MDS sketch
follows the abstract). It maps the coordinates of the vertices of a 3D face to
a canonical image plane while preserving the vertex-to-vertex geodesic
distance metric in a least-squares sense. This maintains the adjacency
relationships between vertices and provides an effective convolutional
structure for the 3D facial surface. Taking a neutral 3D mesh and a speech
signal as inputs, 3D-TalkEmo generates vivid facial animations; moreover, it
allows changing the emotion state of the animated speaker.
We present extensive quantitative and qualitative evaluations of our method,
in addition to user studies, demonstrating that the generated talking heads
are of significantly higher quality than those of previous state-of-the-art
methods.
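The geometry map described in the abstract is built with classical multi-dimensional
scaling (MDS) on the mesh's geodesic distances, and the resulting planar coordinates
let per-vertex attributes be stored as an image that ordinary 2D convolutions can
process. The following is a minimal NumPy/SciPy sketch of that idea, under stated
assumptions: geodesics are approximated by Dijkstra shortest paths along mesh edges,
and the nearest-pixel rasterization is an illustrative simplification; the function
names and details are not taken from the authors' code.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import dijkstra


def geodesic_distances(vertices, faces):
    """Approximate pairwise geodesic distances by shortest paths on the edge graph.

    vertices: (N, 3) float array; faces: (F, 3) int array of triangle indices.
    (The paper does not specify its geodesic solver; edge-graph Dijkstra is a
    simple, common approximation.)
    """
    edges = np.vstack([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]])
    edges = np.unique(np.sort(edges, axis=1), axis=0)      # unique undirected edges
    weights = np.linalg.norm(vertices[edges[:, 0]] - vertices[edges[:, 1]], axis=1)
    n = len(vertices)
    graph = coo_matrix((weights, (edges[:, 0], edges[:, 1])), shape=(n, n))
    return dijkstra(graph, directed=False)                 # dense (N, N) distances


def geometry_map_coords(geo_dist):
    """Classical MDS: embed vertices in a 2D plane that preserves the geodesic
    metric in a least-squares sense (the canonical 'geometry map' plane)."""
    n = geo_dist.shape[0]
    d2 = geo_dist ** 2
    j = np.eye(n) - np.ones((n, n)) / n                    # centering matrix
    b = -0.5 * j @ d2 @ j                                  # double-centered Gram matrix
    evals, evecs = np.linalg.eigh(b)                       # ascending eigenvalues
    top = np.argsort(evals)[::-1][:2]                      # two largest eigenpairs
    return evecs[:, top] * np.sqrt(np.maximum(evals[top], 0.0))   # (N, 2) coords


def rasterize_geometry_map(coords_2d, vertex_attrs, size=256):
    """Scatter per-vertex attributes (e.g. xyz positions or displacements) onto a
    size x size image using the MDS coordinates, yielding a geometry-map image."""
    lo, hi = coords_2d.min(axis=0), coords_2d.max(axis=0)
    uv = (coords_2d - lo) / (hi - lo + 1e-8)               # normalize to [0, 1]
    px = np.clip((uv * (size - 1)).round().astype(int), 0, size - 1)
    image = np.zeros((size, size, vertex_attrs.shape[1]), dtype=np.float32)
    image[px[:, 1], px[:, 0]] = vertex_attrs               # nearest-pixel scatter
    return image
```

For a realistic face mesh the dense N x N distance matrix is memory-heavy, so a
landmark-based or sparse MDS variant would likely be needed in practice; the dense
form is kept here only for clarity.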
Related papers
- MMHead: Towards Fine-grained Multi-modal 3D Facial Animation [68.04052669266174]
We construct a large-scale multi-modal 3D facial animation dataset, MMHead.
MMHead consists of 49 hours of 3D facial motion sequences, speech audio, and rich hierarchical text annotations.
Based on the MMHead dataset, we establish benchmarks for two new tasks: text-induced 3D talking head animation and text-to-3D facial motion generation.
arXiv Detail & Related papers (2024-10-10T09:37:01Z)
- EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head [30.138347111341748]
We present a novel approach for synthesizing 3D talking heads with controllable emotion.
Our model enables controllable emotion in the generated talking heads, which can be rendered from a wide range of views.
Experiments demonstrate the effectiveness of our approach in generating high-fidelity and emotion-controllable 3D talking heads.
arXiv Detail & Related papers (2024-08-01T05:46:57Z)
- EmoVOCA: Speech-Driven Emotional 3D Talking Heads [12.161006152509653]
We propose an innovative data-driven technique for creating a synthetic dataset, called EmoVOCA.
We then designed and trained an emotional 3D talking head generator that accepts a 3D face, an audio file, an emotion label, and an intensity value as inputs, and learns to animate the audio-synchronized lip movements with expressive traits of the face.
arXiv Detail & Related papers (2024-03-19T16:33:26Z)
- Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis [88.17520303867099]
One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio.
We present Real3D-Portrait, a framework that improves the one-shot 3D reconstruction power with a large image-to-plane model.
Experiments show that Real3D-Portrait generalizes well to unseen identities and generates more realistic talking portrait videos.
arXiv Detail & Related papers (2024-01-16T17:04:30Z)
- DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion [68.85904927374165]
We propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis.
It captures the complex one-to-many relationships between speech and 3D face based on diffusion.
It simultaneously achieves more realistic facial animation than the state-of-the-art methods.
arXiv Detail & Related papers (2023-08-23T04:14:55Z)
- EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation [28.964917860664492]
Speech-driven 3D face animation aims to generate realistic facial expressions that match the speech content and emotion.
This paper proposes an end-to-end neural network to disentangle different emotions in speech so as to generate rich 3D facial expressions.
Our approach outperforms state-of-the-art methods and exhibits more diverse facial movements.
arXiv Detail & Related papers (2023-03-20T13:22:04Z)
- SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation [33.651156455111916]
We present SadTalker, which generates 3D motion coefficients (head pose, expression) of the 3DMM from audio.
Specifically, we present ExpNet to learn accurate facial expressions from audio by distilling both coefficients and 3D-rendered faces.
arXiv Detail & Related papers (2022-11-22T11:35:07Z)
- AniFaceGAN: Animatable 3D-Aware Face Image Generation for Video Avatars [71.00322191446203]
2D generative models often suffer from undesirable artifacts when rendering images from different camera viewpoints.
Recently, 3D-aware GANs extend 2D GANs for explicit disentanglement of camera pose by leveraging 3D scene representations.
We propose an animatable 3D-aware GAN for multiview consistent face animation generation.
arXiv Detail & Related papers (2022-10-12T17:59:56Z)
- Learning Speech-driven 3D Conversational Gestures from Video [106.15628979352738]
We propose the first approach to automatically and jointly synthesize both the synchronous 3D conversational body and hand gestures.
Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures.
We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people.
arXiv Detail & Related papers (2021-02-13T01:05:39Z)
- Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.