Audio-Driven 3D Facial Animation from In-the-Wild Videos
- URL: http://arxiv.org/abs/2306.11541v1
- Date: Tue, 20 Jun 2023 13:53:05 GMT
- Title: Audio-Driven 3D Facial Animation from In-the-Wild Videos
- Authors: Liying Lu, Tianke Zhang, Yunfei Liu, Xuangeng Chu, Yu Li
- Abstract summary: Given an arbitrary audio clip, audio-driven 3D facial animation aims to generate lifelike lip motions and facial expressions for a 3D head.
Existing methods typically rely on training their models using limited public 3D datasets that contain a restricted number of audio-3D scan pairs.
We propose a novel method that leverages in-the-wild 2D talking-head videos to train our 3D facial animation model.
- Score: 16.76533748243908
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given an arbitrary audio clip, audio-driven 3D facial animation aims to
generate lifelike lip motions and facial expressions for a 3D head. Existing
methods typically rely on training their models using limited public 3D
datasets that contain a restricted number of audio-3D scan pairs. Consequently,
their generalization capability remains limited. In this paper, we propose a
novel method that leverages in-the-wild 2D talking-head videos to train our 3D
facial animation model. The abundance of easily accessible 2D talking-head
videos equips our model with a robust generalization capability. By combining
these videos with existing 3D face reconstruction methods, our model excels in
generating consistent and high-fidelity lip synchronization. Additionally, our
model proficiently captures the speaking styles of different individuals,
allowing it to generate 3D talking-heads with distinct personal styles.
Extensive qualitative and quantitative experimental results demonstrate the
superiority of our method.
Related papers
- MMHead: Towards Fine-grained Multi-modal 3D Facial Animation [68.04052669266174]
We construct a large-scale multi-modal 3D facial animation dataset, MMHead.
MMHead consists of 49 hours of 3D facial motion sequences, speech audio, and rich hierarchical text annotations.
Based on the MMHead dataset, we establish benchmarks for two new tasks: text-induced 3D talking head animation and text-to-3D facial motion generation.
arXiv Detail & Related papers (2024-10-10T09:37:01Z)
- NeRFFaceSpeech: One-shot Audio-driven 3D Talking Head Synthesis via Generative Prior [5.819784482811377]
We propose a novel method, NeRFFaceSpeech, which can produce high-quality 3D-aware talking heads.
Our method can craft a 3D-consistent facial feature space corresponding to a single image.
We also introduce LipaintNet that can replenish the lacking information in the inner-mouth area.
arXiv Detail & Related papers (2024-05-09T13:14:06Z)
- EmoVOCA: Speech-Driven Emotional 3D Talking Heads [12.161006152509653]
We propose an innovative data-driven technique for creating a synthetic dataset, called EmoVOCA.
We then designed and trained an emotional 3D talking head generator that accepts a 3D face, an audio file, an emotion label, and an intensity value as inputs, and learns to animate the audio-synchronized lip movements with expressive traits of the face.
arXiv Detail & Related papers (2024-03-19T16:33:26Z)
- Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance [41.692420421029695]
We introduce GNPFA (Generalized Neural Parametric Facial Asset), an efficient variational auto-encoder mapping facial geometry and images to a highly generalized expression latent space.
We then use GNPFA to extract high-quality expressions and accurate head poses from a large array of videos.
We propose Media2Face, a diffusion model in GNPFA latent space for co-speech facial animation generation.
arXiv Detail & Related papers (2024-01-28T16:17:59Z)
- Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis [88.17520303867099]
One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio.
We present Real3D-Portrait, a framework that improves the one-shot 3D reconstruction power with a large image-to-plane model.
Experiments show that Real3D-Portrait generalizes well to unseen identities and generates more realistic talking portrait videos.
arXiv Detail & Related papers (2024-01-16T17:04:30Z)
- AniPortraitGAN: Animatable 3D Portrait Generation from 2D Image Collections [78.81539337399391]
We present an animatable 3D-aware GAN that generates portrait images with controllable facial expression, head pose, and shoulder movements.
It is a generative model trained on unstructured 2D image collections without using 3D or video data.
A dual-camera rendering and adversarial learning scheme is proposed to improve the quality of the generated faces.
arXiv Detail & Related papers (2023-09-05T12:44:57Z)
- DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion [68.85904927374165]
We propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis method.
It captures the complex one-to-many relationships between speech and 3D face based on diffusion.
It simultaneously achieves more realistic facial animation than the state-of-the-art methods.
arXiv Detail & Related papers (2023-08-23T04:14:55Z)
- SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation [33.651156455111916]
We present SadTalker, which generates 3D motion coefficients (head pose, expression) of the 3DMM from audio.
Precisely, we present ExpNet to learn the accurate facial expression from audio by distilling both coefficients and 3D-rendered faces.
arXiv Detail & Related papers (2022-11-22T11:35:07Z)
- Learning Speech-driven 3D Conversational Gestures from Video [106.15628979352738]
We propose the first approach to automatically and jointly synthesize both the synchronous 3D conversational body and hand gestures.
Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures.
We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people.
arXiv Detail & Related papers (2021-02-13T01:05:39Z)
- Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.