MMFace4D: A Large-Scale Multi-Modal 4D Face Dataset for Audio-Driven 3D
Face Animation
- URL: http://arxiv.org/abs/2303.09797v2
- Date: Wed, 13 Dec 2023 11:34:54 GMT
- Title: MMFace4D: A Large-Scale Multi-Modal 4D Face Dataset for Audio-Driven 3D
Face Animation
- Authors: Haozhe Wu, Jia Jia, Junliang Xing, Hongwei Xu, Xiangyuan Wang, Jelo
Wang
- Abstract summary: We propose MMFace4D, a large-scale multi-modal 4D (3D sequence) face dataset consisting of 431 identities, 35,904 sequences, and 3.9 million frames.
MMFace4D exhibits two compelling characteristics: 1) a remarkably diverse set of subjects and corpus, encompassing actors spanning ages 15 to 68 and recorded sentences with durations ranging from 0.7 to 11.4 seconds; 2) synchronized audio and 3D mesh sequences with high-resolution face details.
Upon MMFace4D, we construct a non-autoregressive framework for audio-driven 3D face animation. The framework considers the regional and composite natures of facial animations and surpasses contemporary state-of-the-art approaches both qualitatively and quantitatively.
- Score: 16.989858343787365
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-Driven Face Animation is an eagerly anticipated technique for
applications such as VR/AR, games, and movie making. With the rapid development
of 3D engines, there is an increasing demand for driving 3D faces with audio.
However, currently available 3D face animation datasets are either limited in
scale or unsatisfactory in quality, which hampers further development of
audio-driven 3D face animation. To address this challenge, we propose MMFace4D,
a large-scale multi-modal 4D (3D sequence) face dataset consisting of 431
identities, 35,904 sequences, and 3.9 million frames. MMFace4D exhibits two
compelling characteristics: 1) a remarkably diverse set of subjects and corpus,
encompassing actors spanning ages 15 to 68, and recorded sentences with
durations ranging from 0.7 to 11.4 seconds. 2) It features synchronized audio
and 3D mesh sequences with high-resolution face details. To capture the subtle
nuances of 3D facial expressions, we leverage three synchronized RGBD cameras
during the recording process. Upon MMFace4D, we construct a non-autoregressive
framework for audio-driven 3D face animation. Our framework considers the
regional and composite natures of facial animations, and surpasses contemporary
state-of-the-art approaches both qualitatively and quantitatively. The code,
model, and dataset will be publicly available.
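The abstract describes the animation framework only at a high level. As a rough illustration of what a non-autoregressive audio-driven animator can look like, the sketch below decodes every mesh frame in parallel from per-frame audio features and adds the predicted offsets to a neutral template mesh. All details here (mel-spectrogram input, hidden sizes, a FLAME-like vertex count of 5023, a plain transformer encoder) are illustrative assumptions and not taken from the paper, which additionally models the regional and composite natures of facial motion.

```python
# Minimal sketch of a non-autoregressive audio-to-3D-face-animation model.
# All architectural choices (feature dims, vertex count, audio frontend) are
# illustrative assumptions; the MMFace4D framework may differ substantially.
import torch
import torch.nn as nn


class NonAutoregressiveFaceAnimator(nn.Module):
    def __init__(self, audio_dim=80, hidden_dim=256, num_vertices=5023):
        super().__init__()
        # Project per-frame audio features (e.g. mel-spectrogram frames) to the model width.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        # Bidirectional temporal encoder: every output frame attends to the whole
        # utterance at once, so no frame-by-frame autoregression is needed.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=4, dim_feedforward=512, batch_first=True
        )
        self.temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # Decode per-vertex 3D offsets from a neutral template mesh for each frame.
        self.offset_head = nn.Linear(hidden_dim, num_vertices * 3)

    def forward(self, audio_feats, template_vertices):
        # audio_feats: (batch, frames, audio_dim), aligned to the mesh frame rate
        # template_vertices: (batch, num_vertices, 3) neutral-pose mesh
        h = self.audio_proj(audio_feats)
        h = self.temporal_encoder(h)                          # (batch, frames, hidden_dim)
        offsets = self.offset_head(h)                         # (batch, frames, num_vertices * 3)
        offsets = offsets.reshape(*offsets.shape[:2], -1, 3)  # (batch, frames, num_vertices, 3)
        # Animated mesh sequence = template + predicted per-frame offsets.
        return template_vertices.unsqueeze(1) + offsets


if __name__ == "__main__":
    model = NonAutoregressiveFaceAnimator()
    audio = torch.randn(1, 120, 80)     # ~4 s of audio features at 30 fps
    template = torch.randn(1, 5023, 3)  # neutral template mesh
    animation = model(audio, template)
    print(animation.shape)              # torch.Size([1, 120, 5023, 3])
```

In this non-autoregressive setup the transformer sees the whole utterance at once, so a full sequence of frames is produced in a single forward pass rather than one frame at a time.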
Related papers
- MMHead: Towards Fine-grained Multi-modal 3D Facial Animation [68.04052669266174]
We construct a large-scale multi-modal 3D facial animation dataset, MMHead.
MMHead consists of 49 hours of 3D facial motion sequences, speech audios, and rich hierarchical text annotations.
Based on the MMHead dataset, we establish benchmarks for two new tasks: text-induced 3D talking head animation and text-to-3D facial motion generation.
arXiv Detail & Related papers (2024-10-10T09:37:01Z)
- Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance [41.692420421029695]
We introduce Generalized Neural Parametric Facial Asset (GNPFA), an efficient variational auto-encoder mapping facial geometry and images to a highly generalized expression latent space.
We then use GNPFA to extract high-quality expressions and accurate head poses from a large array of videos.
We propose Media2Face, a diffusion model in GNPFA latent space for co-speech facial animation generation.
arXiv Detail & Related papers (2024-01-28T16:17:59Z)
- Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis [88.17520303867099]
One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio.
We present Real3D-Portrait, a framework that improves one-shot 3D reconstruction with a large image-to-plane model.
Experiments show that Real3D-Portrait generalizes well to unseen identities and generates more realistic talking portrait videos.
arXiv Detail & Related papers (2024-01-16T17:04:30Z)
- DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion [68.85904927374165]
We propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis method.
It captures the complex one-to-many relationships between speech and 3D faces based on diffusion.
At the same time, it achieves more realistic facial animation than state-of-the-art methods.
arXiv Detail & Related papers (2023-08-23T04:14:55Z)
- Audio-Driven 3D Facial Animation from In-the-Wild Videos [16.76533748243908]
Given an arbitrary audio clip, audio-driven 3D facial animation aims to generate lifelike lip motions and facial expressions for a 3D head.
Existing methods typically rely on training their models using limited public 3D datasets that contain a restricted number of audio-3D scan pairs.
We propose a novel method that leverages in-the-wild 2D talking-head videos to train our 3D facial animation model.
arXiv Detail & Related papers (2023-06-20T13:53:05Z)
- AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction [33.78412925549308]
3D face reconstruction from 2D images is an under-constrained problem due to the ambiguity of depth.
We propose AVFace, which incorporates both audio and visual modalities and accurately reconstructs the 4D facial and lip motion of any speaker.
arXiv Detail & Related papers (2023-04-25T19:41:10Z)
- 3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head [13.305263646852087]
We introduce 3D-TalkEmo, a deep neural network that generates 3D talking head animation with various emotions.
We also create a large 3D dataset with synchronized audio and video, a rich corpus, and various emotional states of different persons.
arXiv Detail & Related papers (2021-04-25T02:48:19Z)
- MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement [142.9900055577252]
We propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face.
Our approach ensures highly accurate lip motion while also producing plausible animation of the parts of the face that are uncorrelated with the audio signal, such as eye blinks and eyebrow motion.
arXiv Detail & Related papers (2021-04-16T17:05:40Z)
- Learning Speech-driven 3D Conversational Gestures from Video [106.15628979352738]
We propose the first approach to automatically and jointly synthesize both the synchronous 3D conversational body and hand gestures.
Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures.
We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people.
arXiv Detail & Related papers (2021-02-13T01:05:39Z)
- Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.