Media2Face: Co-speech Facial Animation Generation With Multi-Modality
Guidance
- URL: http://arxiv.org/abs/2401.15687v2
- Date: Tue, 30 Jan 2024 08:23:23 GMT
- Title: Media2Face: Co-speech Facial Animation Generation With Multi-Modality
Guidance
- Authors: Qingcheng Zhao, Pengyu Long, Qixuan Zhang, Dafei Qin, Han Liang,
Longwen Zhang, Yingliang Zhang, Jingyi Yu, Lan Xu
- Abstract summary: We introduce an efficient variational auto-encoder mapping facial geometry and images to a highly generalized expression latent space.
We then use GNPFA to extract high-quality expressions and accurate head poses from a large array of videos.
We propose Media2Face, a diffusion model in GNPFA latent space for co-speech facial animation generation.
- Score: 41.692420421029695
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The synthesis of 3D facial animations from speech has garnered considerable
attention. Due to the scarcity of high-quality 4D facial data and
well-annotated abundant multi-modality labels, previous methods often suffer
from limited realism and a lack of lexible conditioning. We address this
challenge through a trilogy. We first introduce Generalized Neural Parametric
Facial Asset (GNPFA), an efficient variational auto-encoder mapping facial
geometry and images to a highly generalized expression latent space, decoupling
expressions and identities. Then, we utilize GNPFA to extract high-quality
expressions and accurate head poses from a large array of videos. This presents
the M2F-D dataset, a large, diverse, and scan-level co-speech 3D facial
animation dataset with well-annotated emotional and style labels. Finally, we
propose Media2Face, a diffusion model in GNPFA latent space for co-speech
facial animation generation, accepting rich multi-modality guidances from
audio, text, and image. Extensive experiments demonstrate that our model not
only achieves high fidelity in facial animation synthesis but also broadens the
scope of expressiveness and style adaptability in 3D facial animation.
Related papers
- JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation [10.003794924759765]
JoyVASA is a diffusion-based method for generating facial dynamics and head motion in audio-driven facial animation.
We introduce a decoupled facial representation framework that separates dynamic facial expressions from static 3D facial representations.
In the second stage, a diffusion transformer is trained to generate motion sequences directly from audio cues, independent of character identity.
arXiv Detail & Related papers (2024-11-14T06:13:05Z) - MMHead: Towards Fine-grained Multi-modal 3D Facial Animation [68.04052669266174]
We construct a large-scale multi-modal 3D facial animation dataset, MMHead.
MMHead consists of 49 hours of 3D facial motion sequences, speech audios, and rich hierarchical text annotations.
Based on the MMHead dataset, we establish benchmarks for two new tasks: text-induced 3D talking head animation and text-to-3D facial motion generation.
arXiv Detail & Related papers (2024-10-10T09:37:01Z) - Personalized Speech-driven Expressive 3D Facial Animation Synthesis with
Style Control [1.8540152959438578]
A realistic facial animation system should consider such identity-specific speaking styles and facial idiosyncrasies to achieve high-degree of naturalness and plausibility.
We present a speech-driven expressive 3D facial animation synthesis framework that models identity specific facial motion as latent representations (called as styles)
Our framework is trained in an end-to-end fashion and has a non-autoregressive encoder-decoder architecture with three main components.
arXiv Detail & Related papers (2023-10-25T21:22:28Z) - DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with
Diffusion [68.85904927374165]
We propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis.
It captures the complex one-to-many relationships between speech and 3D face based on diffusion.
It simultaneously achieves more realistic facial animation than the state-of-the-art methods.
arXiv Detail & Related papers (2023-08-23T04:14:55Z) - Audio-Driven Talking Face Generation with Diverse yet Realistic Facial
Animations [61.65012981435094]
DIRFA is a novel method that can generate talking faces with diverse yet realistic facial animations from the same driving audio.
To accommodate fair variation of plausible facial animations for the same audio, we design a transformer-based probabilistic mapping network.
We show that DIRFA can generate talking faces with realistic facial animations effectively.
arXiv Detail & Related papers (2023-04-18T12:36:15Z) - FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation
Synthesis Using Self-Supervised Speech Representation Learning [0.0]
FaceXHuBERT is a text-less speech-driven 3D facial animation generation method.
It is very robust to background noise and can handle audio recorded in a variety of situations.
It produces superior results with respect to the realism of the animation 78% of the time.
arXiv Detail & Related papers (2023-03-09T17:05:19Z) - MeshTalk: 3D Face Animation from Speech using Cross-Modality
Disentanglement [142.9900055577252]
We propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face.
Our approach ensures highly accurate lip motion, while also plausible animation of the parts of the face that are uncorrelated to the audio signal, such as eye blinks and eye brow motion.
arXiv Detail & Related papers (2021-04-16T17:05:40Z) - Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.