Media2Face: Co-speech Facial Animation Generation With Multi-Modality
Guidance
- URL: http://arxiv.org/abs/2401.15687v2
- Date: Tue, 30 Jan 2024 08:23:23 GMT
- Title: Media2Face: Co-speech Facial Animation Generation With Multi-Modality
Guidance
- Authors: Qingcheng Zhao, Pengyu Long, Qixuan Zhang, Dafei Qin, Han Liang,
Longwen Zhang, Yingliang Zhang, Jingyi Yu, Lan Xu
- Abstract summary: We introduce an efficient variational auto-encoder mapping facial geometry and images to a highly generalized expression latent space.
We then use GNPFA to extract high-quality expressions and accurate head poses from a large array of videos.
We propose Media2Face, a diffusion model in GNPFA latent space for co-speech facial animation generation.
- Score: 41.692420421029695
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The synthesis of 3D facial animations from speech has garnered considerable
attention. Due to the scarcity of high-quality 4D facial data and
well-annotated abundant multi-modality labels, previous methods often suffer
from limited realism and a lack of lexible conditioning. We address this
challenge through a trilogy. We first introduce Generalized Neural Parametric
Facial Asset (GNPFA), an efficient variational auto-encoder mapping facial
geometry and images to a highly generalized expression latent space, decoupling
expressions and identities. Then, we utilize GNPFA to extract high-quality
expressions and accurate head poses from a large array of videos. This presents
the M2F-D dataset, a large, diverse, and scan-level co-speech 3D facial
animation dataset with well-annotated emotional and style labels. Finally, we
propose Media2Face, a diffusion model in GNPFA latent space for co-speech
facial animation generation, accepting rich multi-modality guidances from
audio, text, and image. Extensive experiments demonstrate that our model not
only achieves high fidelity in facial animation synthesis but also broadens the
scope of expressiveness and style adaptability in 3D facial animation.
Related papers
- Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial
Animation [41.489700112318864]
Speech-driven 3D facial animation aims to synthesize vivid facial animations that accurately synchronize with speech and match the unique speaking style.
We introduce an innovative speaking style disentanglement method, which enables arbitrary-subject speaking style encoding.
We also propose a novel framework called textbfMimic to learn disentangled representations of the speaking style and content from facial motions.
arXiv Detail & Related papers (2023-12-18T01:49:42Z) - GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained
3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3d face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z) - 3DiFACE: Diffusion-based Speech-driven 3D Facial Animation and Editing [22.30870274645442]
We present 3DiFACE, a novel method for personalized speech-driven 3D facial animation and editing.
Our method outperforms existing state-of-the-art techniques and yields speech-driven animations with greater fidelity and diversity.
arXiv Detail & Related papers (2023-12-01T19:01:05Z) - Personalized Speech-driven Expressive 3D Facial Animation Synthesis with
Style Control [1.8540152959438578]
A realistic facial animation system should consider such identity-specific speaking styles and facial idiosyncrasies to achieve high-degree of naturalness and plausibility.
We present a speech-driven expressive 3D facial animation synthesis framework that models identity specific facial motion as latent representations (called as styles)
Our framework is trained in an end-to-end fashion and has a non-autoregressive encoder-decoder architecture with three main components.
arXiv Detail & Related papers (2023-10-25T21:22:28Z) - DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with
Diffusion [68.85904927374165]
We propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis.
It captures the complex one-to-many relationships between speech and 3D face based on diffusion.
It simultaneously achieves more realistic facial animation than the state-of-the-art methods.
arXiv Detail & Related papers (2023-08-23T04:14:55Z) - Audio-Driven Talking Face Generation with Diverse yet Realistic Facial
Animations [61.65012981435094]
DIRFA is a novel method that can generate talking faces with diverse yet realistic facial animations from the same driving audio.
To accommodate fair variation of plausible facial animations for the same audio, we design a transformer-based probabilistic mapping network.
We show that DIRFA can generate talking faces with realistic facial animations effectively.
arXiv Detail & Related papers (2023-04-18T12:36:15Z) - FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation
Synthesis Using Self-Supervised Speech Representation Learning [0.0]
FaceXHuBERT is a text-less speech-driven 3D facial animation generation method.
It is very robust to background noise and can handle audio recorded in a variety of situations.
It produces superior results with respect to the realism of the animation 78% of the time.
arXiv Detail & Related papers (2023-03-09T17:05:19Z) - MeshTalk: 3D Face Animation from Speech using Cross-Modality
Disentanglement [142.9900055577252]
We propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face.
Our approach ensures highly accurate lip motion, while also plausible animation of the parts of the face that are uncorrelated to the audio signal, such as eye blinks and eye brow motion.
arXiv Detail & Related papers (2021-04-16T17:05:40Z) - Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.