Related papers: Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance

URL: http://arxiv.org/abs/2401.15687v2
Date: Tue, 30 Jan 2024 08:23:23 GMT
Title: Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance
Authors: Qingcheng Zhao, Pengyu Long, Qixuan Zhang, Dafei Qin, Han Liang, Longwen Zhang, Yingliang Zhang, Jingyi Yu, Lan Xu
Abstract summary: We introduce Generalized Neural Parametric Facial Asset (GNPFA), an efficient variational auto-encoder mapping facial geometry and images to a highly generalized expression latent space. We then use GNPFA to extract high-quality expressions and accurate head poses from a large array of videos. We propose Media2Face, a diffusion model in GNPFA latent space for co-speech facial animation generation.
Score: 41.692420421029695
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The synthesis of 3D facial animations from speech has garnered considerable attention. Due to the scarcity of high-quality 4D facial data and well-annotated abundant multi-modality labels, previous methods often suffer from limited realism and a lack of flexible conditioning. We address this challenge through a trilogy. We first introduce Generalized Neural Parametric Facial Asset (GNPFA), an efficient variational auto-encoder mapping facial geometry and images to a highly generalized expression latent space, decoupling expressions and identities. Then, we utilize GNPFA to extract high-quality expressions and accurate head poses from a large array of videos. This presents the M2F-D dataset, a large, diverse, and scan-level co-speech 3D facial animation dataset with well-annotated emotional and style labels. Finally, we propose Media2Face, a diffusion model in GNPFA latent space for co-speech facial animation generation, accepting rich multi-modality guidance from audio, text, and image. Extensive experiments demonstrate that our model not only achieves high fidelity in facial animation synthesis but also broadens the scope of expressiveness and style adaptability in 3D facial animation.

Related papers

Learning Disentangled Speech- and Expression-Driven Blendshapes for 3D Talking Face Animation [20.91704034858042]
We model facial animation driven by both speech and emotion as a linear additive problem. We learn a set of blendshapes driven by speech and emotion that can be mapped to the expression and jaw pose parameters of the FLAME model. Our approach achieves superior emotional expressivity compared to existing methods, without compromising lip-sync quality.
arXiv Detail & Related papers (2025-10-29T07:29:21Z)
EmoDiffusion: Enhancing Emotional 3D Facial Animation with Latent Diffusion Models [66.67979602235015]
EmoDiffusion is a novel approach that disentangles different emotions in speech to generate rich 3D emotional facial expressions. We capture facial expressions under the guidance of animation experts using LiveLinkFace on an iPhone.
arXiv Detail & Related papers (2025-03-14T02:54:22Z)
JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation [10.003794924759765]
JoyVASA is a diffusion-based method for generating facial dynamics and head motion in audio-driven facial animation. We introduce a decoupled facial representation framework that separates dynamic facial expressions from static 3D facial representations. In the second stage, a diffusion transformer is trained to generate motion sequences directly from audio cues, independent of character identity.
arXiv Detail & Related papers (2024-11-14T06:13:05Z)
MMHead: Towards Fine-grained Multi-modal 3D Facial Animation [68.04052669266174]
We construct a large-scale multi-modal 3D facial animation dataset, MMHead. MMHead consists of 49 hours of 3D facial motion sequences, speech audio, and rich hierarchical text annotations. Based on the MMHead dataset, we establish benchmarks for two new tasks: text-induced 3D talking head animation and text-to-3D facial motion generation.
arXiv Detail & Related papers (2024-10-10T09:37:01Z)
Personalized Speech-driven Expressive 3D Facial Animation Synthesis with Style Control [1.8540152959438578]
A realistic facial animation system should consider identity-specific speaking styles and facial idiosyncrasies to achieve a high degree of naturalness and plausibility. We present a speech-driven expressive 3D facial animation synthesis framework that models identity-specific facial motion as latent representations (called styles). Our framework is trained in an end-to-end fashion and has a non-autoregressive encoder-decoder architecture with three main components.
arXiv Detail & Related papers (2023-10-25T21:22:28Z)
DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion [68.85904927374165]
We propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis. It captures the complex one-to-many relationships between speech and 3D face based on diffusion. It simultaneously achieves more realistic facial animation than the state-of-the-art methods.
arXiv Detail & Related papers (2023-08-23T04:14:55Z)
Audio-Driven Talking Face Generation with Diverse yet Realistic Facial Animations [61.65012981435094]
DIRFA is a novel method that can generate talking faces with diverse yet realistic facial animations from the same driving audio. To accommodate fair variation of plausible facial animations for the same audio, we design a transformer-based probabilistic mapping network. We show that DIRFA can generate talking faces with realistic facial animations effectively.
arXiv Detail & Related papers (2023-04-18T12:36:15Z)
FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation Synthesis Using Self-Supervised Speech Representation Learning [0.0]
FaceXHuBERT is a text-less speech-driven 3D facial animation generation method. It is very robust to background noise and can handle audio recorded in a variety of situations. It produces superior results with respect to the realism of the animation 78% of the time.
arXiv Detail & Related papers (2023-03-09T17:05:19Z)
MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement [142.9900055577252]
We propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face. Our approach ensures highly accurate lip motion, while also synthesizing plausible animation of the parts of the face that are uncorrelated to the audio signal, such as eye blinks and eyebrow motion.
arXiv Detail & Related papers (2021-04-16T17:05:40Z)
Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking. Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)

This list is automatically generated from the titles and abstracts of the papers on this site.