DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with
Diffusion
- URL: http://arxiv.org/abs/2310.05934v1
- Date: Wed, 23 Aug 2023 04:14:55 GMT
- Title: DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with
Diffusion
- Authors: Se Jin Park, Joanna Hong, Minsu Kim, Yong Man Ro
- Abstract summary: We propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis method.
It captures the complex one-to-many relationships between speech and 3D faces using diffusion.
At the same time, it achieves more realistic facial animation than state-of-the-art methods.
- Score: 68.85904927374165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech-driven 3D facial animation has gained significant attention for its
ability to create realistic and expressive facial animations in 3D space based
on speech. Learning-based methods have shown promising progress in achieving
accurate facial motion synchronized with speech. However, the one-to-many nature of
speech-to-3D facial synthesis has not been fully explored: while the lips must
accurately synchronize with the speech content, other facial attributes beyond
speech-related motions can vary for the same speech. To account for
the potential variance in the facial attributes within a single speech, we
propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis method.
DF-3DFace captures the complex one-to-many relationships between speech and 3D
face based on diffusion. It concurrently achieves aligned lip motion by
exploiting audio-mesh synchronization and masked conditioning. Furthermore, the
proposed method jointly models identity and pose in addition to facial motions
so that it can generate 3D face animation without requiring a reference
identity mesh and produce natural head poses. We contribute a new large-scale
3D facial mesh dataset, 3D-HDTF, to enable the synthesis of variations in
identities, poses, and facial motions of 3D face meshes. Extensive experiments
demonstrate that our method successfully generates highly variable facial
shapes and motions from speech and simultaneously achieves more realistic
facial animation than the state-of-the-art methods.
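For illustration only, here is a minimal PyTorch sketch of the kind of speech-conditioned diffusion training step the abstract describes: a denoiser predicts the noise added to per-frame vertex offsets, with the speech condition randomly masked so that both conditional and unconditional behavior are learned. The network layout, feature sizes, and noise schedule are assumptions for the example, not the DF-3DFace implementation.
```python
# Hedged sketch of a speech-conditioned diffusion denoiser over mesh vertices.
import torch
import torch.nn as nn

class MeshDenoiser(nn.Module):
    def __init__(self, n_vertices=5023, audio_dim=768, hidden=512):
        super().__init__()
        self.in_proj = nn.Linear(n_vertices * 3, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.time_embed = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True), num_layers=4)
        self.out_proj = nn.Linear(hidden, n_vertices * 3)

    def forward(self, noisy_verts, t, audio_feats, cond_mask):
        # noisy_verts: (B, T, V*3), audio_feats: (B, T, audio_dim), t: (B,)
        # cond_mask: (B, 1, 1) with 1 keeping the speech condition and 0 dropping it
        h = self.in_proj(noisy_verts)
        h = h + self.audio_proj(audio_feats) * cond_mask             # masked conditioning
        h = h + self.time_embed(t.float().view(-1, 1, 1) / 1000.0)   # diffusion timestep embedding
        return self.out_proj(self.backbone(h))

def training_step(model, verts, audio_feats, num_steps=1000, p_drop=0.1):
    # verts: ground-truth per-frame vertex offsets, shape (B, T, V*3)
    B = verts.shape[0]
    t = torch.randint(0, num_steps, (B,))
    beta = torch.linspace(1e-4, 0.02, num_steps)
    alpha_bar = torch.cumprod(1.0 - beta, dim=0)[t].view(B, 1, 1)
    noise = torch.randn_like(verts)
    noisy = alpha_bar.sqrt() * verts + (1.0 - alpha_bar).sqrt() * noise
    cond_mask = (torch.rand(B, 1, 1) > p_drop).float()               # randomly drop the speech condition
    pred_noise = model(noisy, t, audio_feats, cond_mask)
    return nn.functional.mse_loss(pred_noise, noise)
```
Sampling with different noise seeds at inference is what gives the one-to-many behavior: the same speech can yield different yet plausible facial shapes and motions.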
Related papers
- MMHead: Towards Fine-grained Multi-modal 3D Facial Animation [68.04052669266174]
We construct a large-scale multi-modal 3D facial animation dataset, MMHead.
MMHead consists of 49 hours of 3D facial motion sequences, speech audio, and rich hierarchical text annotations.
Based on the MMHead dataset, we establish benchmarks for two new tasks: text-induced 3D talking head animation and text-to-3D facial motion generation.
arXiv Detail & Related papers (2024-10-10T09:37:01Z)
- 3DFacePolicy: Speech-Driven 3D Facial Animation with Diffusion Policy [1.3499500088995464]
We propose 3DFacePolicy, a diffusion policy model for 3D facial animation prediction.
The method generates variable and realistic human facial movements.
Experiments show that our approach is effective at generating variable and dynamic facial motion.
arXiv Detail & Related papers (2024-09-17T02:30:34Z)
- Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance [41.692420421029695]
We introduce GNPFA, an efficient variational auto-encoder mapping facial geometry and images to a highly generalized expression latent space.
We then use GNPFA to extract high-quality expressions and accurate head poses from a large array of videos.
We propose Media2Face, a diffusion model in GNPFA latent space for co-speech facial animation generation.
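Since this entry centers on encoding geometry into an expression latent space and then running diffusion there, here is a generic, hedged sketch of such a variational auto-encoder over flattened vertex data; the dimensions and layer choices are assumptions and this is not the GNPFA architecture itself.
```python
# Generic expression-VAE sketch (assumed dimensions): geometry is compressed into a
# compact latent, and a diffusion model would then be trained over these latents.
import torch
import torch.nn as nn

class ExpressionVAE(nn.Module):
    def __init__(self, n_vertices=5023, latent_dim=128, hidden=1024):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_vertices * 3, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_vertices * 3))

    def forward(self, verts):
        mu, logvar = self.encoder(verts).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()      # reparameterization trick
        recon = self.decoder(z)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return recon, kl

vae = ExpressionVAE()
verts = torch.randn(4, 5023 * 3)              # a batch of flattened vertex offsets
recon, kl = vae(verts)
loss = nn.functional.mse_loss(recon, verts) + 1e-4 * kl
```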
arXiv Detail & Related papers (2024-01-28T16:17:59Z)
- 3DiFACE: Diffusion-based Speech-driven 3D Facial Animation and Editing [22.30870274645442]
We present 3DiFACE, a novel method for personalized speech-driven 3D facial animation and editing.
Our method outperforms existing state-of-the-art techniques and yields speech-driven animations with greater fidelity and diversity.
arXiv Detail & Related papers (2023-12-01T19:01:05Z)
- Breathing Life into Faces: Speech-driven 3D Facial Animation with Natural Head Pose and Detailed Shape [19.431264557873117]
We introduce VividTalker, a new framework designed to facilitate speech-driven 3D facial animation.
We explicitly disentangle facial animation into head pose and mouth movement and encode them separately.
We construct a new 3D dataset with detailed shapes and learn to synthesize facial details in line with speech content.
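To make the "encode them separately" idea above concrete, here is an assumed two-branch sketch in which head pose and mouth motion are predicted by separate encoders over the same audio features; the layer types and output sizes are illustrative, not VividTalker's actual design.
```python
# Illustrative two-branch model (assumed design): head pose and mouth motion are
# encoded by separate recurrent branches, so the two factors have separate parameters.
import torch
import torch.nn as nn

class DisentangledEncoders(nn.Module):
    def __init__(self, audio_dim=768, hidden=256, n_mouth_params=64):
        super().__init__()
        self.pose_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        self.mouth_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        self.pose_head = nn.Linear(hidden, 6)                # per-frame rotation + translation
        self.mouth_head = nn.Linear(hidden, n_mouth_params)  # per-frame mouth/expression coefficients

    def forward(self, audio_feats):
        hp, _ = self.pose_enc(audio_feats)
        hm, _ = self.mouth_enc(audio_feats)
        return self.pose_head(hp), self.mouth_head(hm)

model = DisentangledEncoders()
audio = torch.randn(2, 100, 768)              # 2 clips, 100 frames of audio features
pose, mouth = model(audio)                    # (2, 100, 6), (2, 100, 64)
```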
arXiv Detail & Related papers (2023-10-31T07:47:19Z)
- FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using Diffusion [0.0]
We present FaceDiffuser, a non-deterministic deep learning model to generate speech-driven facial animations.
Our method is based on the diffusion technique and uses the pre-trained large speech representation model HuBERT to encode the audio input.
We also introduce a new in-house dataset based on a blendshape-based rigged character.
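As a hedged illustration of the audio-encoding step mentioned above, a pre-trained HuBERT model can be loaded through the Hugging Face transformers library; the checkpoint name below is a common public one and may differ from the paper's exact setup.
```python
# Hedged example of HuBERT audio encoding with Hugging Face transformers.
# "facebook/hubert-base-ls960" is a commonly used public checkpoint, not necessarily
# the one used by FaceDiffuser.
import torch
from transformers import HubertModel

hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

waveform = torch.randn(1, 16000)       # (batch, samples): 1 s of 16 kHz audio, placeholder
with torch.no_grad():                  # in practice, normalize the waveform as the model expects
    feats = hubert(waveform).last_hidden_state   # roughly (1, 49, 768) frame-level features

# These frame-level speech features would then condition the animation model.
print(feats.shape)
```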
arXiv Detail & Related papers (2023-09-20T13:33:00Z)
- Audio-Driven Talking Face Generation with Diverse yet Realistic Facial Animations [61.65012981435094]
DIRFA is a novel method that can generate talking faces with diverse yet realistic facial animations from the same driving audio.
To accommodate fair variation of plausible facial animations for the same audio, we design a transformer-based probabilistic mapping network.
We show that DIRFA can generate talking faces with realistic facial animations effectively.
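A transformer-based probabilistic mapping, as described above, can be sketched as a network that predicts a per-frame distribution over animation parameters and samples from it; the architecture and dimensions below are assumptions for illustration, not DIRFA's implementation.
```python
# Illustrative probabilistic mapping (assumed design): a transformer encodes audio
# features and predicts a per-frame Gaussian over animation parameters, so different
# samples give different but plausible animations for the same audio.
import torch
import torch.nn as nn

class ProbabilisticMapper(nn.Module):
    def __init__(self, audio_dim=768, hidden=256, anim_dim=64):
        super().__init__()
        self.in_proj = nn.Linear(audio_dim, hidden)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True), num_layers=2)
        self.out_proj = nn.Linear(hidden, 2 * anim_dim)      # per-frame mean and log-variance

    def forward(self, audio_feats):
        h = self.encoder(self.in_proj(audio_feats))
        mu, logvar = self.out_proj(h).chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # one sample per call

mapper = ProbabilisticMapper()
audio = torch.randn(1, 100, 768)
anim_a = mapper(audio)        # two calls on the same audio give
anim_b = mapper(audio)        # two different plausible animations
```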
arXiv Detail & Related papers (2023-04-18T12:36:15Z)
- Pose-Controllable 3D Facial Animation Synthesis using Hierarchical Audio-Vertex Attention [52.63080543011595]
A novel pose-controllable 3D facial animation synthesis method is proposed that utilizes hierarchical audio-vertex attention.
The proposed method can produce more realistic facial expressions and head posture movements.
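As a simplified, single-level illustration of audio-vertex attention (the paper's hierarchical design is not reproduced here), per-vertex queries can cross-attend to frame-level audio features with standard multi-head attention:
```python
# Simplified cross-attention sketch (one level only, assumed dimensions):
# per-vertex queries attend to frame-level audio features.
import torch
import torch.nn as nn

class AudioVertexAttention(nn.Module):
    def __init__(self, vert_dim=64, audio_dim=768, hidden=128, heads=4):
        super().__init__()
        self.q = nn.Linear(vert_dim, hidden)
        self.kv = nn.Linear(audio_dim, hidden)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.out = nn.Linear(hidden, 3)                 # per-vertex displacement

    def forward(self, vert_feats, audio_feats):
        # vert_feats: (B, V, vert_dim), audio_feats: (B, T, audio_dim)
        q = self.q(vert_feats)
        kv = self.kv(audio_feats)
        h, _ = self.attn(q, kv, kv)                     # vertices attend to audio frames
        return self.out(h)                              # (B, V, 3)

layer = AudioVertexAttention()
offsets = layer(torch.randn(1, 5023, 64), torch.randn(1, 100, 768))
```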
arXiv Detail & Related papers (2023-02-24T09:36:31Z)
- MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement [142.9900055577252]
We propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face.
Our approach ensures highly accurate lip motion, while also producing plausible animation of the parts of the face that are uncorrelated with the audio signal, such as eye blinks and eyebrow motion.
arXiv Detail & Related papers (2021-04-16T17:05:40Z)
- Learning Speech-driven 3D Conversational Gestures from Video [106.15628979352738]
We propose the first approach to automatically and jointly synthesize synchronous 3D conversational body and hand gestures.
Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures.
We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people.
arXiv Detail & Related papers (2021-02-13T01:05:39Z)