FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using
Diffusion
- URL: http://arxiv.org/abs/2309.11306v1
- Date: Wed, 20 Sep 2023 13:33:00 GMT
- Title: FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using
Diffusion
- Authors: Stefan Stan and Kazi Injamamul Haque and Zerrin Yumak
- Abstract summary: We present FaceDiffuser, a non-deterministic deep learning model to generate speech-driven facial animations.
Our method is based on the diffusion technique and uses the pre-trained large speech representation model HuBERT to encode the audio input.
We also introduce a new in-house dataset that is based on a blendshape-based rigged character.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech-driven 3D facial animation synthesis has been a challenging task both
in industry and research. Recent methods mostly focus on deterministic deep
learning methods meaning that given a speech input, the output is always the
same. However, in reality, the non-verbal facial cues that reside throughout
the face are non-deterministic in nature. In addition, the majority of
approaches focus on 3D vertex-based datasets, and methods that are compatible
with existing facial animation pipelines with rigged characters are scarce. To
eliminate these issues, we present FaceDiffuser, a non-deterministic deep
learning model to generate speech-driven facial animations that is trained with
both 3D vertex- and blendshape-based datasets. Our method is based on the
diffusion technique and uses the pre-trained large speech representation model
HuBERT to encode the audio input. To the best of our knowledge, we are the
first to employ the diffusion method for the task of speech-driven 3D facial
animation synthesis. We have run extensive objective and subjective analyses
and show that our approach achieves better or comparable results in comparison
to the state-of-the-art methods. We also introduce a new in-house dataset that
is based on a blendshape-based rigged character. We recommend watching the
accompanying supplementary video. The code and the dataset will be publicly
available.
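For illustration, below is a minimal sketch of how a diffusion-based sampler conditioned on HuBERT audio features could be structured. It assumes torch and transformers as dependencies; the denoising network, diffusion schedule, motion dimensionality (e.g. 52 blendshape weights), and the HuBERT checkpoint name are illustrative assumptions, not the authors' released implementation.

```python
# Sketch: speech-conditioned diffusion sampling for facial animation.
# Architecture and hyper-parameters are assumptions for illustration only.
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, HubertModel

class DenoisingNet(nn.Module):
    """Predicts the noise added to a motion sequence (blendshape weights or
    flattened vertices), conditioned on HuBERT features and the timestep."""
    def __init__(self, motion_dim=52, audio_dim=768, hidden=256, steps=1000):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.motion_proj = nn.Linear(motion_dim, hidden)
        self.time_embed = nn.Embedding(steps, hidden)
        self.backbone = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, motion_dim)

    def forward(self, noisy_motion, audio_feat, t):
        # Fuse motion, audio, and timestep information frame by frame.
        h = self.motion_proj(noisy_motion) + self.audio_proj(audio_feat)
        h = h + self.time_embed(t)[:, None, :]
        h, _ = self.backbone(h)
        return self.out(h)

@torch.no_grad()
def sample_animation(waveform, sample_rate=16000, steps=1000, motion_dim=52):
    """Runs the reverse (denoising) process to generate one animation clip."""
    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
    hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()
    inputs = extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    audio_feat = hubert(inputs.input_values).last_hidden_state  # (1, T, 768)

    model = DenoisingNet(motion_dim=motion_dim, steps=steps).eval()  # trained in practice
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, audio_feat.shape[1], motion_dim)  # start from pure noise
    for t in reversed(range(steps)):
        t_batch = torch.full((1,), t, dtype=torch.long)
        eps = model(x, audio_feat, t_batch)
        # Standard DDPM posterior mean update.
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x  # (1, T, motion_dim) animation frames
```

Starting from Gaussian noise and iteratively denoising, rather than regressing a single output from the audio, is what makes the generation non-deterministic: the same speech clip can yield different yet plausible animations.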
Related papers
- 3DiFACE: Diffusion-based Speech-driven 3D Facial Animation and Editing [22.30870274645442]
We present 3DiFACE, a novel method for personalized speech-driven 3D facial animation and editing.
Our method outperforms existing state-of-the-art techniques and yields speech-driven animations with greater fidelity and diversity.
arXiv Detail & Related papers (2023-12-01T19:01:05Z) - DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D
Face Diffuser [12.576421368393113]
Speech-driven 3D facial animation has been an attractive task in academia and industry.
Recent approaches have started to consider the non-deterministic nature of speech-driven 3D face animation and employ diffusion models for the task.
We propose DiffusionTalker, a diffusion-based method that utilizes contrastive learning to personalize 3D facial animation and knowledge distillation to accelerate 3D animation generation.
arXiv Detail & Related papers (2023-11-28T07:13:20Z) - DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models [24.401443462720135]
We propose DiffPoseTalk, a generative framework based on the diffusion model combined with a style encoder.
In particular, our style includes the generation of head poses, thereby enhancing user perception.
We address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset.
arXiv Detail & Related papers (2023-09-30T17:01:18Z) - DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with
Diffusion [68.85904927374165]
We propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis.
It captures the complex one-to-many relationships between speech and 3D face based on diffusion.
It simultaneously achieves more realistic facial animation than the state-of-the-art methods.
arXiv Detail & Related papers (2023-08-23T04:14:55Z) - FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation
Synthesis Using Self-Supervised Speech Representation Learning [0.0]
FaceXHuBERT is a text-less speech-driven 3D facial animation generation method.
It is very robust to background noise and can handle audio recorded in a variety of situations.
It produces superior results with respect to the realism of the animation 78% of the time.
arXiv Detail & Related papers (2023-03-09T17:05:19Z) - Pose-Controllable 3D Facial Animation Synthesis using Hierarchical
Audio-Vertex Attention [52.63080543011595]
A novel pose-controllable 3D facial animation synthesis method is proposed by utilizing hierarchical audio-vertex attention.
The proposed method can produce more realistic facial expressions and head posture movements.
arXiv Detail & Related papers (2023-02-24T09:36:31Z) - Imitator: Personalized Speech-driven 3D Facial Animation [63.57811510502906]
State-of-the-art methods deform the face topology of the target actor to sync the input audio without considering the identity-specific speaking style and facial idiosyncrasies of the target actor.
We present Imitator, a speech-driven facial expression synthesis method, which learns identity-specific details from a short input video.
We show that our approach produces temporally coherent facial expressions from input audio while preserving the speaking style of the target actors.
arXiv Detail & Related papers (2022-12-30T19:00:02Z) - MeshTalk: 3D Face Animation from Speech using Cross-Modality
Disentanglement [142.9900055577252]
We propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face.
Our approach ensures highly accurate lip motion, while also producing plausible animation of the parts of the face that are uncorrelated with the audio signal, such as eye blinks and eyebrow motion.
arXiv Detail & Related papers (2021-04-16T17:05:40Z) - Learning Speech-driven 3D Conversational Gestures from Video [106.15628979352738]
We propose the first approach to automatically and jointly synthesize both the synchronous 3D conversational body and hand gestures.
Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures.
We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people.
arXiv Detail & Related papers (2021-02-13T01:05:39Z) - Audio-driven Talking Face Video Generation with Learning-based
Personalized Head Pose [67.31838207805573]
We propose a deep neural network model that takes an audio signal A of a source person and a short video V of a target person as input.
It outputs a synthesized high-quality talking face video with personalized head pose.
Our method can generate high-quality talking face videos with more distinguishing head movement effects than state-of-the-art methods.
arXiv Detail & Related papers (2020-02-24T10:02:10Z)