FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation
Synthesis Using Self-Supervised Speech Representation Learning
- URL: http://arxiv.org/abs/2303.05416v1
- Date: Thu, 9 Mar 2023 17:05:19 GMT
- Title: FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation
Synthesis Using Self-Supervised Speech Representation Learning
- Authors: Kazi Injamamul Haque and Zerrin Yumak
- Abstract summary: FaceXHuBERT is a text-less speech-driven 3D facial animation generation method.
It is very robust to background noise and can handle audio recorded in a variety of situations.
In a perceptual user study, it produces more realistic animation than the state-of-the-art 78% of the time.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents FaceXHuBERT, a text-less speech-driven 3D facial
animation generation method that captures personalized and subtle cues in speech
(e.g. identity, emotion and hesitation). It is also very robust to background
noise and can handle audio recorded in a variety of situations (e.g. multiple
people speaking). Recent approaches employ end-to-end deep learning that takes
both audio and text as input to generate facial animation for the whole face.
However, the scarcity of publicly available expressive audio-3D facial animation
datasets poses a major bottleneck. The resulting animations still have issues
with accurate lip-synching, expressivity, person-specific information and
generalizability. We employ a self-supervised pretrained HuBERT model in the
training process, which allows us to incorporate both lexical and non-lexical
information in the audio without using a large lexicon. Additionally, guiding
the training with a binary emotion condition and speaker identity helps the
network capture even the subtlest facial motions. We carried out extensive
objective and subjective evaluations against ground truth and state-of-the-art
work. A perceptual user study demonstrates that our approach produces more
realistic animation than the state-of-the-art 78% of the time. In addition, our
method is 4 times faster because it eliminates complex sequential models such as
transformers. We strongly recommend watching the supplementary video before
reading the paper. We also provide the implementation and evaluation code via a
GitHub repository.
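To make the approach concrete, below is a minimal Python/PyTorch sketch of the core idea described in the abstract: encode raw audio with a pretrained HuBERT model and decode per-frame facial motion conditioned on a binary emotion flag and a one-hot speaker identity. The GRU decoder, hidden sizes, vertex count and HuggingFace checkpoint name are illustrative assumptions, not the paper's exact architecture.

# Minimal sketch (not the authors' code): HuBERT speech features conditioned on a
# binary emotion flag and a one-hot speaker identity, decoded into per-frame 3D
# vertex displacements. Decoder type, dimensions and checkpoint are assumptions.
import torch
import torch.nn as nn
from transformers import HubertModel, Wav2Vec2FeatureExtractor

class SpeechToFaceSketch(nn.Module):
    def __init__(self, num_speakers=8, num_vertices=5023, hidden=256):
        super().__init__()
        # Pretrained self-supervised speech encoder (frozen here for simplicity)
        self.hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")
        for p in self.hubert.parameters():
            p.requires_grad = False
        cond_dim = 1 + num_speakers                 # emotion flag + speaker one-hot
        # Lightweight recurrent decoder instead of a transformer
        self.gru = nn.GRU(self.hubert.config.hidden_size + cond_dim,
                          hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_vertices * 3)

    def forward(self, input_values, emotion, speaker_onehot):
        # input_values: (B, samples) at 16 kHz; emotion: (B, 1); speaker_onehot: (B, S)
        feats = self.hubert(input_values).last_hidden_state   # (B, T, 768)
        cond = torch.cat([emotion, speaker_onehot], dim=-1)   # (B, 1 + S)
        cond = cond.unsqueeze(1).expand(-1, feats.size(1), -1)
        h, _ = self.gru(torch.cat([feats, cond], dim=-1))
        return self.head(h)   # (B, T, num_vertices * 3) per-frame displacements

# Toy usage with random audio; a real pipeline would load and resample a wav to 16 kHz.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
wave = torch.randn(16000 * 2)                       # 2 seconds of dummy audio
inputs = extractor(wave.numpy(), sampling_rate=16000, return_tensors="pt")
model = SpeechToFaceSketch()
emotion = torch.tensor([[1.0]])                     # "expressive" condition
speaker = torch.zeros(1, 8); speaker[0, 0] = 1.0    # speaker 0
offsets = model(inputs.input_values, emotion, speaker)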
Related papers
- 3DiFACE: Diffusion-based Speech-driven 3D Facial Animation and Editing [22.30870274645442]
We present 3DiFACE, a novel method for personalized speech-driven 3D facial animation and editing.
Our method outperforms existing state-of-the-art techniques and yields speech-driven animations with greater fidelity and diversity.
arXiv Detail & Related papers (2023-12-01T19:01:05Z) - DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D
Face Diffuser [12.576421368393113]
Speech-driven 3D facial animation has been an attractive task in academia and industry.
Recent approaches have started to consider the non-deterministic nature of speech-driven 3D facial animation and employ diffusion models for the task.
We propose DiffusionTalker, a diffusion-based method that utilizes contrastive learning to personalize 3D facial animation and knowledge distillation to accelerate 3D animation generation.
arXiv Detail & Related papers (2023-11-28T07:13:20Z) - FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using
Diffusion [0.0]
We present FaceDiffuser, a non-deterministic deep learning model to generate speech-driven facial animations.
Our method is based on the diffusion technique and uses the pretrained large speech representation model HuBERT to encode the audio input; a rough sketch of this diffusion setup appears after this related-papers list.
We also introduce a new in-house dataset based on a blendshape-rigged character.
arXiv Detail & Related papers (2023-09-20T13:33:00Z) - DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with
Diffusion [68.85904927374165]
We propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis method.
It captures the complex one-to-many relationships between speech and 3D faces based on diffusion.
It also achieves more realistic facial animation than state-of-the-art methods.
arXiv Detail & Related papers (2023-08-23T04:14:55Z) - Audio-Driven Talking Face Generation with Diverse yet Realistic Facial
Animations [61.65012981435094]
DIRFA is a novel method that can generate talking faces with diverse yet realistic facial animations from the same driving audio.
To accommodate a fair range of plausible facial animations for the same audio, we design a transformer-based probabilistic mapping network.
We show that DIRFA can generate talking faces with realistic facial animations effectively.
arXiv Detail & Related papers (2023-04-18T12:36:15Z) - Imitator: Personalized Speech-driven 3D Facial Animation [63.57811510502906]
State-of-the-art methods deform the face topology of the target actor to sync with the input audio, without considering the identity-specific speaking style and facial idiosyncrasies of the target actor.
We present Imitator, a speech-driven facial expression synthesis method, which learns identity-specific details from a short input video.
We show that our approach produces temporally coherent facial expressions from input audio while preserving the speaking style of the target actors.
arXiv Detail & Related papers (2022-12-30T19:00:02Z) - Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation [46.8780140220063]
We present a joint audio-text model to capture contextual information for expressive speech-driven 3D facial animation.
Our hypothesis is that the text features can disambiguate the variations in upper face expressions, which are not strongly correlated with the audio.
We show that the combined acoustic and textual modalities can synthesize realistic facial expressions while maintaining audio-lip synchronization.
arXiv Detail & Related papers (2021-12-04T01:37:22Z) - MeshTalk: 3D Face Animation from Speech using Cross-Modality
Disentanglement [142.9900055577252]
We propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face.
Our approach ensures highly accurate lip motion, while also producing plausible animation of the parts of the face that are uncorrelated with the audio signal, such as eye blinks and eyebrow motion.
arXiv Detail & Related papers (2021-04-16T17:05:40Z) - Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)
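As a follow-up to the FaceDiffuser and DF-3DFace entries above, here is a rough Python/PyTorch sketch of how a diffusion model can be conditioned on HuBERT audio features for facial motion: a denoiser predicts the noise added to a motion sequence, given time-aligned audio features and the diffusion timestep. The network layout, dimensions and noise schedule are assumptions for illustration, not any of those papers' actual implementations.

# Sketch (assumptions, not FaceDiffuser/DF-3DFace code): DDPM-style training step
# for a facial-motion denoiser conditioned on HuBERT features and the timestep.
import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    def __init__(self, motion_dim=52, audio_dim=768, hidden=256, steps=1000):
        super().__init__()
        self.t_embed = nn.Embedding(steps, hidden)      # diffusion timestep embedding
        self.audio_proj = nn.Linear(audio_dim, hidden)  # project HuBERT features
        self.net = nn.GRU(motion_dim + hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, motion_dim)

    def forward(self, noisy_motion, audio_feats, t):
        # noisy_motion: (B, T, motion_dim); audio_feats: (B, T, audio_dim); t: (B,)
        cond = self.audio_proj(audio_feats) + self.t_embed(t).unsqueeze(1)
        h, _ = self.net(torch.cat([noisy_motion, cond], dim=-1))
        return self.out(h)                              # predicted noise

def diffusion_loss(model, motion, audio_feats, alphas_cumprod):
    # Standard epsilon-prediction objective: noise the motion at a random step,
    # then train the model to predict the added noise.
    B = motion.size(0)
    t = torch.randint(0, alphas_cumprod.size(0), (B,))
    noise = torch.randn_like(motion)
    a = alphas_cumprod[t].view(B, 1, 1)
    noisy = a.sqrt() * motion + (1.0 - a).sqrt() * noise
    return nn.functional.mse_loss(model(noisy, audio_feats, t), noise)

# Toy usage: blendshape-style motion (52 coefficients) driven by fake HuBERT features.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
model = MotionDenoiser()
motion = torch.randn(2, 100, 52)
audio = torch.randn(2, 100, 768)
loss = diffusion_loss(model, motion, audio, alphas_cumprod)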