FaceFormer: Speech-Driven 3D Facial Animation with Transformers
- URL: http://arxiv.org/abs/2112.05329v1
- Date: Fri, 10 Dec 2021 04:21:59 GMT
- Title: FaceFormer: Speech-Driven 3D Facial Animation with Transformers
- Authors: Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, Taku Komura
- Abstract summary: Speech-driven 3D facial animation is challenging due to the complex geometry of human faces and the limited availability of 3D audio-visual data.
We propose a Transformer-based autoregressive model, FaceFormer, which encodes the long-term audio context and autoregressively predicts a sequence of animated 3D face meshes.
- Score: 46.8780140220063
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech-driven 3D facial animation is challenging due to the complex geometry
of human faces and the limited availability of 3D audio-visual data. Prior
works typically focus on learning phoneme-level features of short audio windows
with limited context, occasionally resulting in inaccurate lip movements. To
tackle this limitation, we propose a Transformer-based autoregressive model,
FaceFormer, which encodes the long-term audio context and autoregressively
predicts a sequence of animated 3D face meshes. To cope with the data scarcity
issue, we integrate the self-supervised pre-trained speech representations.
Also, we devise two biased attention mechanisms well suited to this specific
task, including the biased cross-modal multi-head (MH) attention and the biased
causal MH self-attention with a periodic positional encoding strategy. The
former effectively aligns the audio-motion modalities, whereas the latter
offers abilities to generalize to longer audio sequences. Extensive experiments
and a perceptual user study show that our approach outperforms the existing
state-of-the-arts. The code will be made available.
Related papers
- EmoVOCA: Speech-Driven Emotional 3D Talking Heads [12.161006152509653]
We propose an innovative data-driven technique for creating a synthetic dataset, called EmoVOCA.
We then designed and trained an emotional 3D talking head generator that accepts a 3D face, an audio file, an emotion label, and an intensity value as inputs, and learns to animate the audio-synchronized lip movements with expressive traits of the face.
arXiv Detail & Related papers (2024-03-19T16:33:26Z) - SAiD: Speech-driven Blendshape Facial Animation with Diffusion [6.4271091365094515]
Speech-driven 3D facial animation is challenging due to the scarcity of large-scale visual-audio datasets.
We propose a speech-driven 3D facial animation with a diffusion model (SAiD), a lightweight Transformer-based U-Net with a cross-modality alignment bias between audio and visual to enhance lip synchronization.
arXiv Detail & Related papers (2023-12-25T04:40:32Z) - GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained
3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3d face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z) - PMMTalk: Speech-Driven 3D Facial Animation from Complementary Pseudo
Multi-modal Features [22.31865247379668]
Speech-driven 3D facial animation has improved a lot recently.
Most related works only utilize acoustic modality and neglect the influence of visual and textual cues.
We present a novel framework, namely PMMTalk, using complementary Pseudo Multi-Modal features for improving the accuracy of facial animation.
arXiv Detail & Related papers (2023-12-05T14:12:38Z) - DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D
Facial Animation [10.73030153404956]
We propose a cross-modal dual-learning framework, termed DualTalker, to improve data usage efficiency.
The framework is trained jointly with the primary task (audio-driven facial animation) and its dual task (lip reading) and shares common audio/motion encoder components.
Our approach outperforms current state-of-the-art methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-11-08T15:39:56Z) - GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking
Face Generation [71.73912454164834]
A modern talking face generation method is expected to achieve the goals of generalized audio-lip synchronization, good video quality, and high system efficiency.
NeRF has become a popular technique in this field since it could achieve high-fidelity and 3D-consistent talking face generation with a few-minute-long training video.
We propose GeneFace++ to handle these challenges by utilizing the rendering pitch contour as an auxiliary feature and introducing a temporal loss in the facial motion prediction process.
arXiv Detail & Related papers (2023-05-01T12:24:09Z) - Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model [96.24038430433885]
We propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face.
Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction works.
arXiv Detail & Related papers (2021-03-29T09:09:39Z) - Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z) - Audio-driven Talking Face Video Generation with Learning-based
Personalized Head Pose [67.31838207805573]
We propose a deep neural network model that takes an audio signal A of a source person and a short video V of a target person as input.
We outputs a synthesized high-quality talking face video with personalized head pose.
Our method can generate high-quality talking face videos with more distinguishing head movement effects than state-of-the-art methods.
arXiv Detail & Related papers (2020-02-24T10:02:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.