FaceFormer: Speech-Driven 3D Facial Animation with Transformers
- URL: http://arxiv.org/abs/2112.05329v1
- Date: Fri, 10 Dec 2021 04:21:59 GMT
- Title: FaceFormer: Speech-Driven 3D Facial Animation with Transformers
- Authors: Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, Taku Komura
- Abstract summary: Speech-driven 3D facial animation is challenging due to the complex geometry of human faces and the limited availability of 3D audio-visual data.
We propose a Transformer-based autoregressive model, FaceFormer, which encodes the long-term audio context and autoregressively predicts a sequence of animated 3D face meshes.
- Score: 46.8780140220063
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech-driven 3D facial animation is challenging due to the complex geometry
of human faces and the limited availability of 3D audio-visual data. Prior
works typically focus on learning phoneme-level features of short audio windows
with limited context, occasionally resulting in inaccurate lip movements. To
tackle this limitation, we propose a Transformer-based autoregressive model,
FaceFormer, which encodes the long-term audio context and autoregressively
predicts a sequence of animated 3D face meshes. To cope with the data scarcity
issue, we integrate self-supervised pre-trained speech representations.
Also, we devise two biased attention mechanisms well suited to this specific
task, including the biased cross-modal multi-head (MH) attention and the biased
causal MH self-attention with a periodic positional encoding strategy. The
former effectively aligns the audio-motion modalities, whereas the latter
improves the model's ability to generalize to longer audio sequences. Extensive
experiments and a perceptual user study show that our approach outperforms
existing state-of-the-art methods. The code will be made available.
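To make the mechanisms named in the abstract more concrete, the sketch below illustrates one plausible reading of them: a periodic positional encoding, an ALiBi-style biased causal self-attention mask, and a windowed audio-motion alignment bias for the cross-modal attention. The function names, the bucketed-distance penalty, and the alignment rule are assumptions for illustration only, not taken from the released FaceFormer code.

```python
# Minimal PyTorch sketch of the abstract's two attention biases and the periodic
# positional encoding. The exact penalty and alignment rule are assumed, not the
# official FaceFormer implementation.
import math
import torch


def periodic_positional_encoding(seq_len: int, d_model: int, period: int) -> torch.Tensor:
    """Sinusoidal encoding whose position index repeats every `period` frames,
    so the decoder can run on sequences longer than those seen in training."""
    pos = (torch.arange(seq_len) % period).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe


def biased_causal_mask(seq_len: int, period: int) -> torch.Tensor:
    """Causal self-attention mask with an ALiBi-style additive penalty that grows
    with the period-bucketed distance between query and key frames (assumed form)."""
    i = torch.arange(seq_len).unsqueeze(1)  # query index (current motion frame)
    j = torch.arange(seq_len).unsqueeze(0)  # key index (past motion frames)
    penalty = -torch.div(i - j, period, rounding_mode="floor").float()
    return torch.where(j <= i, penalty, torch.full_like(penalty, float("-inf")))


def cross_modal_alignment_bias(num_frames: int, audio_len: int) -> torch.Tensor:
    """Bias for audio-motion cross attention: each motion frame attends only to
    the audio features inside its own time window (assumed alignment rule)."""
    ratio = audio_len / num_frames
    bias = torch.full((num_frames, audio_len), float("-inf"))
    for i in range(num_frames):
        lo = int(i * ratio)
        hi = max(int((i + 1) * ratio), lo + 1)
        bias[i, lo:hi] = 0.0  # allow attention inside the aligned window
    return bias


# Usage: both biases are simply added to the attention logits before the softmax.
T, d, p, A = 8, 16, 4, 16  # motion frames, model dim, period, audio frames
x = torch.randn(T, d) + periodic_positional_encoding(T, d, p)
self_logits = x @ x.t() / math.sqrt(d) + biased_causal_mask(T, p)
self_attn = torch.softmax(self_logits, dim=-1)

audio = torch.randn(A, d)
cross_logits = x @ audio.t() / math.sqrt(d) + cross_modal_alignment_bias(T, A)
cross_attn = torch.softmax(cross_logits, dim=-1)
```

In this reading, the same additive-bias idea serves both roles: in the cross attention it steers each motion frame toward its corresponding audio window, while in the causal self-attention the periodic penalty lets the decoder extrapolate beyond the training sequence length.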
Related papers
- JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation [10.003794924759765]
JoyVASA is a diffusion-based method for generating facial dynamics and head motion in audio-driven facial animation.
We introduce a decoupled facial representation framework that separates dynamic facial expressions from static 3D facial representations.
In the second stage, a diffusion transformer is trained to generate motion sequences directly from audio cues, independent of character identity.
arXiv Detail & Related papers (2024-11-14T06:13:05Z)
- MMHead: Towards Fine-grained Multi-modal 3D Facial Animation [68.04052669266174]
We construct a large-scale multi-modal 3D facial animation dataset, MMHead.
MMHead consists of 49 hours of 3D facial motion sequences, speech audio, and rich hierarchical text annotations.
Based on the MMHead dataset, we establish benchmarks for two new tasks: text-induced 3D talking head animation and text-to-3D facial motion generation.
arXiv Detail & Related papers (2024-10-10T09:37:01Z)
- KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding [19.15471840100407]
We present a novel approach for synthesizing 3D facial motions from audio sequences using key motion embeddings.
Our method integrates linguistic and data-driven priors through two modules: the linguistic-based key motion acquisition and the cross-modal motion completion.
The latter extends key motions into a full sequence of 3D talking faces guided by audio features, improving temporal coherence and audio-visual consistency.
arXiv Detail & Related papers (2024-09-02T09:41:24Z)
- SAiD: Speech-driven Blendshape Facial Animation with Diffusion [6.4271091365094515]
Speech-driven 3D facial animation is challenging due to the scarcity of large-scale visual-audio datasets.
We propose SAiD, a speech-driven 3D facial animation method built on a diffusion model: a lightweight Transformer-based U-Net with a cross-modality alignment bias between the audio and visual streams to enhance lip synchronization.
arXiv Detail & Related papers (2023-12-25T04:40:32Z)
- GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z)
- PMMTalk: Speech-Driven 3D Facial Animation from Complementary Pseudo Multi-modal Features [22.31865247379668]
Speech-driven 3D facial animation has advanced considerably in recent years.
Most related works utilize only the acoustic modality and neglect the influence of visual and textual cues.
We present a novel framework, namely PMMTalk, using complementary Pseudo Multi-Modal features for improving the accuracy of facial animation.
arXiv Detail & Related papers (2023-12-05T14:12:38Z)
- GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation [71.73912454164834]
A modern talking face generation method is expected to achieve the goals of generalized audio-lip synchronization, good video quality, and high system efficiency.
NeRF has become a popular technique in this field since it could achieve high-fidelity and 3D-consistent talking face generation with a few-minute-long training video.
We propose GeneFace++ to handle these challenges by utilizing the pitch contour as an auxiliary feature and introducing a temporal loss in the facial motion prediction process.
arXiv Detail & Related papers (2023-05-01T12:24:09Z)
- Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model [96.24038430433885]
We propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face.
Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction works.
arXiv Detail & Related papers (2021-03-29T09:09:39Z)
- Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.