PMMTalk: Speech-Driven 3D Facial Animation from Complementary Pseudo
Multi-modal Features
- URL: http://arxiv.org/abs/2312.02781v1
- Date: Tue, 5 Dec 2023 14:12:38 GMT
- Title: PMMTalk: Speech-Driven 3D Facial Animation from Complementary Pseudo
Multi-modal Features
- Authors: Tianshun Han, Shengnan Gui, Yiqing Huang, Baihui Li, Lijian Liu,
Benjia Zhou, Ning Jiang, Quan Lu, Ruicong Zhi, Yanyan Liang, Du Zhang, Jun
Wan
- Abstract summary: Speech-driven 3D facial animation has improved considerably in recent years.
Most related works rely solely on the acoustic modality and neglect the influence of visual and textual cues.
We present a novel framework, PMMTalk, which uses complementary Pseudo Multi-Modal features to improve the accuracy of facial animation.
- Score: 22.31865247379668
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech-driven 3D facial animation has improved considerably in recent
years, yet most related works rely solely on the acoustic modality and neglect
visual and textual cues, leading to unsatisfactory results in terms of precision
and coherence. We argue that visual and textual cues carry non-trivial
information. Therefore, we present a novel framework, PMMTalk, which uses
complementary Pseudo Multi-Modal features to improve the accuracy of facial
animation. The framework comprises three modules: the PMMTalk encoder, the
cross-modal alignment module, and the PMMTalk decoder. Specifically, the
PMMTalk encoder employs off-the-shelf talking-head generation and speech
recognition models to extract visual and textual information from speech,
respectively. The cross-modal alignment module then aligns the audio-image-text
features at the temporal and semantic levels, and the PMMTalk decoder predicts
lip-synced facial blendshape coefficients. Unlike prior methods, PMMTalk
requires only an additional random reference face image yet yields more
accurate results. It is also artist-friendly, as the predicted facial
blendshape coefficients integrate seamlessly into standard animation production
workflows. Finally, given the scarcity of 3D talking-face datasets, we
introduce a large-scale 3D Chinese Audio-Visual Facial Animation (3D-CAVFA)
dataset. Extensive experiments and user studies show that our approach
outperforms the state of the art. We recommend watching the supplementary
video.
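To make the three-stage pipeline concrete, the sketch below shows one plausible PyTorch realization of the idea described in the abstract: pseudo visual and textual features derived from speech, cross-attention-based alignment with the audio features, and a decoder that outputs per-frame blendshape coefficients. All module names, feature dimensions (e.g. 768-d speech features, 52 blendshapes), and the fusion strategy are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a PMMTalk-style pipeline (assumed architecture, not the paper's code).
import torch
import torch.nn as nn


class PseudoMultiModalEncoder(nn.Module):
    """Derives pseudo visual and textual features from audio features.

    In the paper, off-the-shelf talking-head generation and speech recognition
    models play this role; simple MLPs stand in for them here.
    """

    def __init__(self, audio_dim=768, feat_dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, feat_dim)
        self.to_visual = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                       nn.Linear(feat_dim, feat_dim))
        self.to_text = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                     nn.Linear(feat_dim, feat_dim))

    def forward(self, audio_feats):                      # (B, T, audio_dim)
        a = self.audio_proj(audio_feats)                 # (B, T, feat_dim)
        return a, self.to_visual(a), self.to_text(a)     # audio, pseudo-visual, pseudo-text


class CrossModalAlignment(nn.Module):
    """Aligns audio, image, and text features via cross-attention
    (one plausible realization of temporal/semantic alignment)."""

    def __init__(self, feat_dim=256, heads=4):
        super().__init__()
        self.attn_av = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.attn_at = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.fuse = nn.Linear(3 * feat_dim, feat_dim)

    def forward(self, a, v, t):
        av, _ = self.attn_av(a, v, v)                    # audio attends to pseudo-visual
        at, _ = self.attn_at(a, t, t)                    # audio attends to pseudo-text
        return self.fuse(torch.cat([a, av, at], dim=-1)) # fused per-frame features


class BlendshapeDecoder(nn.Module):
    """Maps fused features to per-frame facial blendshape coefficients."""

    def __init__(self, feat_dim=256, num_blendshapes=52):
        super().__init__()
        self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, num_blendshapes)

    def forward(self, fused):
        h, _ = self.temporal(fused)
        return torch.sigmoid(self.head(h))               # coefficients in [0, 1]


if __name__ == "__main__":
    B, T = 2, 100                                        # batch of 2 clips, 100 frames
    audio_feats = torch.randn(B, T, 768)                 # e.g. features from a speech encoder
    enc, align, dec = PseudoMultiModalEncoder(), CrossModalAlignment(), BlendshapeDecoder()
    a, v, t = enc(audio_feats)
    coeffs = dec(align(a, v, t))
    print(coeffs.shape)                                  # torch.Size([2, 100, 52])
```

Blendshape coefficients (rather than raw mesh vertices) are what make such a pipeline artist-friendly, since standard animation tools can consume them directly.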
Related papers
- MMHead: Towards Fine-grained Multi-modal 3D Facial Animation [68.04052669266174]
We construct a large-scale multi-modal 3D facial animation dataset, MMHead.
MMHead consists of 49 hours of 3D facial motion sequences, speech audio, and rich hierarchical text annotations.
Based on the MMHead dataset, we establish benchmarks for two new tasks: text-induced 3D talking head animation and text-to-3D facial motion generation.
arXiv Detail & Related papers (2024-10-10T09:37:01Z)
- SAiD: Speech-driven Blendshape Facial Animation with Diffusion [6.4271091365094515]
Speech-driven 3D facial animation is challenging due to the scarcity of large-scale visual-audio datasets.
We propose SAiD, a speech-driven 3D facial animation approach based on a diffusion model: a lightweight Transformer-based U-Net with a cross-modality alignment bias between the audio and visual modalities to enhance lip synchronization.
arXiv Detail & Related papers (2023-12-25T04:40:32Z)
- GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z)
- DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models [24.401443462720135]
We propose DiffPoseTalk, a generative framework based on the diffusion model combined with a style encoder.
In particular, our style includes the generation of head poses, thereby enhancing user perception.
We address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset.
arXiv Detail & Related papers (2023-09-30T17:01:18Z)
- SelfTalk: A Self-Supervised Commutative Training Diagram to Comprehend 3D Talking Faces [28.40393487247833]
Speech-driven 3D face animation is a technique that is extending its applications to various multimedia fields.
Previous research has generated promising realistic lip movements and facial expressions from audio signals.
We propose a novel framework, SelfTalk, which incorporates self-supervision into a cross-modal network system to learn 3D talking faces.
arXiv Detail & Related papers (2023-06-19T09:39:10Z)
- FaceFormer: Speech-Driven 3D Facial Animation with Transformers [46.8780140220063]
Speech-driven 3D facial animation is challenging due to the complex geometry of human faces and the limited availability of 3D audio-visual data.
We propose a Transformer-based autoregressive model, FaceFormer, which encodes the long-term audio context and autoregressively predicts a sequence of animated 3D face meshes.
arXiv Detail & Related papers (2021-12-10T04:21:59Z)
- Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model [96.24038430433885]
We propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face.
Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction works.
arXiv Detail & Related papers (2021-03-29T09:09:39Z)
- Learning Speech-driven 3D Conversational Gestures from Video [106.15628979352738]
We propose the first approach to automatically and jointly synthesize synchronous 3D conversational body and hand gestures.
Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures.
We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people.
arXiv Detail & Related papers (2021-02-13T01:05:39Z)
- Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)