CorrTalk: Correlation Between Hierarchical Speech and Facial Activity
Variances for 3D Animation
- URL: http://arxiv.org/abs/2310.11295v1
- Date: Tue, 17 Oct 2023 14:16:42 GMT
- Authors: Zhaojie Chu, Kailing Guo, Xiaofen Xing, Yilin Lan, Bolun Cai, and
Xiangmin Xu
- Abstract summary: Speech-driven 3D facial animation is a challenging cross-modal task that has attracted growing research interest.
Existing approaches often simplify the process by directly mapping single-level speech features to the entire facial animation.
We propose a novel framework, CorrTalk, which effectively establishes the temporal correlation between hierarchical speech features and facial activities.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech-driven 3D facial animation is a challenging cross-modal task that has
attracted growing research interest. During speaking activities, the mouth
displays strong motions, while the other facial regions typically demonstrate
comparatively weak activity levels. Existing approaches often simplify the
process by directly mapping single-level speech features to the entire facial
animation, which overlooks the differences in facial activity intensity, leading
to overly smoothed facial movements. In this study, we propose a novel
framework, CorrTalk, which effectively establishes the temporal correlation
between hierarchical speech features and facial activities of different
intensities across distinct regions. A novel facial activity intensity metric
is defined to distinguish between strong and weak facial activity, obtained by
computing the short-time Fourier transform of facial vertex displacements.
Based on the variances in facial activity, we propose a dual-branch decoding
framework to synchronously synthesize strong and weak facial activity, which
guarantees facial animation synthesis across a wider range of intensities. Furthermore, a weighted
hierarchical feature encoder is proposed to establish temporal correlation
between hierarchical speech features and facial activity at different
intensities, which ensures lip-sync and plausible facial expressions. Extensive
qualitative and quantitative experiments, as well as a user study, indicate
that our CorrTalk outperforms existing state-of-the-art methods. The source
code and supplementary video are publicly available at:
https://zjchu.github.io/projects/CorrTalk/
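As a concrete illustration of the intensity metric described above, the sketch below computes a per-vertex activity intensity from the short-time Fourier transform (STFT) of frame-to-frame vertex displacements and then splits vertices into strong- and weak-activity groups. This is a minimal, hypothetical example only: the window length, the spectral-energy aggregation, and the quantile threshold are illustrative assumptions and are not taken from the paper.

import numpy as np
from scipy.signal import stft

def facial_activity_intensity(vertices, fps=30.0, nperseg=16):
    """Per-vertex activity intensity from the STFT of frame-to-frame displacements.

    vertices: float array of shape (T, V, 3) -- T frames of V mesh vertices.
    Returns an array of shape (V,) with one intensity value per vertex.
    """
    # Frame-to-frame displacement magnitude of every vertex: shape (T-1, V).
    disp = np.linalg.norm(np.diff(vertices, axis=0), axis=-1)
    # STFT along the time axis; Z has shape (freq_bins, V, segments).
    _, _, Z = stft(disp, fs=fps, nperseg=nperseg, axis=0)
    # Aggregate spectral energy over frequencies and segments into one scalar per vertex.
    return np.mean(np.abs(Z) ** 2, axis=(0, 2))

def split_strong_weak(intensity, quantile=0.7):
    """Split vertex indices into strong- and weak-activity groups by a quantile threshold."""
    threshold = np.quantile(intensity, quantile)
    strong = np.where(intensity >= threshold)[0]
    weak = np.where(intensity < threshold)[0]
    return strong, weak

# Example usage with random motion for a 100-frame, 500-vertex mesh.
motion = np.random.randn(100, 500, 3).astype(np.float32) * 0.01
strong_idx, weak_idx = split_strong_weak(facial_activity_intensity(motion))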
Related papers
- KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding [19.15471840100407]
We present a novel approach for synthesizing 3D facial motions from audio sequences using key motion embeddings.
Our method integrates linguistic and data-driven priors through two modules: the linguistic-based key motion acquisition and the cross-modal motion completion.
The latter extends key motions into a full sequence of 3D talking faces guided by audio features, improving temporal coherence and audio-visual consistency.
arXiv Detail & Related papers (2024-09-02T09:41:24Z)
- High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish a less ambiguous mapping from audio to the landmark motion of the lips and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z)
- RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [48.95833484103569]
RealTalk consists of an audio-to-expression transformer and a high-fidelity expression-to-face framework.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z)
- FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio clip.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z)
- DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation [10.73030153404956]
We propose a cross-modal dual-learning framework, termed DualTalker, to improve data usage efficiency.
The framework is trained jointly with the primary task (audio-driven facial animation) and its dual task (lip reading) and shares common audio/motion encoder components.
Our approach outperforms current state-of-the-art methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-11-08T15:39:56Z)
- DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion [68.85904927374165]
We propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis method.
It captures the complex one-to-many relationships between speech and 3D face based on diffusion.
At the same time, it achieves more realistic facial animation than state-of-the-art methods.
arXiv Detail & Related papers (2023-08-23T04:14:55Z)
- Speech-Driven 3D Face Animation with Composite and Regional Facial Movements [30.348768852726295]
Speech-driven 3D face animation poses significant challenges due to the intricacy and variability inherent in human facial movements.
This paper emphasizes the importance of considering both the composite and regional natures of facial movements in speech-driven 3D face animation.
arXiv Detail & Related papers (2023-08-10T08:42:20Z)
- Audio-Driven Talking Face Generation with Diverse yet Realistic Facial Animations [61.65012981435094]
DIRFA is a novel method that can generate talking faces with diverse yet realistic facial animations from the same driving audio.
To accommodate fair variation of plausible facial animations for the same audio, we design a transformer-based probabilistic mapping network.
We show that DIRFA can generate talking faces with realistic facial animations effectively.
arXiv Detail & Related papers (2023-04-18T12:36:15Z)
- Pose-Controllable 3D Facial Animation Synthesis using Hierarchical Audio-Vertex Attention [52.63080543011595]
A novel pose-controllable 3D facial animation synthesis method is proposed by utilizing hierarchical audio-vertex attention.
The proposed method can produce more realistic facial expressions and head posture movements.
arXiv Detail & Related papers (2023-02-24T09:36:31Z)
- CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior [27.989344587876964]
Speech-driven 3D facial animation has been widely studied, yet there is still a gap to achieving realism and vividness.
We propose to cast speech-driven facial animation as a code query task in a finite proxy space of the learned codebook.
We demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-01-06T05:04:32Z)
- MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement [142.9900055577252]
We propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face.
Our approach ensures highly accurate lip motion, while also producing plausible animation of the parts of the face that are uncorrelated with the audio signal, such as eye blinks and eyebrow motion.
arXiv Detail & Related papers (2021-04-16T17:05:40Z)