Related papers: KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding

URL: http://arxiv.org/abs/2409.01113v1
Date: Mon, 2 Sep 2024 09:41:24 GMT
Title: KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding
Authors: Zhihao Xu, Shengjie Gong, Jiapeng Tang, Lingyu Liang, Yining Huang, Haojie Li, Shuangping Huang
Abstract summary: We present a novel approach for synthesizing 3D facial motions from audio sequences using key motion embeddings. Our method integrates linguistic and data-driven priors through two modules: the linguistic-based key motion acquisition and the cross-modal motion completion. The latter extends key motions into a full sequence of 3D talking faces guided by audio features, improving temporal coherence and audio-visual consistency.
Score: 19.15471840100407
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present a novel approach for synthesizing 3D facial motions from audio sequences using key motion embeddings. Despite recent advancements in data-driven techniques, accurately mapping between audio signals and 3D facial meshes remains challenging. Direct regression of the entire sequence often leads to over-smoothed results due to the ill-posed nature of the problem. To this end, we propose a progressive learning mechanism that generates 3D facial animations by introducing key motion capture to decrease cross-modal mapping uncertainty and learning complexity. Concretely, our method integrates linguistic and data-driven priors through two modules: the linguistic-based key motion acquisition and the cross-modal motion completion. The former identifies key motions and learns the associated 3D facial expressions, ensuring accurate lip-speech synchronization. The latter extends key motions into a full sequence of 3D talking faces guided by audio features, improving temporal coherence and audio-visual consistency. Extensive experimental comparisons against existing state-of-the-art methods demonstrate the superiority of our approach in generating more vivid and consistent talking face animations. Consistent enhancements in results through the integration of our proposed learning scheme with existing methods underscore the efficacy of our approach. Our code and weights will be available at the project website: https://github.com/ffxzh/KMTalk.

Related papers

StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model [73.30619724574642]
Speech-driven 3D facial animation aims to generate realistic and synchronized facial motions driven by speech inputs. Recent methods have employed audio-conditioned diffusion models for 3D facial animation. We propose a novel autoregressive diffusion model that processes audio in a streaming manner.
arXiv Detail & Related papers (2025-11-18T07:55:16Z)
Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering [53.2204901422631]
Text2Lip is a viseme-centric framework that constructs an interpretable phonetic-visual bridge. We show that Text2Lip outperforms existing approaches in semantic fidelity, visual realism, and modality robustness.
arXiv Detail & Related papers (2025-08-04T12:50:22Z)
OT-Talk: Animating 3D Talking Head with Optimal Transportation [20.023346831300373]
OT-Talk is the first approach to leverage optimal transportation to optimize the learning model in talking head animation. Building on existing learning frameworks, we utilize a pre-trained Hubert model to extract audio features and a transformer model to process temporal sequences. Our experiments on two public audio-mesh datasets demonstrate that our method outperforms state-of-the-art techniques.
arXiv Detail & Related papers (2025-05-03T21:49:23Z)
3DFacePolicy: Audio-Driven 3D Facial Animation Based on Action Control [2.3767676641636584]
We propose 3DFacePolicy to generate natural and continuous facial movements. Our approach significantly outperforms state-of-the-art methods. It is particularly good at producing dynamic, expressive, and naturally smooth facial animations.
arXiv Detail & Related papers (2024-09-17T02:30:34Z)
High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation. We first establish the less ambiguous mapping from audio to landmark motion of lip and jaw. Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z)
GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer [26.567649613966974]
GLDiTalker is a speech-driven 3D facial animation model based on a Graph Latent Transformer. It resolves misalignment by diffusing signals within a quantized temporal latent space. It employs a two-stage training pipeline: the Graph-Enhanced Space Quantized Learning Stage ensures lip-sync accuracy, and the Space-Time Powered Latent Diffusion Stage enhances motion diversity.
arXiv Detail & Related papers (2024-08-03T17:18:26Z)
SAiD: Speech-driven Blendshape Facial Animation with Diffusion [6.4271091365094515]
Speech-driven 3D facial animation is challenging due to the scarcity of large-scale visual-audio datasets. We propose SAiD, a speech-driven 3D facial animation method with a diffusion model: a lightweight Transformer-based U-Net with a cross-modality alignment bias between audio and visual features to enhance lip synchronization.
arXiv Detail & Related papers (2023-12-25T04:40:32Z)
FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models [85.16273912625022]
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from audio signals. To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of human heads.
arXiv Detail & Related papers (2023-12-13T19:01:07Z)
DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation [10.73030153404956]
We propose a cross-modal dual-learning framework, termed DualTalker, to improve data usage efficiency. The framework is trained jointly with the primary task (audio-driven facial animation) and its dual task (lip reading) and shares common audio/motion encoder components. Our approach outperforms current state-of-the-art methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-11-08T15:39:56Z)
DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion [68.85904927374165]
We propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis method. It captures the complex one-to-many relationships between speech and 3D faces based on diffusion. It also achieves more realistic facial animation than state-of-the-art methods.
arXiv Detail & Related papers (2023-08-23T04:14:55Z)
Pose-Controllable 3D Facial Animation Synthesis using Hierarchical Audio-Vertex Attention [52.63080543011595]
A novel pose-controllable 3D facial animation synthesis method is proposed by utilizing hierarchical audio-vertex attention. The proposed method can produce more realistic facial expressions and head posture movements.
arXiv Detail & Related papers (2023-02-24T09:36:31Z)
CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior [27.989344587876964]
Speech-driven 3D facial animation has been widely studied, yet there is still a gap to achieving realism and vividness. We propose to cast speech-driven facial animation as a code query task in a finite proxy space of the learned codebook. We demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-01-06T05:04:32Z)
FaceFormer: Speech-Driven 3D Facial Animation with Transformers [46.8780140220063]
Speech-driven 3D facial animation is challenging due to the complex geometry of human faces and the limited availability of 3D audio-visual data. We propose a Transformer-based autoregressive model, FaceFormer, which encodes the long-term audio context and autoregressively predicts a sequence of animated 3D face meshes.
arXiv Detail & Related papers (2021-12-10T04:21:59Z)
Learning Speech-driven 3D Conversational Gestures from Video [106.15628979352738]
We propose the first approach to automatically and jointly synthesize both the synchronous 3D conversational body and hand gestures. Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures. We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people.
arXiv Detail & Related papers (2021-02-13T01:05:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.