KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation
- URL: http://arxiv.org/abs/2503.01715v2
- Date: Wed, 19 Mar 2025 12:10:34 GMT
- Title: KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation
- Authors: Antoni Bigata, Michał Stypułkowski, Rodrigo Mira, Stella Bounareli, Konstantinos Vougioukas, Zoe Landgraf, Nikita Drobyshev, Maciej Zieba, Stavros Petridis, Maja Pantic
- Abstract summary: KeyFace is a novel two-stage diffusion-based framework for facial animation. In the first stage, keyframes are generated at a low frame rate; in the second, an interpolation model fills in the gaps between keyframes, ensuring smooth transitions and temporal coherence. To further enhance realism, the method incorporates continuous emotion representations and handles a wide range of non-speech vocalizations (NSVs). Experimental results show that KeyFace outperforms state-of-the-art methods in generating natural, coherent facial animations over extended durations.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current audio-driven facial animation methods achieve impressive results for short videos but suffer from error accumulation and identity drift when extended to longer durations. Existing methods attempt to mitigate this through external spatial control, increasing long-term consistency but compromising the naturalness of motion. We propose KeyFace, a novel two-stage diffusion-based framework, to address these issues. In the first stage, keyframes are generated at a low frame rate, conditioned on audio input and an identity frame, to capture essential facial expressions and movements over extended periods of time. In the second stage, an interpolation model fills in the gaps between keyframes, ensuring smooth transitions and temporal coherence. To further enhance realism, we incorporate continuous emotion representations and handle a wide range of non-speech vocalizations (NSVs), such as laughter and sighs. We also introduce two new evaluation metrics for assessing lip synchronization and NSV generation. Experimental results show that KeyFace outperforms state-of-the-art methods in generating natural, coherent facial animations over extended durations, successfully encompassing NSVs and continuous emotions.
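The two-stage pipeline described above can be sketched minimally as follows. This is an illustrative stand-in, not the paper's implementation: both stages in KeyFace are diffusion models, whereas here mean-pooling stands in for stage-1 keyframe generation and linear interpolation stands in for the stage-2 interpolation model, and all function names, feature shapes, and frame rates are hypothetical.

```python
import numpy as np

def generate_keyframes(audio_features: np.ndarray, key_fps: int, out_fps: int) -> np.ndarray:
    """Stage 1 stand-in: one motion vector per keyframe at a low frame rate.

    In KeyFace this is a diffusion model conditioned on audio and an
    identity frame; here we simply pool audio features per keyframe window.
    """
    stride = out_fps // key_fps  # output frames covered by each keyframe
    n_keys = len(audio_features) // stride
    return np.array([audio_features[i * stride:(i + 1) * stride].mean(axis=0)
                     for i in range(n_keys)])

def interpolate_frames(keyframes: np.ndarray, stride: int) -> np.ndarray:
    """Stage 2 stand-in: fill the gaps between consecutive keyframes.

    KeyFace uses a second diffusion model for this step; linear
    interpolation only illustrates the temporal structure of the pipeline.
    """
    out = []
    for a, b in zip(keyframes[:-1], keyframes[1:]):
        for t in np.linspace(0.0, 1.0, stride, endpoint=False):
            out.append((1 - t) * a + t * b)
    out.append(keyframes[-1])
    return np.array(out)

# 2 seconds of per-frame audio features at 25 fps, keyframes at 5 fps
audio = np.random.randn(50, 16)
keys = generate_keyframes(audio, key_fps=5, out_fps=25)  # 10 keyframes
frames = interpolate_frames(keys, stride=5)              # dense motion
print(keys.shape, frames.shape)  # (10, 16) (46, 16)
```

Generating keyframes sparsely before densifying is what lets the approach cover long durations without per-frame error accumulation: errors in stage 2 are bounded between anchoring keyframes.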
Related papers
- MIBURI: Towards Expressive Interactive Gesture Synthesis [62.45332399212876]
Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Existing solutions for ECAs produce rigid, low-diversity motions that are unsuitable for human-like interaction. We present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue.
arXiv Detail & Related papers (2026-03-03T18:59:51Z) - StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model [73.30619724574642]
Speech-driven 3D facial animation aims to generate realistic and synchronized facial motions driven by speech inputs. Recent methods have employed audio-conditioned diffusion models for 3D facial animation. We propose a novel autoregressive diffusion model that processes audio in a streaming manner.
arXiv Detail & Related papers (2025-11-18T07:55:16Z) - ConsistTalk: Intensity Controllable Temporally Consistent Talking Head Generation with Diffusion Noise Search [8.993664585683055]
We introduce ConsistTalk, a novel intensity-controllable talking head generation framework with diffusion noise search inference. First, we propose an Optical Flow-guided Temporal module (OFT) that decouples motion features from static appearance. Second, we present an Audio-to-Intensity (A2I) model obtained through multimodal teacher-student knowledge distillation.
arXiv Detail & Related papers (2025-11-10T08:28:13Z) - Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation [75.71558917038838]
Lookahead Anchoring prevents identity drift during temporal autoregressive generation. It transforms keyframes from fixed boundaries into directional beacons. It also enables self-keyframing, where the reference image serves as the lookahead target.
arXiv Detail & Related papers (2025-10-27T17:50:19Z) - Audio Driven Real-Time Facial Animation for Social Telepresence [65.66220599734338]
We present an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency. Central to our approach is an encoder model that transforms audio signals into latent facial expression sequences in real time. We capture the rich spectrum of facial expressions necessary for natural communication while achieving real-time performance.
arXiv Detail & Related papers (2025-10-01T17:57:05Z) - StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing [63.72095377128904]
The visual dubbing task aims to generate mouth movements synchronized with the driving audio. Audio-only driving paradigms inadequately capture speaker-specific lip habits. Blind-inpainting approaches produce visual artifacts when handling obstructions.
arXiv Detail & Related papers (2025-09-26T05:23:31Z) - KSDiff: Keyframe-Augmented Speech-Aware Dual-Path Diffusion for Facial Animation [4.952724424448834]
KSDiff is a Keyframe-Augmented Speech-Aware Dual-Path Diffusion framework. It disentangles expression-related and head-pose-related features, while an autoregressive Keyframe Establishment Learning module predicts the most salient motion frames. Experiments on HDTF and VoxCeleb demonstrate that KSDiff achieves state-of-the-art performance, with improvements in both lip synchronization accuracy and head-pose naturalness.
arXiv Detail & Related papers (2025-09-24T13:54:52Z) - X-Actor: Emotional and Expressive Long-Range Portrait Acting from Audio [27.619816538121327]
X-Actor generates lifelike, emotionally expressive talking head videos from a single reference image and an input audio clip. By operating in a compact facial motion latent space decoupled from visual and identity cues, our autoregressive diffusion model effectively captures long-range correlations between audio and facial dynamics. X-Actor produces compelling, cinematic-style performances that go beyond standard talking head animations.
arXiv Detail & Related papers (2025-08-04T22:57:01Z) - When Less Is More: A Sparse Facial Motion Structure For Listening Motion Learning [1.2974519529978974]
This study proposes a novel method for representing and predicting non-verbal facial motion by encoding long sequences into a sparse sequence of listenings and transition frames.
By identifying crucial motion steps and interpolating intermediate frames, our method preserves the temporal structure of motion while enhancing instance-wise diversity during the learning process.
arXiv Detail & Related papers (2025-04-08T07:25:12Z) - Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model [64.11605839142348]
We introduce the Motion-priors Conditional Diffusion Model (MCDM), which utilizes both archived and current clip motion priors to enhance motion prediction and ensure temporal consistency. We also release the TalkingFace-Wild dataset, a multilingual collection of over 200 hours of footage across 10 languages.
arXiv Detail & Related papers (2025-02-13T17:50:23Z) - EMO2: End-Effector Guided Audio-Driven Avatar Video Generation [17.816939983301474]
We propose a novel audio-driven talking head method capable of simultaneously generating highly expressive facial expressions and hand gestures.
In the first stage, we generate hand poses directly from audio input, leveraging the strong correlation between audio signals and hand movements.
In the second stage, we employ a diffusion model to synthesize video frames, incorporating the hand poses generated in the first stage to produce realistic facial expressions and body movements.
arXiv Detail & Related papers (2025-01-18T07:51:29Z) - MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation [55.95148886437854]
Memory-guided EMOtion-aware diffusion (MEMO) is an end-to-end audio-driven portrait animation approach to generate talking videos. MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
arXiv Detail & Related papers (2024-12-05T18:57:26Z) - KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding [19.15471840100407]
We present a novel approach for synthesizing 3D facial motions from audio sequences using key motion embeddings.
Our method integrates linguistic and data-driven priors through two modules: the linguistic-based key motion acquisition and the cross-modal motion completion.
The latter extends key motions into a full sequence of 3D talking faces guided by audio features, improving temporal coherence and audio-visual consistency.
arXiv Detail & Related papers (2024-09-02T09:41:24Z) - High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish the less ambiguous mapping from audio to landmark motion of lip and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z) - GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer [26.567649613966974]
GLDiTalker is a speech-driven 3D facial animation model based on a Graph Latent Diffusion Transformer. It resolves audio-motion misalignment by diffusing signals within a quantized spatiotemporal latent space. It employs a two-stage training pipeline: the Graph-Enhanced Space Quantized Learning Stage ensures lip-sync accuracy, and the Space-Time Powered Latent Diffusion Stage enhances motion diversity.
arXiv Detail & Related papers (2024-08-03T17:18:26Z) - FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z) - GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation [71.73912454164834]
A modern talking face generation method is expected to achieve the goals of generalized audio-lip synchronization, good video quality, and high system efficiency.
NeRF has become a popular technique in this field since it could achieve high-fidelity and 3D-consistent talking face generation with a few-minute-long training video.
We propose GeneFace++ to handle these challenges by utilizing the pitch contour as an auxiliary feature and introducing a temporal loss in the facial motion prediction process.
arXiv Detail & Related papers (2023-05-01T12:24:09Z) - Pose-Controllable 3D Facial Animation Synthesis using Hierarchical Audio-Vertex Attention [52.63080543011595]
A novel pose-controllable 3D facial animation synthesis method is proposed by utilizing hierarchical audio-vertex attention.
The proposed method can produce more realistic facial expressions and head posture movements.
arXiv Detail & Related papers (2023-02-24T09:36:31Z) - CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior [27.989344587876964]
Speech-driven 3D facial animation has been widely studied, yet there is still a gap to achieving realism and vividness.
We propose to cast speech-driven facial animation as a code query task in a finite proxy space of the learned codebook.
We demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-01-06T05:04:32Z) - Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos [128.70585652795637]
Temporal emotion localization (TEL) presents three unique challenges compared to temporal action localization.
The emotions have extremely varied temporal dynamics.
The fine-grained temporal annotations are complicated and labor-intensive.
arXiv Detail & Related papers (2022-08-03T10:00:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.