KSDiff: Keyframe-Augmented Speech-Aware Dual-Path Diffusion for Facial Animation
- URL: http://arxiv.org/abs/2509.20128v1
- Date: Wed, 24 Sep 2025 13:54:52 GMT
- Title: KSDiff: Keyframe-Augmented Speech-Aware Dual-Path Diffusion for Facial Animation
- Authors: Tianle Lyu, Junchuan Zhao, Ye Wang,
- Abstract summary: KSDiff is a Keyframe-Augmented Speech-Aware Dual-Path Diffusion framework.<n>It disentangles expression-related and head-pose-related features, while an autoregressive Keyframe Establishment Learning module predicts the most salient motion frames.<n>Experiments on HDTF and VoxCeleb demonstrate that KSDiff state-of-the-art performance, with improvements in both lip synchronization accuracy and head-pose naturalness.
- Score: 4.952724424448834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-driven facial animation has made significant progress in multimedia applications, with diffusion models showing strong potential for talking-face synthesis. However, most existing works treat speech features as a monolithic representation and fail to capture their fine-grained roles in driving different facial motions, while also overlooking the importance of modeling keyframes with intense dynamics. To address these limitations, we propose KSDiff, a Keyframe-Augmented Speech-Aware Dual-Path Diffusion framework. Specifically, the raw audio and transcript are processed by a Dual-Path Speech Encoder (DPSE) to disentangle expression-related and head-pose-related features, while an autoregressive Keyframe Establishment Learning (KEL) module predicts the most salient motion frames. These components are integrated into a Dual-path Motion generator to synthesize coherent and realistic facial motions. Extensive experiments on HDTF and VoxCeleb demonstrate that KSDiff achieves state-of-the-art performance, with improvements in both lip synchronization accuracy and head-pose naturalness. Our results highlight the effectiveness of combining speech disentanglement with keyframe-aware diffusion for talking-head generation.
Related papers
- MIBURI: Towards Expressive Interactive Gesture Synthesis [62.45332399212876]
Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions.<n>Existing solutions for ECAs produce rigid, low-diversity motions that are unsuitable for human-like interaction.<n>We present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue.
arXiv Detail & Related papers (2026-03-03T18:59:51Z) - StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing [63.72095377128904]
The visual dubbing task aims to generate mouth movements synchronized with the driving audio.<n>Audio-only driving paradigms inadequately capture speaker-specific lip habits.<n>Blind-inpainting approaches produce visual artifacts when handling obstructions.
arXiv Detail & Related papers (2025-09-26T05:23:31Z) - Talking Head Generation via AU-Guided Landmark Prediction [48.30051606459973]
We propose a two-stage framework for audio-driven talking head generation with fine-grained expression control via facial Action Units (AUs)<n>In the first stage, a variational motion generator predicts temporally coherent landmark sequences from audio and AU intensities.<n>In the second stage, a diffusion-based synthesizer generates realistic, lip-synced videos conditioned on these landmarks and a reference image.
arXiv Detail & Related papers (2025-09-24T04:01:57Z) - HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis [55.92704600574577]
HM-Talker is a novel framework for generating high-fidelity, temporally coherent talking heads.<n>Explicit cues use Action Units (AUs), anatomically defined facial muscle movements, alongside implicit features to minimize phoneme-viseme misalignment.
arXiv Detail & Related papers (2025-08-14T12:01:52Z) - M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation [65.48046909056468]
We reformulate talking head generation into a unified framework comprising video preprocessing, motion representation, and rendering reconstruction.<n>M2DAO-Talker achieves state-of-the-art performance, with the 2.43 dB PSNR improvement in generation quality and 0.64 gain in user-evaluated video realness.
arXiv Detail & Related papers (2025-07-11T04:48:12Z) - GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with Rhythmic Poses and Realistic Expression [33.886734972316326]
GoHD is a framework designed to produce highly realistic, expressive, and controllable portrait videos from any reference identity with any motion.<n>An animation module utilizing latent navigation is introduced to improve the generalization ability across unseen input styles.<n>A conformer-structured conditional diffusion model is designed to guarantee head poses that are aware of prosody.<n>A two-stage training strategy is devised to decouple frequent and frame-wise lip motion distillation from the generation of other more temporally dependent but less audio-related motions.
arXiv Detail & Related papers (2024-12-12T14:12:07Z) - Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation [22.159117464397806]
We introduce a two-stage diffusion-based model for talking head generation.
The first stage involves generating synchronized facial landmarks based on the given speech.
In the second stage, these generated landmarks serve as a condition in the denoising process, aiming to optimize mouth jitter issues and generate high-fidelity, well-synchronized, and temporally coherent talking head videos.
arXiv Detail & Related papers (2024-08-03T10:19:38Z) - Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation [29.87407471246318]
This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations.
Our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module.
The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities.
arXiv Detail & Related papers (2024-06-13T04:33:20Z) - Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model [17.98911328064481]
Co-speech gestures can achieve superior visual effects in human-machine interaction.
We present a novel motion-decoupled framework to generate co-speech gesture videos.
Our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations.
arXiv Detail & Related papers (2024-04-02T11:40:34Z) - FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z) - DiffSpeaker: Speech-Driven 3D Facial Animation with Diffusion
Transformer [110.32147183360843]
Speech-driven 3D facial animation is important for many multimedia applications.
Recent work has shown promise in using either Diffusion models or Transformer architectures for this task.
We present DiffSpeaker, a Transformer-based network equipped with novel biased conditional attention modules.
arXiv Detail & Related papers (2024-02-08T14:39:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.