Related papers: KSDiff: Keyframe-Augmented Speech-Aware Dual-Path Diffusion for Facial Animation

URL: http://arxiv.org/abs/2509.20128v1
Date: Wed, 24 Sep 2025 13:54:52 GMT
Title: KSDiff: Keyframe-Augmented Speech-Aware Dual-Path Diffusion for Facial Animation
Authors: Tianle Lyu, Junchuan Zhao, Ye Wang
Abstract summary: KSDiff is a Keyframe-Augmented Speech-Aware Dual-Path Diffusion framework. It disentangles expression-related and head-pose-related features, while an autoregressive Keyframe Establishment Learning module predicts the most salient motion frames. Experiments on HDTF and VoxCeleb demonstrate that KSDiff achieves state-of-the-art performance, with improvements in both lip synchronization accuracy and head-pose naturalness.
Score: 4.952724424448834
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Audio-driven facial animation has made significant progress in multimedia applications, with diffusion models showing strong potential for talking-face synthesis. However, most existing works treat speech features as a monolithic representation and fail to capture their fine-grained roles in driving different facial motions, while also overlooking the importance of modeling keyframes with intense dynamics. To address these limitations, we propose KSDiff, a Keyframe-Augmented Speech-Aware Dual-Path Diffusion framework. Specifically, the raw audio and transcript are processed by a Dual-Path Speech Encoder (DPSE) to disentangle expression-related and head-pose-related features, while an autoregressive Keyframe Establishment Learning (KEL) module predicts the most salient motion frames. These components are integrated into a Dual-path Motion generator to synthesize coherent and realistic facial motions. Extensive experiments on HDTF and VoxCeleb demonstrate that KSDiff achieves state-of-the-art performance, with improvements in both lip synchronization accuracy and head-pose naturalness. Our results highlight the effectiveness of combining speech disentanglement with keyframe-aware diffusion for talking-head generation.

Related papers

MIBURI: Towards Expressive Interactive Gesture Synthesis [62.45332399212876]
Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Existing solutions for ECAs produce rigid, low-diversity motions that are unsuitable for human-like interaction. We present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue.
arXiv Detail & Related papers (2026-03-03T18:59:51Z)
StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing [63.72095377128904]
The visual dubbing task aims to generate mouth movements synchronized with the driving audio. Audio-only driving paradigms inadequately capture speaker-specific lip habits. Blind-inpainting approaches produce visual artifacts when handling obstructions.
arXiv Detail & Related papers (2025-09-26T05:23:31Z)
Talking Head Generation via AU-Guided Landmark Prediction [48.30051606459973]
We propose a two-stage framework for audio-driven talking head generation with fine-grained expression control via facial Action Units (AUs). In the first stage, a variational motion generator predicts temporally coherent landmark sequences from audio and AU intensities. In the second stage, a diffusion-based synthesizer generates realistic, lip-synced videos conditioned on these landmarks and a reference image.
arXiv Detail & Related papers (2025-09-24T04:01:57Z)
HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis [55.92704600574577]
HM-Talker is a novel framework for generating high-fidelity, temporally coherent talking heads. Explicit cues use Action Units (AUs), anatomically defined facial muscle movements, alongside implicit features to minimize phoneme-viseme misalignment.
arXiv Detail & Related papers (2025-08-14T12:01:52Z)
M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation [65.48046909056468]
We reformulate talking head generation into a unified framework comprising video preprocessing, motion representation, and rendering reconstruction. M2DAO-Talker achieves state-of-the-art performance, with a 2.43 dB PSNR improvement in generation quality and a 0.64 gain in user-evaluated video realness.
arXiv Detail & Related papers (2025-07-11T04:48:12Z)
GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with Rhythmic Poses and Realistic Expression [33.886734972316326]
GoHD is a framework designed to produce highly realistic, expressive, and controllable portrait videos from any reference identity with any motion. An animation module utilizing latent navigation is introduced to improve the generalization ability across unseen input styles. A conformer-structured conditional diffusion model is designed to guarantee head poses that are aware of prosody. A two-stage training strategy is devised to decouple frequent and frame-wise lip motion distillation from the generation of other more temporally dependent but less audio-related motions.
arXiv Detail & Related papers (2024-12-12T14:12:07Z)
Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation [22.159117464397806]
We introduce a two-stage diffusion-based model for talking head generation. The first stage involves generating synchronized facial landmarks based on the given speech. In the second stage, these generated landmarks serve as a condition in the denoising process, aiming to optimize mouth jitter issues and generate high-fidelity, well-synchronized, and temporally coherent talking head videos.
arXiv Detail & Related papers (2024-08-03T10:19:38Z)
Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation [29.87407471246318]
This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations. Our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module. The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities.
arXiv Detail & Related papers (2024-06-13T04:33:20Z)
Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model [17.98911328064481]
Co-speech gestures can achieve superior visual effects in human-machine interaction. We present a novel motion-decoupled framework to generate co-speech gesture videos. Our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations.
arXiv Detail & Related papers (2024-04-02T11:40:34Z)
FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio. Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency. We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z)
DiffSpeaker: Speech-Driven 3D Facial Animation with Diffusion Transformer [110.32147183360843]
Speech-driven 3D facial animation is important for many multimedia applications. Recent work has shown promise in using either Diffusion models or Transformer architectures for this task. We present DiffSpeaker, a Transformer-based network equipped with novel biased conditional attention modules.
arXiv Detail & Related papers (2024-02-08T14:39:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.