Exploring Talking Head Models With Adjacent Frame Prior for Speech-Preserving Facial Expression Manipulation
- URL: http://arxiv.org/abs/2601.12876v1
- Date: Mon, 19 Jan 2026 09:31:24 GMT
- Title: Exploring Talking Head Models With Adjacent Frame Prior for Speech-Preserving Facial Expression Manipulation
- Authors: Zhenxuan Lu, Zhihua Xu, Zhijing Yang, Feng Gao, Yongyi Lu, Keze Wang, Tianshui Chen
- Abstract summary: Speech-Preserving Facial Expression Manipulation (SPFEM) is an innovative technique aimed at altering facial expressions in images and videos. Despite advancements, SPFEM still struggles with accurate lip synchronization due to the complex interplay between facial expressions and mouth shapes. We present a new framework, Talking Head Facial Expression Manipulation (THFEM), which utilizes AD-THG models to generate frames with accurately synchronized lip movements.
- Score: 34.89590516635867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech-Preserving Facial Expression Manipulation (SPFEM) is an innovative technique aimed at altering facial expressions in images and videos while retaining the original mouth movements. Despite advancements, SPFEM still struggles with accurate lip synchronization due to the complex interplay between facial expressions and mouth shapes. Capitalizing on the advanced capabilities of audio-driven talking head generation (AD-THG) models in synthesizing precise lip movements, our research introduces a novel integration of these models with SPFEM. We present a new framework, Talking Head Facial Expression Manipulation (THFEM), which utilizes AD-THG models to generate frames with accurately synchronized lip movements from audio inputs and SPFEM-altered images. However, increasing the number of frames generated by AD-THG models tends to compromise the realism and expression fidelity of the images. To counter this, we develop an adjacent frame learning strategy that finetunes AD-THG models to predict sequences of consecutive frames. This strategy enables the models to incorporate information from neighboring frames, significantly improving image quality during testing. Our extensive experimental evaluations demonstrate that this framework effectively preserves mouth shapes during expression manipulations, highlighting the substantial benefits of integrating AD-THG with SPFEM.
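The adjacent frame learning strategy described in the abstract can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical illustration of fine-tuning a single-frame audio-driven talking head generator to predict a short window of consecutive frames; all module names, shapes, and the `decoder` are assumptions made for the example, not the authors' actual implementation.

```python
# Hypothetical sketch of the adjacent frame learning idea: a pretrained
# audio-driven talking head (AD-THG) generator is fine-tuned to emit K
# consecutive frames per forward pass instead of one, so neighboring-frame
# information can be exploited at test time. Names and shapes are assumed.

import torch
import torch.nn as nn


class AdjacentFrameHead(nn.Module):
    """Wraps a single-frame generator so it predicts K adjacent frame latents."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_frames: int = 3):
        super().__init__()
        self.backbone = backbone          # pretrained AD-THG generator (assumed)
        self.num_frames = num_frames
        # Lightweight head mapping one latent into K frame latents.
        self.expand = nn.Linear(feat_dim, feat_dim * num_frames)

    def forward(self, audio_feat, ref_image_feat):
        # Backbone fuses the audio features with the SPFEM-altered reference image.
        latent = self.backbone(audio_feat, ref_image_feat)        # (B, feat_dim)
        latents = self.expand(latent)                              # (B, K * feat_dim)
        return latents.view(latent.size(0), self.num_frames, -1)  # (B, K, feat_dim)


def finetune_step(model, decoder, batch, optimizer):
    """One fine-tuning step: reconstruct K adjacent ground-truth frames."""
    audio_feat, ref_feat, target_frames = batch   # target_frames: (B, K, C, H, W)
    frame_latents = model(audio_feat, ref_feat)
    pred_frames = decoder(frame_latents)          # assumed decoder: latents -> images
    loss = nn.functional.l1_loss(pred_frames, target_frames)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Predicting several adjacent frames per pass lets each frame's reconstruction draw on its neighbors, which is the property the abstract credits with improving image quality during testing.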
Related papers
- MAGIC-Talk: Motion-aware Audio-Driven Talking Face Generation with Customizable Identity Control [48.94486508604052]
MAGIC-Talk is a one-shot diffusion-based framework for customizable talking face generation. ReferenceNet preserves identity and enables fine-grained facial editing via text prompts. AnimateNet enhances motion coherence using structured motion priors.
arXiv Detail & Related papers (2025-10-26T19:49:31Z)
- Audio-Driven Universal Gaussian Head Avatars [66.56656075831954]
We introduce the first method for audio-driven universal photorealistic avatar synthesis. It combines a person-agnostic speech model with our novel Universal Head Avatar Prior. Our method is the first general audio-driven avatar model that accounts for detailed appearance modeling and rendering.
arXiv Detail & Related papers (2025-09-23T12:46:43Z)
- PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face Generation [48.94486508604052]
We introduce a novel, customizable one-shot audio-driven talking face generation framework, named PortraitTalk. Our proposed method utilizes a latent diffusion framework consisting of two main components: IdentityNet and AnimateNet. The key innovation of PortraitTalk is the incorporation of text prompts through decoupled cross-attention mechanisms.
arXiv Detail & Related papers (2024-12-10T18:51:31Z)
- High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish a less ambiguous mapping from audio to the landmark motion of the lip and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z)
- Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation [29.87407471246318]
This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations.
Our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module.
The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities.
arXiv Detail & Related papers (2024-06-13T04:33:20Z)
- Controllable Talking Face Generation by Implicit Facial Keypoints Editing [6.036277153327655]
We present ControlTalk, a talking face generation method that controls facial expression deformation based on the driving audio.
Our experiments show that our method is superior to state-of-the-art performance on widely used benchmarks, including HDTF and MEAD.
arXiv Detail & Related papers (2024-06-05T02:54:46Z)
- Realistic Speech-to-Face Generation with Speech-Conditioned Latent Diffusion Model with Face Prior [13.198105709331617]
We propose a novel speech-to-face generation framework, which leverages a Speech-Conditioned Latent Diffusion Model, called SCLDM.
This is the first work to harness the exceptional modeling capabilities of diffusion models for speech-to-face generation.
We show that our method can produce more realistic face images while preserving the identity of the speaker better than state-of-the-art methods.
arXiv Detail & Related papers (2023-10-05T07:44:49Z)
- DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder [55.58582254514431]
We propose DAE-Talker to synthesize full video frames and produce natural head movements that align with the content of speech. We also introduce pose modelling in speech2latent for pose controllability. Our experiments show that DAE-Talker outperforms existing popular methods in lip-sync, video fidelity, and pose naturalness.
arXiv Detail & Related papers (2023-03-30T17:18:31Z)
- Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.