Pose-Controllable Talking Face Generation by Implicitly Modularized
Audio-Visual Representation
- URL: http://arxiv.org/abs/2104.11116v1
- Date: Thu, 22 Apr 2021 15:10:26 GMT
- Title: Pose-Controllable Talking Face Generation by Implicitly Modularized
Audio-Visual Representation
- Authors: Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang,
Ziwei Liu
- Abstract summary: We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
- Score: 96.66010515343106
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While accurate lip synchronization has been achieved for
arbitrary-subject audio-driven talking face generation, how to efficiently
drive head pose remains an open problem. Previous methods rely on
pre-estimated structural information such as landmarks and 3D parameters to
generate personalized rhythmic movements. However, the inaccuracy of such
estimates under extreme conditions degrades the generated results. In this
paper, we propose a clean yet effective framework to generate
pose-controllable talking faces. We operate on raw face images, using only a
single photo as an identity reference. The key is to modularize audio-visual
representations by devising an implicit low-dimension pose code.
Specifically, both speech content and head pose information lie in a joint
non-identity embedding space. While speech content information can be
defined by learning the intrinsic synchronization between the audio and
visual modalities, we show that a complementary pose code can be learned
within a modulated convolution-based reconstruction framework.
Extensive experiments show that our method generates accurately lip-synced
talking faces whose pose can be controlled by other videos. Moreover, our
model has multiple advanced capabilities, including extreme view robustness
and talking face frontalization. Code, models, and demo videos are available
at https://hangz-nju-cuhk.github.io/projects/PC-AVS.
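As a concrete illustration, here is a minimal PyTorch sketch of the
modularization idea described in the abstract. It is not the authors'
released implementation: the module sizes (including the 12-dimensional pose
bottleneck), the encoder architectures, and the exact contrastive loss are
illustrative assumptions. It shows the two mechanisms the abstract names: a
very low-dimensional pose code fused with identity and speech-content
features through a modulated convolution, and a synchronization objective
that ties the speech-content space to the audio.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ModulatedConv2d(nn.Module):
        # StyleGAN2-style modulated convolution: a per-sample latent code
        # scales the kernel's input channels, then the scaled weights are
        # demodulated back to roughly unit norm.
        def __init__(self, in_ch, out_ch, k, latent_dim):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k))
            self.affine = nn.Linear(latent_dim, in_ch)
            self.pad = k // 2

        def forward(self, x, latent):
            b, in_ch, h, w = x.shape
            scale = self.affine(latent).view(b, 1, in_ch, 1, 1)    # modulate
            w_mod = self.weight.unsqueeze(0) * scale               # (b,out,in,k,k)
            demod = torch.rsqrt(w_mod.pow(2).sum(dim=(2, 3, 4)) + 1e-8)
            w_mod = w_mod * demod.view(b, -1, 1, 1, 1)             # demodulate
            w_mod = w_mod.view(-1, in_ch, *self.weight.shape[2:])  # fold batch
            out = F.conv2d(x.view(1, b * in_ch, h, w), w_mod,
                           padding=self.pad, groups=b)             # grouped conv
            return out.view(b, -1, h, w)

    # Hypothetical encoders. The pose encoder is squeezed through a very
    # low-dimensional bottleneck (12 here, an assumption) so identity and
    # speech content cannot leak through it; only coarse head pose survives.
    id_enc      = nn.Sequential(nn.Flatten(), nn.LazyLinear(256))  # reference photo
    content_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(256))  # audio (mel)
    pose_enc    = nn.Sequential(nn.Flatten(), nn.LazyLinear(12))   # target frame

    ref   = torch.randn(2, 3, 64, 64)   # single identity reference photo
    frame = torch.randn(2, 3, 64, 64)   # target frame supplying the pose
    mel   = torch.randn(2, 80, 16)      # mel-spectrogram window per frame

    # Joint non-identity space: speech content plus implicit pose code,
    # concatenated with identity features and injected into a generator block.
    latent = torch.cat([id_enc(ref), content_enc(mel), pose_enc(frame)], dim=1)
    block = ModulatedConv2d(in_ch=3, out_ch=32, k=3, latent_dim=latent.shape[1])
    feat = block(torch.randn(2, 3, 64, 64), latent)  # one reconstruction block
    print(feat.shape)  # torch.Size([2, 32, 64, 64])

    # The content space is pinned down by audio-visual synchronization:
    # matched (audio, lip-motion) pairs should agree, mismatched pairs should
    # not (an InfoNCE-style contrastive objective; the exact loss used by the
    # paper is an assumption here).
    vis_content_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(256))
    a_feat = F.normalize(content_enc(mel), dim=1)
    v_feat = F.normalize(vis_content_enc(frame), dim=1)
    logits = a_feat @ v_feat.t() / 0.07                # cosine similarities
    sync_loss = F.cross_entropy(logits, torch.arange(logits.size(0)))

In a full model one would stack many such blocks into a generator trained
with reconstruction losses; controlling pose from another video then amounts
to swapping which frames are fed to the pose encoder.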
Related papers
- High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish a less ambiguous mapping from audio to the landmark motion of the lips and jaw.
Then, we introduce TalkFormer, a conditioning module that aligns the synthesized motion with the motion represented by the landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z)
- Controllable Talking Face Generation by Implicit Facial Keypoints Editing [6.036277153327655]
We present ControlTalk, a talking face generation method that controls facial expression deformation from the driving audio.
Our experiments show that our method surpasses state-of-the-art performance on widely used benchmarks, including HDTF and MEAD.
arXiv Detail & Related papers (2024-06-05T02:54:46Z)
- FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process by which people hear speech, extract meaningful cues, and create dynamically audio-consistent talking faces, all from a single audio clip.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce Controllable Coherent Frame generation, which flexibly integrates three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z)
- GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z)
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained on a video just a few minutes long and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty generating realistic, lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z)
- DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder [20.814063371439904]
We propose DAE-Talker to synthesize full video frames and produce natural head movements that align with the content of speech.
We also introduce pose modelling in speech2latent for pose controllability.
Our experiments show that DAE-Talker outperforms existing popular methods in lip-sync, video fidelity, and pose naturalness.
arXiv Detail & Related papers (2023-03-30T17:18:31Z)
- Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers [91.00397473678088]
Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets conditioned on audio.
We propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality.
Our model can generate high-fidelity lip-synced results for arbitrary subjects.
arXiv Detail & Related papers (2022-12-09T16:32:46Z)