Related papers: Controllable Talking Face Generation by Implicit Facial Keypoints Editing

URL: http://arxiv.org/abs/2406.02880v2
Date: Thu, 07 Nov 2024 02:26:49 GMT
Title: Controllable Talking Face Generation by Implicit Facial Keypoints Editing
Authors: Dong Zhao, Jiaying Shi, Wenjun Li, Shudong Wang, Shenghui Xu, Zhaoming Pan,
Abstract summary: We present ControlTalk, a talking face generation method that controls facial expression deformation based on the driving audio. Our experiments show that our method outperforms state-of-the-art approaches on widely used benchmarks, including HDTF and MEAD.
Score: 6.036277153327655
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Audio-driven talking face generation has garnered significant interest within the domain of digital human research. Existing methods are encumbered by intricate, tightly interdependent model architectures, complicating the process of re-editing image or video inputs. In this work, we present ControlTalk, a talking face generation method that controls facial expression deformation based on the driving audio and can construct the head pose and facial expression, including lip motion, for both single-image and sequential video inputs in a unified manner. By utilizing a pre-trained video synthesis renderer and proposing a lightweight adaptation, ControlTalk achieves precise and naturalistic lip synchronization while enabling quantitative control over mouth opening shape. Our experiments show that our method outperforms state-of-the-art approaches on widely used benchmarks, including HDTF and MEAD. The parameterized adaptation demonstrates remarkable generalization capabilities, effectively handling expression deformation across same-ID and cross-ID scenarios, and extending its utility to out-of-domain portraits, regardless of language. Code is available at https://github.com/NetEase-Media/ControlTalk.

Related papers

Mask-Free Audio-driven Talking Face Generation for Enhanced Visual Quality and Identity Preservation [54.52905471078152]
We propose a mask-free talking face generation approach while maintaining the 2D-based face editing task. We transform the input images to have closed mouths, using a two-step landmark-based approach trained in an unpaired manner.
arXiv Detail & Related papers (2025-07-28T16:03:36Z)
EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion [49.55774551366049]
Diffusion models have revolutionized the field of talking head generation, yet still face challenges in expressiveness, controllability, and stability in long-time generation. We propose an EmotiveTalk framework to address these issues. Experimental results show that EmotiveTalk can generate expressive talking head videos, ensuring the promised controllability of emotions and stability during long-time generation.
arXiv Detail & Related papers (2024-11-23T04:38:51Z)
Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters. Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z)
FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio. Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency. We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z)
GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model. It can synthesize smooth lip dynamics while preserving the speaker's identity. Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z)
Parametric Implicit Face Representation for Audio-Driven Facial Reenactment [52.33618333954383]
We propose a novel audio-driven facial reenactment framework that is both controllable and can generate high-quality talking heads. Specifically, our parametric implicit representation parameterizes the implicit representation with interpretable parameters of 3D face models. Our method can generate more realistic results than previous methods with greater fidelity to the identities and talking styles of speakers.
arXiv Detail & Related papers (2023-06-13T07:08:22Z)
DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder [20.814063371439904]
We propose DAE-Talker to synthesize full video frames and produce natural head movements that align with the content of speech. We also introduce pose modelling in speech2latent for pose controllability. Our experiments show that DAE-Talker outperforms existing popular methods in lip-sync, video fidelity, and pose naturalness.
arXiv Detail & Related papers (2023-03-30T17:18:31Z)
Continuously Controllable Facial Expression Editing in Talking Face Videos [34.83353695337335]
Speech-related expressions and emotion-related expressions are often highly coupled. Traditional image-to-image translation methods cannot work well in our application. We propose a high-quality facial expression editing method for talking face videos.
arXiv Detail & Related papers (2022-09-17T09:05:47Z)
StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pretrained StyleGAN [49.917296433657484]
One-shot talking face generation aims at synthesizing a high-quality talking face video from an arbitrary portrait image. In this work, we investigate the latent feature space of a pre-trained StyleGAN and discover some excellent spatial transformation properties. We propose a novel unified framework based on a pre-trained StyleGAN that enables a set of powerful functionalities.
arXiv Detail & Related papers (2022-03-08T12:06:12Z)
Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces. We operate on raw face images, using only a single photo as an identity reference. Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.