High-Fidelity and Freely Controllable Talking Head Video Generation
- URL: http://arxiv.org/abs/2304.10168v2
- Date: Thu, 2 Nov 2023 03:07:15 GMT
- Title: High-Fidelity and Freely Controllable Talking Head Video Generation
- Authors: Yue Gao, Yuan Zhou, Jinglu Wang, Xiao Li, Xiang Ming, Yan Lu
- Abstract summary: We propose a novel model that produces high-fidelity talking head videos with free control over head pose and expression.
We introduce a novel motion-aware multi-scale feature alignment module to effectively transfer the motion without face distortion.
We evaluate our model on challenging datasets and demonstrate its state-of-the-art performance.
- Score: 31.08828907637289
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Talking head generation aims to generate a video based on a given source identity
and target motion. However, current methods face several challenges that limit
the quality and controllability of the generated videos. First, the generated
face often has unexpected deformation and severe distortions. Second, the
driving image does not explicitly disentangle movement-relevant information,
such as poses and expressions, which restricts the manipulation of different
attributes during generation. Third, the generated videos tend to have
flickering artifacts due to the inconsistency of the extracted landmarks
between adjacent frames. In this paper, we propose a novel model that produces
high-fidelity talking head videos with free control over head pose and
expression. Our method leverages both self-supervised learned landmarks and 3D
face model-based landmarks to model the motion. We also introduce a novel
motion-aware multi-scale feature alignment module to effectively transfer the
motion without face distortion. Furthermore, we enhance the smoothness of the
synthesized talking head videos with a feature context adaptation and
propagation module. We evaluate our model on challenging datasets and
demonstrate its state-of-the-art performance.
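To make the feature-alignment idea in the abstract concrete, the snippet below is a minimal, hypothetical sketch of warping multi-scale source features with a dense flow field assumed to be predicted from the source and driving landmarks. It is not the authors' implementation; the names `warp_features` and `make_base_grid` are illustrative assumptions.

```python
# Hypothetical sketch of motion-aware multi-scale feature alignment:
# source features at several resolutions are warped by a flow field
# assumed to be estimated from the landmark motion.
import torch
import torch.nn.functional as F

def make_base_grid(h, w, device):
    """Identity sampling grid in [-1, 1], shape (1, h, w, 2)."""
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=device),
        torch.linspace(-1, 1, w, device=device),
        indexing="ij",
    )
    return torch.stack((xs, ys), dim=-1).unsqueeze(0)

def warp_features(feats, flow):
    """Warp each source feature map with a flow field resized to its scale.

    feats: list of (B, C_i, H_i, W_i) source feature maps.
    flow:  (B, 2, H, W) dense offsets in normalized coordinates,
           assumed to be predicted from source/driving landmarks.
    """
    warped = []
    for f in feats:
        b, _, h, w = f.shape
        # Resize the flow to this feature scale and build a sampling grid.
        flow_s = F.interpolate(flow, size=(h, w), mode="bilinear", align_corners=True)
        grid = make_base_grid(h, w, f.device) + flow_s.permute(0, 2, 3, 1)
        warped.append(F.grid_sample(f, grid, align_corners=True))
    return warped

# Toy usage: three scales of source features and one coarse flow field.
feats = [torch.randn(1, 64, s, s) for s in (64, 32, 16)]
flow = 0.05 * torch.randn(1, 2, 64, 64)
aligned = warp_features(feats, flow)
print([t.shape for t in aligned])
```

In a real model the flow would come from a learned motion estimator conditioned on both landmark sets; it is random here purely for the toy example.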
Related papers
- One-Shot Pose-Driving Face Animation Platform [7.422568903818486]
We refine an existing Image2Video model by integrating a Face Locator and Motion Frame mechanism.
We optimize the model using extensive human face video datasets, significantly enhancing its ability to produce high-quality talking head videos.
We develop a demo platform using the Gradio framework, which streamlines the process, enabling users to quickly create customized talking head videos.
arXiv Detail & Related papers (2024-07-12T03:09:07Z)
- MotionCrafter: One-Shot Motion Customization of Diffusion Models [66.44642854791807]
We introduce MotionCrafter, a one-shot instance-guided motion customization method.
MotionCrafter employs a parallel spatial-temporal architecture that injects the reference motion into the temporal component of the base model.
During training, a frozen base model provides appearance normalization, effectively separating appearance from motion.
arXiv Detail & Related papers (2023-12-08T16:31:04Z)
- FAAC: Facial Animation Generation with Anchor Frame and Conditional Control for Superior Fidelity and Editability [14.896554342627551]
We introduce a facial animation generation method that enhances both face identity fidelity and editing capabilities.
This approach incorporates the concept of an anchor frame to counteract the degradation of generative ability in original text-to-image models.
Our method's efficacy has been validated on multiple representative DreamBooth and LoRA models.
arXiv Detail & Related papers (2023-12-06T02:55:35Z)
- Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models [40.71940056121056]
We present a novel approach that combines the controllability of dynamic 3D meshes with the expressivity and editability of emerging diffusion models.
We demonstrate our approach on various examples where motion can be obtained by animating rigged assets or changing the camera path.
arXiv Detail & Related papers (2023-12-03T14:17:11Z)
- VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models.
Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference (a brief sketch of this idea appears after the list below).
We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
arXiv Detail & Related papers (2023-12-01T06:50:11Z)
- Learning Landmarks Motion from Speech for Speaker-Agnostic 3D Talking Heads Generation [9.242997749920498]
This paper presents a novel approach for generating 3D talking heads from raw audio inputs.
Using landmarks in 3D talking head generation offers various advantages such as consistency, reliability, and obviating the need for manual annotation.
arXiv Detail & Related papers (2023-06-02T10:04:57Z)
- Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation [54.68893964373141]
Talking face generation has historically struggled to produce head movements and natural facial expressions without guidance from additional reference videos.
Recent developments in diffusion-based generative models allow for more realistic and stable data synthesis.
We present an autoregressive diffusion model that requires only one identity image and audio sequence to generate a video of a realistic talking human head.
arXiv Detail & Related papers (2023-01-06T14:16:54Z)
- StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pretrained StyleGAN [49.917296433657484]
One-shot talking face generation aims at synthesizing a high-quality talking face video from an arbitrary portrait image.
In this work, we investigate the latent feature space of a pre-trained StyleGAN and discover some excellent spatial transformation properties.
We propose a novel unified framework based on a pre-trained StyleGAN that enables a set of powerful functionalities.
arXiv Detail & Related papers (2022-03-08T12:06:12Z)
- PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering [56.762094966235566]
A Portrait Image Neural Renderer is proposed to control the face motions with the parameters of three-dimensional morphable face models.
The proposed model can generate photo-realistic portrait images with accurate movements according to intuitive modifications.
Our model can generate coherent videos with convincing movements from only a single reference image and a driving audio stream.
arXiv Detail & Related papers (2021-09-17T07:24:16Z)
- Talking-head Generation with Rhythmic Head Motion [46.6897675583319]
We propose a 3D-aware generative network with a hybrid embedding module and a non-linear composition module.
Our approach achieves controllable, photo-realistic, and temporally coherent talking-head videos with natural head movements.
arXiv Detail & Related papers (2020-07-16T18:13:40Z)
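The VMC entry above mentions residual vectors between consecutive frames as a motion reference. The snippet below is a hypothetical, simplified sketch of that idea, not VMC's actual objective; `motion_residuals`, `motion_distillation_loss`, and the cosine-similarity choice are illustrative assumptions.

```python
# Hypothetical sketch: frame-to-frame residuals as a motion reference
# that a tuning objective can align a generated clip against.
import torch
import torch.nn.functional as F

def motion_residuals(frames):
    """Residual vectors for a clip.

    frames: (T, C, H, W) tensor of consecutive (latent) frames.
    Returns (T-1, C, H, W) residuals frames[t+1] - frames[t].
    """
    return frames[1:] - frames[:-1]

def motion_distillation_loss(pred_frames, ref_frames):
    """Align predicted residuals with reference residuals
    (cosine similarity is just one possible choice here)."""
    pred_res = motion_residuals(pred_frames).flatten(1)
    ref_res = motion_residuals(ref_frames).flatten(1)
    cos = F.cosine_similarity(pred_res, ref_res, dim=1)
    return (1.0 - cos).mean()

# Toy usage with random 8-frame clips.
ref = torch.randn(8, 4, 32, 32)
pred = ref + 0.1 * torch.randn_like(ref)
print(motion_distillation_loss(pred, ref).item())
```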
This list is automatically generated from the titles and abstracts of the papers on this site.