High-Fidelity and Freely Controllable Talking Head Video Generation
- URL: http://arxiv.org/abs/2304.10168v2
- Date: Thu, 2 Nov 2023 03:07:15 GMT
- Title: High-Fidelity and Freely Controllable Talking Head Video Generation
- Authors: Yue Gao, Yuan Zhou, Jinglu Wang, Xiao Li, Xiang Ming, Yan Lu
- Abstract summary: We propose a novel model that produces high-fidelity talking head videos with free control over head pose and expression.
We introduce a novel motion-aware multi-scale feature alignment module to effectively transfer the motion without face distortion.
We evaluate our model on challenging datasets and demonstrate its state-of-the-art performance.
- Score: 31.08828907637289
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Talking head generation aims to synthesize a video from a given source
identity and target motion. However, current methods face several challenges that limit
the quality and controllability of the generated videos. First, the generated
face often has unexpected deformation and severe distortions. Second, the
driving image does not explicitly disentangle movement-relevant information,
such as poses and expressions, which restricts the manipulation of different
attributes during generation. Third, the generated videos tend to have
flickering artifacts due to the inconsistency of the extracted landmarks
between adjacent frames. In this paper, we propose a novel model that produces
high-fidelity talking head videos with free control over head pose and
expression. Our method leverages both self-supervised learned landmarks and 3D
face model-based landmarks to model the motion. We also introduce a novel
motion-aware multi-scale feature alignment module to effectively transfer the
motion without face distortion. Furthermore, we enhance the smoothness of the
synthesized talking head videos with a feature context adaptation and
propagation module. We evaluate our model on challenging datasets and
demonstrate its state-of-the-art performance.
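The abstract's motion-aware multi-scale feature alignment can be pictured as warping source feature maps at several resolutions with a correspondingly resized motion field. The sketch below is an illustrative simplification under assumed power-of-two scales, with nearest-neighbour sampling standing in for the paper's learned module; the function names (`warp_features`, `resize_flow`, `multi_scale_align`) are hypothetical, not the authors' API.

```python
import numpy as np

def warp_features(feat, flow):
    """Warp a feature map (C, H, W) by a dense flow field (2, H, W)
    using nearest-neighbour sampling. Hedged stand-in for the paper's
    learned motion-aware alignment."""
    C, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_y = np.clip(np.rint(ys + flow[0]).astype(int), 0, H - 1)
    src_x = np.clip(np.rint(xs + flow[1]).astype(int), 0, W - 1)
    return feat[:, src_y, src_x]

def resize_flow(flow, H, W):
    """Subsample a square flow field to (2, H, W) and rescale its
    magnitudes. Assumes power-of-two pyramid scales."""
    step = flow.shape[1] // H
    return flow[:, ::step, ::step] * (H / flow.shape[1])

def multi_scale_align(feats, flow):
    """Align every pyramid level with a matching-resolution flow."""
    return [warp_features(f, resize_flow(flow, f.shape[1], f.shape[2]))
            for f in feats]
```

A zero flow leaves every scale unchanged, while a constant horizontal flow shifts all levels consistently, which is the property the multi-scale design is meant to preserve.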
Related papers
- VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models [71.9811050853964]
VideoJAM is a novel framework that instills an effective motion prior into video generators.
VideoJAM achieves state-of-the-art performance in motion coherence.
These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation.
arXiv Detail & Related papers (2025-02-04T17:07:10Z)
- Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks [25.39030226963548]
We introduce the first application of a pretrained transformer-based video generative model for portrait animation.
Our method is validated through experiments on benchmark and newly proposed wild datasets.
arXiv Detail & Related papers (2024-12-01T08:54:30Z)
- Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation [15.233839480474206]
Talking head video generation aims to generate a realistic talking head video that preserves the person's identity from a source image and the motion from a driving video.
Despite the promising progress made in the field, it remains a challenging and critical problem to generate videos with accurate poses and fine-grained facial details simultaneously.
We propose to jointly learn motion and appearance codebooks and perform multi-scale codebook compensation to effectively refine both the facial motion conditions and appearance features.
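The codebook compensation described above rests on a standard vector-quantization step: each feature vector is replaced by (or compared against) its nearest codebook entry. The sketch below shows only that generic lookup under assumed shapes; the paper's jointly learned motion/appearance codebooks and multi-scale compensation are more involved, and `codebook_lookup` is a hypothetical name.

```python
import numpy as np

def codebook_lookup(feats, codebook):
    """Map each feature vector (N, D) to its nearest entry in a
    codebook (K, D). Generic VQ step, not the paper's exact module."""
    # Pairwise Euclidean distances between features and codebook entries.
    dists = np.linalg.norm(feats[:, None, :] - codebook[None, :, :], axis=-1)
    idx = np.argmin(dists, axis=1)          # nearest entry per feature
    return codebook[idx], idx
```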
arXiv Detail & Related papers (2024-12-01T07:54:07Z)
- MotionCrafter: One-Shot Motion Customization of Diffusion Models [66.44642854791807]
We introduce MotionCrafter, a one-shot instance-guided motion customization method.
MotionCrafter employs a parallel spatial-temporal architecture that injects the reference motion into the temporal component of the base model.
During training, a frozen base model provides appearance normalization, effectively separating appearance from motion.
arXiv Detail & Related papers (2023-12-08T16:31:04Z)
- Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models [40.71940056121056]
We present a novel approach that combines the controllability of dynamic 3D meshes with the expressivity and editability of emerging diffusion models.
We demonstrate our approach on various examples where motion can be obtained by animating rigged assets or changing the camera path.
arXiv Detail & Related papers (2023-12-03T14:17:11Z)
- VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models.
Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference.
We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
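VMC's summary describes a motion distillation objective built on residual vectors between consecutive frames. A minimal reading of that idea: take frame-to-frame differences as the motion signal, then penalize misalignment between predicted and reference residuals. The loss form below (cosine distance) and the function names are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def residual_vectors(frames):
    """Consecutive-frame residuals (T-1, ...) used as a motion reference."""
    return frames[1:] - frames[:-1]

def motion_distillation_loss(pred_frames, ref_frames, eps=1e-8):
    """Cosine distance between predicted and reference residuals.
    Constant appearance offsets cancel in the differences, so only
    motion is compared. Illustrative loss, not the paper's exact one."""
    rp = residual_vectors(pred_frames).reshape(len(pred_frames) - 1, -1)
    rr = residual_vectors(ref_frames).reshape(len(ref_frames) - 1, -1)
    cos = np.sum(rp * rr, axis=1) / (
        np.linalg.norm(rp, axis=1) * np.linalg.norm(rr, axis=1) + eps)
    return float(np.mean(1.0 - cos))
```

Because residuals subtract out anything constant across frames, a prediction that differs from the reference only by a static appearance shift incurs (near-)zero loss, which is the disentangling property the objective targets.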
arXiv Detail & Related papers (2023-12-01T06:50:11Z)
- Learning Landmarks Motion from Speech for Speaker-Agnostic 3D Talking Heads Generation [9.242997749920498]
This paper presents a novel approach for generating 3D talking heads from raw audio inputs.
Using landmarks in 3D talking head generation offers advantages such as consistency, reliability, and obviating the need for manual annotation.
arXiv Detail & Related papers (2023-06-02T10:04:57Z)
- Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation [54.68893964373141]
Talking face generation has historically struggled to produce head movements and natural facial expressions without guidance from additional reference videos.
Recent developments in diffusion-based generative models allow for more realistic and stable data synthesis.
We present an autoregressive diffusion model that requires only one identity image and audio sequence to generate a video of a realistic talking human head.
arXiv Detail & Related papers (2023-01-06T14:16:54Z)
- PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering [56.762094966235566]
A Portrait Image Neural Renderer is proposed to control face motion with the parameters of three-dimensional morphable face models.
The proposed model can generate photo-realistic portrait images with accurate movements according to intuitive modifications.
Our model can generate coherent videos with convincing movements from only a single reference image and a driving audio stream.
arXiv Detail & Related papers (2021-09-17T07:24:16Z)
- Talking-head Generation with Rhythmic Head Motion [46.6897675583319]
We propose a 3D-aware generative network with a hybrid embedding module and a non-linear composition module.
Our approach achieves controllable, photo-realistic, and temporally coherent talking-head videos with natural head movements.
arXiv Detail & Related papers (2020-07-16T18:13:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences arising from its use.