Navigating Large-Pose Challenge for High-Fidelity Face Reenactment with Video Diffusion Model
- URL: http://arxiv.org/abs/2507.16341v1
- Date: Tue, 22 Jul 2025 08:23:43 GMT
- Title: Navigating Large-Pose Challenge for High-Fidelity Face Reenactment with Video Diffusion Model
- Authors: Mingtao Guo, Guanyu Xing, Yanci Zhang, Yanli Liu
- Abstract summary: Face reenactment aims to generate realistic talking head videos by transferring motion from a driving video to a static source image. We present the Face Reenactment Video Diffusion model (FRVD), a novel framework for high-fidelity face reenactment under large pose changes.
- Score: 6.9344574901598595
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Face reenactment aims to generate realistic talking head videos by transferring motion from a driving video to a static source image while preserving the source identity. Although existing methods based on either implicit or explicit keypoints have shown promise, they struggle with large pose variations due to warping artifacts or the limitations of coarse facial landmarks. In this paper, we present the Face Reenactment Video Diffusion model (FRVD), a novel framework for high-fidelity face reenactment under large pose changes. Our method first employs a motion extractor to extract implicit facial keypoints from the source and driving images to represent fine-grained motion and to perform motion alignment through a warping module. To address the degradation introduced by warping, we introduce a Warping Feature Mapper (WFM) that maps the warped source image into the motion-aware latent space of a pretrained image-to-video (I2V) model. This latent space encodes rich priors of facial dynamics learned from large-scale video data, enabling effective warping correction and enhancing temporal coherence. Extensive experiments show that FRVD achieves superior performance over existing methods in terms of pose accuracy, identity preservation, and visual quality, especially in challenging scenarios with extreme pose variations.
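The pipeline described in the abstract (implicit keypoint extraction, warping-based motion alignment, and a Warping Feature Mapper that projects the warped source into the latent space of a pretrained I2V model) can be pictured with a minimal PyTorch sketch. This is not the authors' implementation: the module layouts, the keypoint count, the toy mean-offset warp, and the 1x1 projection standing in for the I2V latent encoder are all assumptions.

```python
# Minimal sketch of an FRVD-style pipeline (hypothetical; not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionExtractor(nn.Module):
    """Predicts implicit 2D keypoints in [-1, 1] from a face image (assumed layout)."""
    def __init__(self, num_kp: int = 20):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, num_kp * 2)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(img).flatten(1)
        return torch.tanh(self.head(feat)).view(img.size(0), -1, 2)


def warp_to_driving(feat, kp_src, kp_drv):
    """Toy warping module: shift the source by the mean source-to-driving keypoint
    offset (real methods predict a dense flow field instead)."""
    offset = (kp_src - kp_drv).mean(dim=1)                  # (B, 2) backward-warp offset
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2).clone()
    grid = grid + offset.view(b, 1, 1, 2)                   # sample source at shifted locations
    return F.grid_sample(feat, grid, align_corners=True)


class WarpingFeatureMapper(nn.Module):
    """Maps warped source features into the latent space expected by a pretrained
    image-to-video model (stubbed here as a single 1x1 projection)."""
    def __init__(self, in_ch: int, latent_ch: int):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, latent_ch, 1)

    def forward(self, warped: torch.Tensor) -> torch.Tensor:
        return self.proj(warped)


if __name__ == "__main__":
    src = torch.randn(1, 3, 64, 64)            # static source image
    drv = torch.randn(1, 3, 64, 64)            # one driving frame
    extractor = MotionExtractor()
    kp_src, kp_drv = extractor(src), extractor(drv)
    warped = warp_to_driving(src, kp_src, kp_drv)    # motion alignment
    latent = WarpingFeatureMapper(3, 4)(warped)      # hand-off to the I2V latent space
    print(latent.shape)                              # torch.Size([1, 4, 64, 64])
```

In the full method the warp would be a learned dense flow and the mapped features would condition the pretrained image-to-video model's denoiser; the stub above only mirrors the data flow.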
Related papers
- PEMF-VTO: Point-Enhanced Video Virtual Try-on via Mask-free Paradigm [21.1235226974745]
Video Virtual Try-on aims to seamlessly transfer a reference garment onto a target person in a video. Existing methods typically rely on inpainting masks to define the try-on area. We propose PEMF-VTO, a novel Point-Enhanced Mask-Free Video Virtual Try-On framework.
arXiv Detail & Related papers (2024-12-04T04:24:15Z)
- Realistic and Efficient Face Swapping: A Unified Approach with Diffusion Models [69.50286698375386]
We propose a novel approach that better harnesses diffusion models for face-swapping.
We introduce a mask shuffling technique during inpainting training, which allows us to create a so-called universal model for swapping.
Because our approach is relatively unified, it is resilient to errors in other off-the-shelf models.
arXiv Detail & Related papers (2024-09-11T13:43:53Z)
- WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z)
- VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation [79.99551055245071]
We propose VividPose, an end-to-end pipeline that ensures superior temporal stability.
An identity-aware appearance controller integrates additional facial information without compromising other appearance details.
A geometry-aware pose controller utilizes both dense rendering maps from SMPL-X and sparse skeleton maps.
VividPose exhibits superior generalization capabilities on our proposed in-the-wild dataset.
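As a rough illustration of the dual pose conditioning mentioned above, the following hypothetical sketch fuses a dense SMPL-X rendering map and a sparse skeleton map (both assumed to be rasterized as images) into one conditioning feature; channel counts and the fusion layer are assumptions, the diffusion backbone is omitted, and this is not the VividPose code.

```python
# Hypothetical fusion of dense (SMPL-X rendering) and sparse (skeleton) pose maps.
import torch
import torch.nn as nn


class PoseConditionFuser(nn.Module):
    def __init__(self, dense_ch: int = 3, sparse_ch: int = 3, out_ch: int = 64):
        super().__init__()
        self.dense_enc = nn.Conv2d(dense_ch, out_ch, 3, padding=1)    # dense SMPL-X rendering map
        self.sparse_enc = nn.Conv2d(sparse_ch, out_ch, 3, padding=1)  # sparse skeleton map
        self.fuse = nn.Conv2d(out_ch * 2, out_ch, 1)                  # merge the two cues

    def forward(self, dense_map: torch.Tensor, skeleton_map: torch.Tensor) -> torch.Tensor:
        d = self.dense_enc(dense_map)
        s = self.sparse_enc(skeleton_map)
        return self.fuse(torch.cat([d, s], dim=1))                    # pose feature for the denoiser


fused = PoseConditionFuser()(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
print(fused.shape)  # torch.Size([1, 64, 64, 64])
```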
arXiv Detail & Related papers (2024-05-28T13:18:32Z)
- DiffusionAct: Controllable Diffusion Autoencoder for One-shot Face Reenactment [34.821255203019554]
Video-driven neural face reenactment aims to synthesize realistic facial images that successfully preserve the identity and appearance of a source face. Recent advances in Diffusion Probabilistic Models (DPMs) enable the generation of high-quality realistic images. We present DiffusionAct, a novel method that leverages the photo-realistic image generation of diffusion models to perform neural face reenactment.
arXiv Detail & Related papers (2024-03-25T21:46:53Z)
- HyperReenact: One-Shot Reenactment via Jointly Learning to Refine and Retarget Faces [47.27033282706179]
We present our method for neural face reenactment, called HyperReenact, that aims to generate realistic talking head images of a source identity.
Our method operates under the one-shot setting (i.e., using a single source frame) and allows for cross-subject reenactment, without requiring subject-specific fine-tuning.
We compare our method both quantitatively and qualitatively against several state-of-the-art techniques on the standard benchmarks of VoxCeleb1 and VoxCeleb2.
arXiv Detail & Related papers (2023-07-20T11:59:42Z)
- Controllable One-Shot Face Video Synthesis With Semantic Aware Prior [10.968343822308812]
The one-shot talking-head synthesis task aims to animate a source image to another pose and expression, as dictated by a driving frame.
Recent methods rely on warping appearance features extracted from the source using motion fields estimated from sparse keypoints that are learned in an unsupervised manner.
We propose a novel method that leverages rich facial prior information.
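The keypoint-driven warping summarized above can be sketched in the spirit of first-order motion models (not this paper's exact formulation): per-keypoint offsets between source and driving are blended into a dense motion field with Gaussian weights and used to backward-warp the source appearance features. The keypoint count, Gaussian width, and tensor shapes below are assumptions.

```python
# Hypothetical dense motion field from sparse keypoints, used to warp source features.
import torch
import torch.nn.functional as F


def dense_flow_from_keypoints(kp_src, kp_drv, h, w, sigma=0.1):
    """Blend per-keypoint offsets into a dense flow with Gaussian weights centred
    on the driving keypoints (keypoint coordinates are assumed to lie in [-1, 1])."""
    b, k, _ = kp_src.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).view(1, 1, h, w, 2)       # (1, 1, H, W, 2)
    dist = ((grid - kp_drv.view(b, k, 1, 1, 2)) ** 2).sum(-1)      # (B, K, H, W)
    weight = torch.softmax(-dist / (2 * sigma ** 2), dim=1)        # soft keypoint assignment
    offset = (kp_src - kp_drv).view(b, k, 1, 1, 2)                 # per-keypoint motion
    return (weight.unsqueeze(-1) * offset).sum(dim=1)              # (B, H, W, 2)


def warp_source_features(src_feat, kp_src, kp_drv):
    """Backward-warp source appearance features toward the driving pose."""
    b, _, h, w = src_feat.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    identity = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2)
    grid = identity + dense_flow_from_keypoints(kp_src, kp_drv, h, w)
    return F.grid_sample(src_feat, grid, align_corners=True)


feat = torch.randn(1, 64, 32, 32)                       # source appearance features
kp_s, kp_d = torch.rand(1, 10, 2) * 2 - 1, torch.rand(1, 10, 2) * 2 - 1
print(warp_source_features(feat, kp_s, kp_d).shape)     # torch.Size([1, 64, 32, 32])
```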
arXiv Detail & Related papers (2023-04-27T19:17:13Z)
- High-Fidelity and Freely Controllable Talking Head Video Generation [31.08828907637289]
We propose a novel model that produces high-fidelity talking head videos with free control over head pose and expression.
We introduce a novel motion-aware multi-scale feature alignment module to effectively transfer the motion without face distortion.
We evaluate our model on challenging datasets and demonstrate its state-of-the-art performance.
arXiv Detail & Related papers (2023-04-20T09:02:41Z)
- DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder [55.58582254514431]
We propose DAE-Talker to synthesize full video frames and produce natural head movements that align with the content of speech. We also introduce pose modelling in speech2latent for pose controllability. Our experiments show that DAE-Talker outperforms existing popular methods in lip-sync, video fidelity, and pose naturalness.
arXiv Detail & Related papers (2023-03-30T17:18:31Z)
- Video2StyleGAN: Encoding Video in Latent Space for Manipulation [63.03250800510085]
We propose a novel network to encode face videos into the latent space of StyleGAN for semantic face video manipulation.
Our approach can significantly outperform existing single image methods, while achieving real-time (66 fps) speed.
arXiv Detail & Related papers (2022-06-27T06:48:15Z)
- Realistic Face Reenactment via Self-Supervised Disentangling of Identity and Pose [23.211318473026243]
We propose a self-supervised hybrid model (DAE-GAN) that learns how to reenact faces naturally given large amounts of unlabeled videos.
Our approach combines two deforming autoencoders with the latest advances in conditional generation.
Experiment results demonstrate the superior quality of reenacted images and the flexibility of transferring facial movements between identities.
arXiv Detail & Related papers (2020-03-29T06:45:17Z)