FSRT: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance, Head-pose, and Facial Expression Features
- URL: http://arxiv.org/abs/2404.09736v2
- Date: Mon, 10 Jun 2024 09:06:49 GMT
- Title: FSRT: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance, Head-pose, and Facial Expression Features
- Authors: Andre Rochow, Max Schwarz, Sven Behnke,
- Abstract summary: The task of face reenactment is to transfer the head motion and facial expressions from a driving video to the appearance of a source image.
Most existing methods are CNN-based and estimate optical flow from the source image to the current driving frame.
We propose a transformer-based encoder for computing a set-latent representation of the source image.
- Score: 17.531847357428454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of face reenactment is to transfer the head motion and facial expressions from a driving video to the appearance of a source image, which may be of a different person (cross-reenactment). Most existing methods are CNN-based and estimate optical flow from the source image to the current driving frame, which is then inpainted and refined to produce the output animation. We propose a transformer-based encoder for computing a set-latent representation of the source image(s). We then predict the output color of a query pixel using a transformer-based decoder, which is conditioned with keypoints and a facial expression vector extracted from the driving frame. Latent representations of the source person are learned in a self-supervised manner that factorize their appearance, head pose, and facial expressions. Thus, they are perfectly suited for cross-reenactment. In contrast to most related work, our method naturally extends to multiple source images and can thus adapt to person-specific facial dynamics. We also propose data augmentation and regularization schemes that are necessary to prevent overfitting and support generalizability of the learned representations. We evaluated our approach in a randomized user study. The results indicate superior performance compared to the state-of-the-art in terms of motion transfer quality and temporal consistency.
Related papers
- OSDFace: One-Step Diffusion Model for Face Restoration [72.5045389847792]
Diffusion models have demonstrated impressive performance in face restoration.
We propose OSDFace, a novel one-step diffusion model for face restoration.
Results demonstrate that OSDFace surpasses current state-of-the-art (SOTA) methods in both visual quality and quantitative metrics.
arXiv Detail & Related papers (2024-11-26T07:07:48Z) - GaussianHeads: End-to-End Learning of Drivable Gaussian Head Avatars from Coarse-to-fine Representations [54.94362657501809]
We propose a new method to generate highly dynamic and deformable human head avatars from multi-view imagery in real-time.
At the core of our method is a hierarchical representation of head models that allows to capture the complex dynamics of facial expressions and head movements.
We train this coarse-to-fine facial avatar model along with the head pose as a learnable parameter in an end-to-end framework.
arXiv Detail & Related papers (2024-09-18T13:05:43Z) - Knowledge-Enhanced Facial Expression Recognition with Emotional-to-Neutral Transformation [66.53435569574135]
Existing facial expression recognition methods typically fine-tune a pre-trained visual encoder using discrete labels.
We observe that the rich knowledge in text embeddings, generated by vision-language models, is a promising alternative for learning discriminative facial expression representations.
We propose a novel knowledge-enhanced FER method with an emotional-to-neutral transformation.
arXiv Detail & Related papers (2024-09-13T07:28:57Z) - Learning Online Scale Transformation for Talking Head Video Generation [14.102629579320572]
One-shot talking head video generation uses a source image and driving video to create a synthetic video where the source person's facial movements imitate those of the driving video.
Existing methods attempt to locate a frame in the driving video that aligns best with the source image, but imprecise alignment can result in suboptimal outcomes.
We introduce a scale transformation module that can automatically adjust the scale of the driving image to fit that of the source image.
arXiv Detail & Related papers (2024-07-13T18:08:46Z) - Controllable Talking Face Generation by Implicit Facial Keypoints Editing [6.036277153327655]
We present ControlTalk, a talking face generation method to control face expression deformation based on driven audio.
Our experiments show that our method is superior to state-of-the-art performance on widely used benchmarks, including HDTF and MEAD.
arXiv Detail & Related papers (2024-06-05T02:54:46Z) - DiffusionAct: Controllable Diffusion Autoencoder for One-shot Face Reenactment [34.821255203019554]
Video-driven neural face reenactment aims to synthesize realistic facial images that successfully preserve the identity and appearance of a source face.
Recent advances in Diffusion Probabilistic Models (DPMs) enable the generation of high-quality realistic images.
We present Diffusion, a novel method that leverages the photo-realistic image generation of diffusion models to perform neural face reenactment.
arXiv Detail & Related papers (2024-03-25T21:46:53Z) - Identity-Preserving Talking Face Generation with Landmark and Appearance
Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z) - SARGAN: Spatial Attention-based Residuals for Facial Expression
Manipulation [1.7056768055368383]
We present a novel method named SARGAN that addresses the limitations from three perspectives.
We exploited a symmetric encoder-decoder network to attend facial features at multiple scales.
Our proposed model performs significantly better than state-of-the-art methods.
arXiv Detail & Related papers (2023-03-30T08:15:18Z) - Semantic-aware One-shot Face Re-enactment with Dense Correspondence
Estimation [100.60938767993088]
One-shot face re-enactment is a challenging task due to the identity mismatch between source and driving faces.
This paper proposes to use 3D Morphable Model (3DMM) for explicit facial semantic decomposition and identity disentanglement.
arXiv Detail & Related papers (2022-11-23T03:02:34Z) - Emotion Separation and Recognition from a Facial Expression by Generating the Poker Face with Vision Transformers [57.1091606948826]
We propose a novel FER model, named Poker Face Vision Transformer or PF-ViT, to address these challenges.
PF-ViT aims to separate and recognize the disturbance-agnostic emotion from a static facial image via generating its corresponding poker face.
PF-ViT utilizes vanilla Vision Transformers, and its components are pre-trained as Masked Autoencoders on a large facial expression dataset.
arXiv Detail & Related papers (2022-07-22T13:39:06Z) - Mesh Guided One-shot Face Reenactment using Graph Convolutional Networks [31.083072922977568]
We introduce a method for one-shot face reenactment, which uses the reconstructed 3D meshes as guidance to learn the optical flow needed for the reenacted face synthesis.
We propose a motion net to learn the face motion, which is an asymmetric autoencoder.
Our method can generate high-quality results and outperforms state-of-the-art methods in both qualitative and quantitative comparisons.
arXiv Detail & Related papers (2020-08-18T07:41:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.