Implicit Identity Representation Conditioned Memory Compensation Network for Talking Head Video Generation
- URL: http://arxiv.org/abs/2307.09906v3
- Date: Fri, 18 Aug 2023 07:29:19 GMT
- Title: Implicit Identity Representation Conditioned Memory Compensation Network for Talking Head Video Generation
- Authors: Fa-Ting Hong and Dan Xu
- Abstract summary: Talking head video generation aims to animate a human face in a still image with dynamic poses and expressions using motion information.
A still source image cannot provide sufficient appearance information for occluded regions or delicate expression variations.
We propose a novel implicit identity representation conditioned memory compensation network, coined as MCNet, for high-fidelity talking head generation.
- Score: 16.66038865012963
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Talking head video generation aims to animate a human face in a still image
with dynamic poses and expressions using motion information derived from a
target-driving video, while maintaining the person's identity in the source
image. However, dramatic and complex motions in the driving video cause
ambiguous generation, because the still source image cannot provide sufficient
appearance information for occluded regions or delicate expression variations,
which produces severe artifacts and significantly degrades the generation
quality. To tackle this problem, we propose to learn a global facial
representation space, and design a novel implicit identity representation
conditioned memory compensation network, coined as MCNet, for high-fidelity
talking head generation. Specifically, we devise a network module to learn a
unified spatial facial meta-memory bank from all training samples, which can
provide rich facial structure and appearance priors to compensate warped source
facial features for the generation. Furthermore, we propose an effective query
mechanism based on implicit identity representations learned from the discrete
keypoints of the source image. It can greatly facilitate the retrieval of more
correlated information from the memory bank for the compensation. Extensive
experiments demonstrate that MCNet can learn representative and complementary
facial memory, and can clearly outperform previous state-of-the-art talking
head generation methods on VoxCeleb1 and CelebV datasets. Please check our
project page: https://github.com/harlanhong/ICCV2023-MCNET
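To make the compensation idea above concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: a learnable memory bank shared across all training samples is queried by spatial features conditioned on an implicit identity code derived from the discrete source keypoints, and the retrieved priors are fused back into the warped source features. All names (`MemoryCompensation`, `num_slots`, `to_query`), dimensions, and the additive fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn


class MemoryCompensation(nn.Module):
    """Toy memory-compensation module: query a global learnable memory bank
    with identity-conditioned spatial features and fuse the retrieved priors
    back into the warped source features."""

    def __init__(self, num_slots: int = 256, feat_dim: int = 128, num_kp: int = 10):
        super().__init__()
        # Global meta-memory shared across all identities, learned jointly with the generator.
        self.memory = nn.Parameter(torch.randn(num_slots, feat_dim))
        # Implicit identity representation from the discrete source keypoints.
        self.kp_encoder = nn.Sequential(
            nn.Linear(num_kp * 2, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        # Projects [spatial feature, identity code] into a memory query.
        self.to_query = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, warped_feat: torch.Tensor, source_kp: torch.Tensor) -> torch.Tensor:
        # warped_feat: (B, C, H, W) source features warped by the driving motion
        # source_kp:   (B, K, 2) normalized keypoint coordinates of the source face
        B, C, H, W = warped_feat.shape
        identity = self.kp_encoder(source_kp.flatten(1))            # (B, C)
        feat = warped_feat.flatten(2).transpose(1, 2)                # (B, HW, C)
        # Condition every spatial query on the implicit identity code.
        query = self.to_query(
            torch.cat([feat, identity.unsqueeze(1).expand(-1, H * W, -1)], dim=-1)
        )                                                            # (B, HW, C)
        # Attention over the memory slots retrieves facial structure/appearance priors.
        attn = torch.softmax(query @ self.memory.t() / C ** 0.5, dim=-1)  # (B, HW, S)
        retrieved = attn @ self.memory                               # (B, HW, C)
        # Compensate the warped features with the retrieved priors (additive fusion here).
        out = feat + retrieved
        return out.transpose(1, 2).reshape(B, C, H, W)


if __name__ == "__main__":
    module = MemoryCompensation()
    feats = torch.randn(2, 128, 16, 16)   # warped source features
    kps = torch.rand(2, 10, 2) * 2 - 1    # keypoints in [-1, 1]
    print(module(feats, kps).shape)       # torch.Size([2, 128, 16, 16])
```

The paper describes a spatial meta-memory queried at multiple feature scales; this sketch collapses that to a single flat memory and one resolution purely to illustrate the identity-conditioned query-and-compensate pattern.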
Related papers
- OSDFace: One-Step Diffusion Model for Face Restoration [72.5045389847792]
Diffusion models have demonstrated impressive performance in face restoration.
We propose OSDFace, a novel one-step diffusion model for face restoration.
Results demonstrate that OSDFace surpasses current state-of-the-art (SOTA) methods in both visual quality and quantitative metrics.
arXiv Detail & Related papers (2024-11-26T07:07:48Z)
- GaussianHeads: End-to-End Learning of Drivable Gaussian Head Avatars from Coarse-to-fine Representations [54.94362657501809]
We propose a new method to generate highly dynamic and deformable human head avatars from multi-view imagery in real-time.
At the core of our method is a hierarchical representation of head models that captures the complex dynamics of facial expressions and head movements.
We train this coarse-to-fine facial avatar model along with the head pose as a learnable parameter in an end-to-end framework.
arXiv Detail & Related papers (2024-09-18T13:05:43Z)
- AniFaceDiff: High-Fidelity Face Reenactment via Facial Parametric Conditioned Diffusion Models [33.39336530229545]
Face reenactment refers to the process of transferring the pose and facial expressions from a reference (driving) video onto a static facial (source) image.
Previous research in this domain has made significant progress by training controllable deep generative models to generate faces.
This paper proposes a new method based on Stable Diffusion, called AniFaceDiff, incorporating a new conditioning module for high-fidelity face reenactment.
arXiv Detail & Related papers (2024-06-19T07:08:48Z)
- FaceChain: A Playground for Human-centric Artificial Intelligence Generated Content [36.48960592782015]
FaceChain is a personalized portrait generation framework that combines a series of customized image-generation models with a rich set of face-related perceptual understanding models.
We inject several SOTA face models into the generation procedure, achieving more efficient label tagging, data processing, and model post-processing than previous solutions.
Based on FaceChain, we further develop several applications, including virtual try-on and a 2D talking head, to build a broader playground that better showcases its value.
arXiv Detail & Related papers (2023-08-28T02:20:44Z)
- Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z)
- Image-to-Video Generation via 3D Facial Dynamics [78.01476554323179]
We present a versatile model, FaceAnime, for various video generation tasks from still images.
Our model is versatile for various AR/VR and entertainment applications, such as face video and face video prediction.
arXiv Detail & Related papers (2021-05-31T02:30:11Z)
- Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)
- Fine-grained Image-to-Image Transformation towards Visual Recognition [102.51124181873101]
We aim at transforming an image with a fine-grained category to synthesize new images that preserve the identity of the input image.
We adopt a model based on generative adversarial networks to disentangle the identity related and unrelated factors of an image.
Experiments on the CompCars and Multi-PIE datasets demonstrate that our model preserves the identity of the generated images much better than the state-of-the-art image-to-image transformation models.
arXiv Detail & Related papers (2020-01-12T05:26:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.