Depth-Aware Generative Adversarial Network for Talking Head Video Generation
- URL: http://arxiv.org/abs/2203.06605v2
- Date: Tue, 15 Mar 2022 01:34:02 GMT
- Title: Depth-Aware Generative Adversarial Network for Talking Head Video Generation
- Authors: Fa-Ting Hong, Longhao Zhang, Li Shen, and Dan Xu
- Abstract summary: Talking head video generation aims to produce a synthetic human face video that contains the identity and pose information respectively from a given source image and a driving video.
Existing works for this task heavily rely on 2D representations (e.g. appearance and motion) learned from the input images.
In this paper, we introduce a self-supervised geometry learning method to automatically recover the dense 3D geometry (i.e., depth) from face videos.
- Score: 15.43672834991479
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Talking head video generation aims to produce a synthetic human face video
that contains the identity and pose information respectively from a given
source image and a driving video. Existing works for this task rely heavily on
2D representations (e.g., appearance and motion) learned from the input images.
However, dense 3D facial geometry (e.g., pixel-wise depth) is extremely
important for this task, as it helps to generate accurate 3D face structures
and to distinguish noisy information from the possibly cluttered background.
Nevertheless, dense 3D geometry annotations are
prohibitively costly for videos and are typically not available for this video
generation task. In this paper, we first introduce a self-supervised geometry
learning method to automatically recover the dense 3D geometry (i.e., depth) from
the face videos without the requirement of any expensive 3D annotation data.
Based on the learned dense depth maps, we further propose to leverage them to
estimate sparse facial keypoints that capture the critical movement of the
human head. At the dense level, the depth is further utilized to learn 3D-aware
cross-modal (i.e., appearance and depth) attention to guide the generation of
motion fields for warping source image representations. All these contributions
compose a novel depth-aware generative adversarial network (DaGAN) for talking
head generation. Extensive experiments demonstrate that our proposed method can
generate highly realistic faces and achieve strong results on unseen human
faces.
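
To make the pipeline concrete, here is a minimal, hypothetical sketch of the depth-guided keypoint step described above (module names, channel sizes, and the keypoint count are assumptions, not the authors' code): the recovered depth map is concatenated with the source image, per-keypoint heatmaps are predicted, and sparse keypoints are read off with a soft-argmax.

```python
import torch
import torch.nn as nn

class DepthAwareKeypointDetector(nn.Module):
    """Hypothetical sketch: estimate K sparse facial keypoints from an
    RGB image plus a (self-supervised) depth map, via heatmap soft-argmax."""
    def __init__(self, num_kp: int = 15):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),  # 3 RGB + 1 depth channel
            nn.Conv2d(32, num_kp, 3, padding=1),        # one heatmap per keypoint
        )

    def forward(self, image: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        b, _, h, w = image.shape
        heatmaps = self.encoder(torch.cat([image, depth], dim=1))   # (B, K, H, W)
        probs = torch.softmax(heatmaps.flatten(2), dim=-1).view(b, -1, h, w)
        # Soft-argmax: expected (x, y) location under each heatmap, in [-1, 1].
        ys = torch.linspace(-1, 1, h, device=image.device).view(1, 1, h, 1)
        xs = torch.linspace(-1, 1, w, device=image.device).view(1, 1, 1, w)
        kp_x = (probs * xs).sum(dim=(2, 3))
        kp_y = (probs * ys).sum(dim=(2, 3))
        return torch.stack([kp_x, kp_y], dim=-1)                    # (B, K, 2)
```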
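Likewise, the 3D-aware cross-modal attention and the motion-field warping can be sketched as below, assuming single-head attention in which depth features supply the queries over appearance keys/values, and a flow field expressed as offsets in grid_sample's normalized [-1, 1] coordinates; shapes and modules are illustrative rather than DaGAN's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Hypothetical sketch: depth features decide where to attend in the
    appearance features, yielding geometry-guided features."""
    def __init__(self, channels: int):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels, 1)  # queries from depth
        self.to_k = nn.Conv2d(channels, channels, 1)  # keys from appearance
        self.to_v = nn.Conv2d(channels, channels, 1)  # values from appearance

    def forward(self, appearance: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        b, c, h, w = appearance.shape
        q = self.to_q(depth).flatten(2).transpose(1, 2)       # (B, HW, C)
        k = self.to_k(appearance).flatten(2)                  # (B, C, HW)
        v = self.to_v(appearance).flatten(2).transpose(1, 2)  # (B, HW, C)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)        # (B, HW, HW)
        return (attn @ v).transpose(1, 2).reshape(b, c, h, w)

def warp_with_motion_field(source_feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp source features with a dense motion field given as (B, H, W, 2)
    offsets in normalized coordinates, using bilinear sampling."""
    b, _, h, w = source_feat.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=source_feat.device),
        torch.linspace(-1, 1, w, device=source_feat.device),
        indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    return F.grid_sample(source_feat, grid + flow, align_corners=True)
```

The warped features would then feed the generator; details such as occlusion maps and multi-scale flows are omitted here.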
Related papers
- FaceGPT: Self-supervised Learning to Chat about 3D Human Faces [69.4651241319356]
We introduce FaceGPT, a self-supervised learning framework for Large Vision-Language Models (VLMs) to reason about 3D human faces from images and text.
FaceGPT achieves this by embedding the parameters of a 3D morphable face model (3DMM) into the token space of a VLM.
We show that FaceGPT achieves high-quality 3D face reconstructions and retains the ability for general-purpose visual instruction following.
arXiv Detail & Related papers (2024-06-11T11:13:29Z)
- ID-to-3D: Expressive ID-guided 3D Heads via Score Distillation Sampling [96.87575334960258]
ID-to-3D is a method to generate identity- and text-guided 3D human heads with disentangled expressions.
Results achieve an unprecedented level of identity consistency together with high-quality texture and geometry generation.
arXiv Detail & Related papers (2024-05-26T13:36:45Z)
- DaGAN++: Depth-Aware Generative Adversarial Network for Talking Head Video Generation [18.511092587156657]
We present a novel self-supervised method for learning dense 3D facial geometry from face videos.
We also propose a strategy to learn pixel-level uncertainties to perceive more reliable rigid-motion pixels for geometry learning.
We develop a 3D-aware cross-modal (i.e., appearance and depth) attention mechanism to capture facial geometries in a coarse-to-fine manner.
arXiv Detail & Related papers (2023-05-10T14:58:33Z)
- Graphics Capsule: Learning Hierarchical 3D Face Representations from 2D Images [82.5266467869448]
We propose an Inverse Graphics Capsule Network (IGC-Net) to learn the hierarchical 3D face representations from large-scale unlabeled images.
IGC-Net first decomposes the objects into a set of semantic-consistent part-level descriptions and then assembles them into object-level descriptions to build the hierarchy.
arXiv Detail & Related papers (2023-03-20T06:32:55Z)
- Copy Motion From One to Another: Fake Motion Video Generation [53.676020148034034]
A compelling application of artificial intelligence is to generate a video of a target person performing arbitrary desired motion.
Current methods typically employ GANs with an L2 loss to assess the authenticity of the generated videos.
We propose a theoretically motivated Gromov-Wasserstein loss that facilitates learning the mapping from a pose to a foreground image.
Our method is able to generate realistic target person videos, faithfully copying complex motions from a source person.
arXiv Detail & Related papers (2022-05-03T08:45:22Z)
- Image-to-Video Generation via 3D Facial Dynamics [78.01476554323179]
We present a versatile model, FaceAnime, for various video generation tasks from still images.
Our model is versatile for various AR/VR and entertainment applications, such as face video generation and face video prediction.
arXiv Detail & Related papers (2021-05-31T02:30:11Z)
- Multi-channel Deep 3D Face Recognition [4.726009758066045]
The accuracy of 2D face recognition is still challenged by the change of pose, illumination, make-up, and expression.
We propose a multi-channel deep 3D face network for face recognition based on 3D face data.
The face recognition accuracy of the multi-channel deep 3D face network reaches 98.6%.
arXiv Detail & Related papers (2020-09-30T15:29:05Z)
- 3D Facial Geometry Recovery from a Depth View with Attention Guided Generative Adversarial Network [27.773904952734547]
We propose to recover the complete 3D facial geometry from a single depth view with an Attention Guided Generative Adversarial Network (AGGAN).
Specifically, AGGAN encodes the 3D facial geometry within a voxel space and utilizes an attention-guided GAN to model the ill-posed 2.5D depth-to-3D mapping.
Both qualitative and quantitative comparisons show that AGGAN recovers a more complete and smoother 3D facial shape, handles a much wider range of view angles, and resists noise in the depth view better than conventional methods.
arXiv Detail & Related papers (2020-09-02T10:35:26Z)
- DeepFaceFlow: In-the-wild Dense 3D Facial Motion Estimation [56.56575063461169]
DeepFaceFlow is a robust, fast, and highly accurate framework for the estimation of 3D non-rigid facial flow.
Our framework was trained and tested on two very large-scale facial video datasets.
Given registered pairs of images, our framework generates 3D flow maps at 60 fps.
arXiv Detail & Related papers (2020-05-14T23:56:48Z)