DaGAN++: Depth-Aware Generative Adversarial Network for Talking Head Video Generation
- URL: http://arxiv.org/abs/2305.06225v2
- Date: Sun, 10 Dec 2023 05:20:24 GMT
- Title: DaGAN++: Depth-Aware Generative Adversarial Network for Talking Head Video Generation
- Authors: Fa-Ting Hong, Li Shen, and Dan Xu
- Abstract summary: We present a novel self-supervised method for learning dense 3D facial geometry from face videos.
We also propose a strategy to learn pixel-level uncertainties to perceive more reliable rigid-motion pixels for geometry learning.
- We develop a 3D-aware cross-modal (i.e., appearance and depth) attention mechanism to capture facial geometries in a coarse-to-fine manner.
- Score: 18.511092587156657
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Predominant techniques for talking head generation largely depend on 2D
information, including facial appearances and motions from input face images.
Nevertheless, dense 3D facial geometry, such as pixel-wise depth, plays a
critical role in constructing accurate 3D facial structures and suppressing
complex background noise for generation. However, dense 3D annotations for
facial videos are prohibitively costly to obtain. In this work, firstly, we
present a novel self-supervised method for learning dense 3D facial geometry
(i.e., depth) from face videos, without requiring camera parameters and 3D
geometry annotations in training. We further propose a strategy to learn
pixel-level uncertainties to perceive more reliable rigid-motion pixels for
geometry learning. Secondly, we design an effective geometry-guided facial
keypoint estimation module, providing accurate keypoints for generating motion
fields. Lastly, we develop a 3D-aware cross-modal (i.e., appearance and depth)
attention mechanism, which can be applied to each generation layer, to capture
facial geometries in a coarse-to-fine manner. Extensive experiments are
conducted on three challenging benchmarks (i.e., VoxCeleb1, VoxCeleb2, and HDTF).
The results demonstrate that our proposed framework can generate highly
realistic-looking reenacted talking videos, with new state-of-the-art
performances established on these benchmarks. The code and trained models are
publicly available on the GitHub project page at
https://github.com/harlanhong/CVPR2022-DaGAN
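The abstract above describes two components concrete enough to sketch. First, the pixel-level uncertainty strategy: a common way to down-weight unreliable rigid-motion pixels in self-supervised depth learning is a heteroscedastic photometric loss, where a predicted per-pixel log-uncertainty attenuates the reconstruction residual and a log term penalizes declaring everything uncertain. The sketch below is a minimal PyTorch illustration under that assumption, not the authors' exact formulation:

```python
import torch

def uncertainty_weighted_photometric_loss(
        warped: torch.Tensor,     # (B, 3, H, W) source frame warped by predicted depth/motion
        target: torch.Tensor,     # (B, 3, H, W) driving frame
        log_sigma: torch.Tensor,  # (B, 1, H, W) predicted per-pixel log-uncertainty
) -> torch.Tensor:
    # Per-pixel L1 residual, attenuated where the network predicts high
    # uncertainty; the +log_sigma term forbids the trivial all-uncertain fix.
    residual = (warped - target).abs().mean(dim=1, keepdim=True)
    return (residual * torch.exp(-log_sigma) + log_sigma).mean()
```

Second, the 3D-aware cross-modal attention: one plausible reading is that depth features supply the queries while appearance features supply the keys and values, with one such layer per generator resolution giving the coarse-to-fine behavior. The following is an illustrative sketch; all module and dimension choices are assumptions, and the authors' released code at the GitHub link above is the authoritative version:

```python
import torch
import torch.nn as nn

class DepthAwareCrossModalAttention(nn.Module):
    """Depth-conditioned attention over appearance features (illustrative)."""

    def __init__(self, app_channels: int, depth_channels: int, dim: int = 64):
        super().__init__()
        self.to_q = nn.Conv2d(depth_channels, dim, kernel_size=1)  # queries from depth
        self.to_k = nn.Conv2d(app_channels, dim, kernel_size=1)    # keys from appearance
        self.to_v = nn.Conv2d(app_channels, dim, kernel_size=1)    # values from appearance
        self.proj = nn.Conv2d(dim, app_channels, kernel_size=1)
        self.scale = dim ** -0.5

    def forward(self, app_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        b, _, h, w = app_feat.shape
        q = self.to_q(depth_feat).flatten(2).transpose(1, 2)  # (B, HW, dim)
        k = self.to_k(app_feat).flatten(2)                    # (B, dim, HW)
        v = self.to_v(app_feat).flatten(2).transpose(1, 2)    # (B, HW, dim)
        attn = torch.softmax(q @ k * self.scale, dim=-1)      # (B, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return app_feat + self.proj(out)  # residual keeps the appearance stream

# One layer per generation layer, e.g. at 32x32 resolution:
layer = DepthAwareCrossModalAttention(app_channels=256, depth_channels=64)
refined = layer(torch.randn(1, 256, 32, 32), torch.randn(1, 64, 32, 32))
```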
Related papers
- Invisible Stitch: Generating Smooth 3D Scenes with Depth Inpainting [75.7154104065613]
We introduce a novel depth completion model, trained via teacher distillation and self-training to learn the 3D fusion process.
We also introduce a new benchmarking scheme for scene generation methods that is based on ground truth geometry.
arXiv Detail & Related papers (2024-04-30T17:59:40Z)
- 3D Face Tracking from 2D Video through Iterative Dense UV to Image Flow [15.479024531161476]
We propose a novel face tracker, FlowFace, that introduces an innovative 2D alignment network for dense per-vertex alignment.
Unlike prior work, FlowFace is trained on high-quality 3D scan annotations rather than weak supervision or synthetic data.
Our method exhibits superior performance on both custom and publicly available benchmarks.
arXiv Detail & Related papers (2024-04-15T14:20:07Z)
- 4D Facial Expression Diffusion Model [3.507793603897647]
We introduce a generative framework for producing 3D facial expression sequences.
It is composed of two tasks: learning a generative model over a set of 3D landmark sequences, and generating 3D mesh sequences of an input facial mesh driven by the generated landmark sequences.
Experiments show that our model learns to generate realistic, high-quality expressions from a dataset of relatively small size, improving over state-of-the-art methods.
arXiv Detail & Related papers (2023-03-29T11:50:21Z)
- Self-Supervised Geometry-Aware Encoder for Style-Based 3D GAN Inversion [115.82306502822412]
StyleGAN has achieved great progress in 2D face reconstruction and semantic editing via image inversion and latent editing.
However, a corresponding generic 3D GAN inversion framework is still missing, limiting applications in 3D face reconstruction and semantic editing.
We study the challenging problem of 3D GAN inversion where a latent code is predicted given a single face image to faithfully recover its 3D shapes and detailed textures.
arXiv Detail & Related papers (2022-12-14T18:49:50Z)
- MvDeCor: Multi-view Dense Correspondence Learning for Fine-grained 3D Segmentation [91.6658845016214]
We propose to utilize self-supervised techniques in the 2D domain for fine-grained 3D shape segmentation tasks.
We render a 3D shape from multiple views, and set up a dense correspondence learning task within the contrastive learning framework.
As a result, the learned 2D representations are view-invariant and geometrically consistent.
arXiv Detail & Related papers (2022-08-18T00:48:15Z)
- PointMCD: Boosting Deep Point Cloud Encoders via Multi-view Cross-modal Distillation for 3D Shape Recognition [55.38462937452363]
We propose a unified multi-view cross-modal distillation architecture, including a pretrained deep image encoder as the teacher and a deep point encoder as the student.
By pair-wise aligning multi-view visual and geometric descriptors, we can obtain more powerful deep point encoders without exhaustive and complicated network modifications (see the sketch after this list).
arXiv Detail & Related papers (2022-07-07T07:23:20Z)
- Copy Motion From One to Another: Fake Motion Video Generation [53.676020148034034]
A compelling application of artificial intelligence is to generate a video of a target person performing arbitrary desired motion.
Current methods typically employ GANs with an L2 loss to assess the authenticity of the generated videos.
We propose a theoretically motivated Gromov-Wasserstein loss that facilitates learning the mapping from a pose to a foreground image.
Our method is able to generate realistic target person videos, faithfully copying complex motions from a source person.
arXiv Detail & Related papers (2022-05-03T08:45:22Z)
- Depth-Aware Generative Adversarial Network for Talking Head Video Generation [15.43672834991479]
Talking head video generation aims to produce a synthetic human face video that contains the identity and pose information respectively from a given source image and a driving video.
Existing works for this task heavily rely on 2D representations (e.g. appearance and motion) learned from the input images.
In this paper, we introduce a self-supervised geometry learning method to automatically recover dense 3D geometry (i.e., depth) from face videos.
arXiv Detail & Related papers (2022-03-13T09:32:22Z)
- 3D Facial Geometry Recovery from a Depth View with Attention Guided Generative Adversarial Network [27.773904952734547]
We propose an Attention Guided Generative Adversarial Network (AGGAN) to recover the complete 3D facial geometry from a single depth view.
Specifically, AGGAN encodes the 3D facial geometry within a voxel space and utilizes an attention-guided GAN to model the ill-posed 2.5D depth-to-3D mapping.
Both qualitative and quantitative comparisons show that AGGAN recovers a more complete and smoother 3D facial shape, handling a much wider range of view angles and resisting noise in the depth view better than conventional methods.
arXiv Detail & Related papers (2020-09-02T10:35:26Z)
- DeepFaceFlow: In-the-wild Dense 3D Facial Motion Estimation [56.56575063461169]
DeepFaceFlow is a robust, fast, and highly-accurate framework for the estimation of 3D non-rigid facial flow.
Our framework was trained and tested on two very large-scale facial video datasets.
Given registered pairs of images, our framework generates 3D flow maps at 60 fps.
arXiv Detail & Related papers (2020-05-14T23:56:48Z)
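For the PointMCD entry above, the pair-wise alignment of multi-view visual and geometric descriptors can be pictured as a simple distillation loss: a frozen image encoder produces one descriptor per rendered view (the teacher), and the point-cloud encoder (the student) is pulled toward them in embedding space. A minimal sketch under those assumptions follows; the embedding size and cosine-distance choice are placeholders, not that paper's actual architecture:

```python
import torch
import torch.nn.functional as F

def crossmodal_distillation_loss(
        teacher_views: torch.Tensor,   # (B, V, D) frozen image-encoder descriptors, one per view
        student_points: torch.Tensor,  # (B, D) point-cloud encoder descriptor
) -> torch.Tensor:
    # Normalize both sides and minimize cosine distance between the student
    # descriptor and every view descriptor; teacher targets are detached.
    student = F.normalize(student_points, dim=-1).unsqueeze(1)  # (B, 1, D)
    teacher = F.normalize(teacher_views, dim=-1).detach()       # (B, V, D)
    return (1.0 - (student * teacher).sum(dim=-1)).mean()

# Dummy check: 4 views, 512-dim embeddings; gradients reach only the student.
student_desc = torch.randn(2, 512, requires_grad=True)
loss = crossmodal_distillation_loss(torch.randn(2, 4, 512), student_desc)
loss.backward()
```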