3D-Aware Talking-Head Video Motion Transfer
- URL: http://arxiv.org/abs/2311.02549v1
- Date: Sun, 5 Nov 2023 02:50:45 GMT
- Title: 3D-Aware Talking-Head Video Motion Transfer
- Authors: Haomiao Ni, Jiachen Liu, Yuan Xue, Sharon X. Huang
- Abstract summary: We propose a 3D-aware talking-head video motion transfer network, Head3D.
Head3D exploits the subject appearance information by generating a visually-interpretable 3D canonical head from the 2D subject frames.
Our experiments on two public talking-head video datasets demonstrate that Head3D outperforms both 2D and 3D prior arts.
- Score: 20.135083791297603
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motion transfer of talking-head videos involves generating a new video with
the appearance of a subject video and the motion pattern of a driving video.
Current methodologies primarily depend on a limited number of subject images
and 2D representations, thereby neglecting to fully utilize the multi-view
appearance features inherent in the subject video. In this paper, we propose a
novel 3D-aware talking-head video motion transfer network, Head3D, which fully
exploits the subject appearance information by generating a
visually-interpretable 3D canonical head from the 2D subject frames with a
recurrent network. A key component of our approach is a self-supervised 3D head
geometry learning module, designed to predict head poses and depth maps from 2D
subject video frames. This module facilitates the estimation of a 3D head in
canonical space, which can then be transformed to align with driving video
frames. Additionally, we employ an attention-based fusion network to combine
the background and other details from subject frames with the 3D subject head
to produce the synthetic target video. Our extensive experiments on two public
talking-head video datasets demonstrate that Head3D outperforms both 2D and 3D
prior arts in the practical cross-identity setting, with evidence showing it
can be readily adapted to the pose-controllable novel view synthesis task.
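Below is a minimal, hypothetical PyTorch sketch of the pipeline the abstract describes: a self-supervised geometry module predicting per-frame head pose and depth, a recurrent module aggregating subject frames into a canonical head representation, and an attention-based fusion step. The module names (PoseDepthNet, CanonicalHeadGRU, FusionNet), layer sizes, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the Head3D pipeline from the abstract; shapes and modules are assumed.
import torch
import torch.nn as nn

class PoseDepthNet(nn.Module):
    """Self-supervised head-geometry module: predicts a 6-DoF pose and a depth map per frame."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, 2, 1), nn.ReLU())
        self.pose_head = nn.Linear(32, 6)             # rotation (3) + translation (3)
        self.depth_head = nn.Conv2d(32, 1, 3, 1, 1)   # per-pixel depth

    def forward(self, frame):                         # frame: (B, 3, H, W)
        feat = self.backbone(frame)                   # (B, 32, H/4, W/4)
        pose = self.pose_head(feat.mean(dim=(2, 3)))  # (B, 6)
        depth = self.depth_head(feat)                 # (B, 1, H/4, W/4)
        return pose, depth

class CanonicalHeadGRU(nn.Module):
    """Recurrent aggregation of multi-view subject-frame features into a canonical head code."""
    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)

    def forward(self, per_frame_feats):               # (B, T, feat_dim)
        _, h = self.gru(per_frame_feats)
        return h[-1]                                  # (B, hidden) canonical head code

class FusionNet(nn.Module):
    """Attention-based fusion of the re-posed head code with background tokens from subject frames."""
    def __init__(self, hidden=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.to_rgb = nn.Linear(hidden, 3)

    def forward(self, head_code, background_tokens):  # (B, 1, hidden), (B, N, hidden)
        fused, _ = self.attn(head_code, background_tokens, background_tokens)
        return self.to_rgb(fused)                     # coarse per-frame colour code

# Usage sketch (shapes only): estimate geometry per subject frame, build the canonical
# head code, read the driving frame's pose, then fuse with background tokens.
if __name__ == "__main__":
    B, T, H, W = 1, 8, 64, 64
    subject = torch.randn(B * T, 3, H, W)
    geom = PoseDepthNet()
    poses, depths = geom(subject)
    # Placeholder per-frame features; a real model would use richer appearance features.
    feats = depths.mean(dim=(2, 3)).reshape(B, T, 1).expand(B, T, 32)
    head_code = CanonicalHeadGRU()(feats)
    driving_pose, _ = geom(torch.randn(B, 3, H, W))   # the 3D warp to this pose is elided here
    out = FusionNet()(head_code.unsqueeze(1), torch.randn(B, 16, 64))
    print(out.shape)                                  # torch.Size([1, 1, 3])
```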
Related papers
- Chat-Edit-3D: Interactive 3D Scene Editing via Text Prompts [76.73043724587679]
We propose a dialogue-based 3D scene editing approach, termed CE3D.
Hash-Atlas represents 3D scene views as 2D atlas images, transferring the editing of 3D scenes onto the 2D atlases.
Results demonstrate that CE3D effectively integrates multiple visual models to achieve diverse editing visual effects.
arXiv Detail & Related papers (2024-07-09T13:24:42Z)
- Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis [88.17520303867099]
One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio.
We present Real3D-Portrait, a framework that improves the one-shot 3D reconstruction power with a large image-to-plane model.
Experiments show that Real3D-Portrait generalizes well to unseen identities and generates more realistic talking portrait videos.
arXiv Detail & Related papers (2024-01-16T17:04:30Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
- Audio-Driven 3D Facial Animation from In-the-Wild Videos [16.76533748243908]
Given an arbitrary audio clip, audio-driven 3D facial animation aims to generate lifelike lip motions and facial expressions for a 3D head.
Existing methods typically rely on training their models using limited public 3D datasets that contain a restricted number of audio-3D scan pairs.
We propose a novel method that leverages in-the-wild 2D talking-head videos to train our 3D facial animation model.
arXiv Detail & Related papers (2023-06-20T13:53:05Z)
- DaGAN++: Depth-Aware Generative Adversarial Network for Talking Head Video Generation [18.511092587156657]
We present a novel self-supervised method for learning dense 3D facial geometry from face videos.
We also propose a strategy to learn pixel-level uncertainties to perceive more reliable rigid-motion pixels for geometry learning.
We develop a 3D-aware cross-modal (i.e., appearance and depth) attention mechanism to capture facial geometries in a coarse-to-fine manner.
arXiv Detail & Related papers (2023-05-10T14:58:33Z)
- Text-To-4D Dynamic Scene Generation [111.89517759596345]
We present MAV3D (Make-A-Video3D), a method for generating three-dimensional dynamic scenes from text descriptions.
Our approach uses a 4D dynamic Neural Radiance Field (NeRF), which is optimized for scene appearance, density, and motion consistency.
The dynamic video output generated from the provided text can be viewed from any camera location and angle, and can be composited into any 3D environment.
arXiv Detail & Related papers (2023-01-26T18:14:32Z)
- 3D-Aware Video Generation [149.5230191060692]
We explore 4D generative adversarial networks (GANs) that learn to generate 3D-aware videos.
By combining neural implicit representations with a time-aware discriminator, we develop a GAN framework that synthesizes 3D videos supervised only with monocular videos.
arXiv Detail & Related papers (2022-06-29T17:56:03Z)
- Learning Ego 3D Representation as Ray Tracing [42.400505280851114]
We present a novel end-to-end architecture for ego 3D representation learning from unconstrained camera views.
Inspired by the ray tracing principle, we design a polarized grid of "imaginary eyes" as the learnable ego 3D representation.
We show that our model significantly outperforms all state-of-the-art alternatives.
arXiv Detail & Related papers (2022-06-08T17:55:50Z)
- Video Autoencoder: self-supervised disentanglement of static 3D structure and motion [60.58836145375273]
A video autoencoder is proposed for learning disentangled representations of 3D structure and camera pose from videos.
The representation can be applied to a range of tasks, including novel view synthesis, camera pose estimation, and video generation by motion following.
arXiv Detail & Related papers (2021-10-06T17:57:42Z)
- Unsupervised object-centric video generation and decomposition in 3D [36.08064849807464]
We propose to model a video as the view seen while moving through a scene with multiple 3D objects and a 3D background.
Our model is trained from monocular videos without any supervision, yet learns to generate coherent 3D scenes containing several moving objects.
arXiv Detail & Related papers (2020-07-07T18:01:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.