Portrait4D: Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data
- URL: http://arxiv.org/abs/2311.18729v2
- Date: Mon, 3 Jun 2024 08:16:22 GMT
- Authors: Yu Deng, Duomin Wang, Xiaohang Ren, Xingyu Chen, Baoyuan Wang
- Abstract summary: We present a method to learn one-shot 4D head synthesis via large-scale synthetic data.
A novel learning strategy is enforced to enhance the generalizability to real images by disentangling the learning process of 3D reconstruction and reenactment.
- Score: 27.109881339132258
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing one-shot 4D head synthesis methods usually learn from monocular videos with the aid of 3DMM reconstruction, yet the latter is equally challenging, which keeps them from achieving reasonable 4D head synthesis. We present a method to learn one-shot 4D head synthesis via large-scale synthetic data. The key is to first learn a part-wise 4D generative model from monocular images via adversarial learning, to synthesize multi-view images of diverse identities and full motions as training data; then leverage a transformer-based animatable triplane reconstructor to learn 4D head reconstruction using the synthetic data. A novel learning strategy is enforced to enhance the generalizability to real images by disentangling the learning process of 3D reconstruction and reenactment. Experiments demonstrate our superiority over the prior art.
Related papers
- ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors [51.06020148149403]
We introduce ArtHOI, the first zero-shot framework for articulated human-object interaction synthesis via 4D reconstruction from video priors. ArtHOI bridges video-based generation and geometry-aware reconstruction, producing interactions that are both semantically aligned and physically grounded.
arXiv Detail & Related papers (2026-03-04T17:58:04Z)
- Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image [88.71287865590273]
We introduce TrajScene-60K, a large-scale dataset of 60,000 video samples with dense point trajectories. We propose a diffusion-based 4D Scene Trajectory Generator (4D-STraG) to jointly generate geometrically consistent and motion-plausible 4D trajectories. We then propose a 4D View Synthesis Module (4D-Vi) to render videos with arbitrary camera trajectories from 4D point track representations.
arXiv Detail & Related papers (2025-12-04T17:59:10Z)
- Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models [79.06910348413861]
We introduce Diff4Splat, a feed-forward method that synthesizes controllable and explicit 4D scenes from a single image. Given a single input image, a camera trajectory, and an optional text prompt, Diff4Splat directly predicts a deformable 3D Gaussian field that encodes appearance, geometry, and motion.
arXiv Detail & Related papers (2025-11-01T11:16:25Z)
- Advances in Feed-Forward 3D Reconstruction and View Synthesis: A Survey [154.50661618628433]
3D reconstruction and view synthesis are foundational problems in computer vision, graphics, and immersive technologies such as augmented reality (AR), virtual reality (VR), and digital twins. Recent advances in feed-forward approaches, driven by deep learning, have revolutionized this field by enabling fast and generalizable 3D reconstruction and view synthesis.
arXiv Detail & Related papers (2025-07-19T06:13:25Z)
- Synthetic Prior for Few-Shot Drivable Head Avatar Inversion [61.51887011274453]
We present SynShot, a novel method for the few-shot inversion of a drivable head avatar based on a synthetic prior.
Inspired by machine learning models trained solely on synthetic data, we propose a method that learns a prior model from a large dataset of synthetic heads.
arXiv Detail & Related papers (2025-01-12T19:01:05Z)
- FaceLift: Single Image to 3D Head with View Generation and GS-LRM [54.24070918942727]
FaceLift is a feed-forward approach for rapid, high-quality, 360-degree head reconstruction from a single image.
We show that FaceLift outperforms state-of-the-art methods in 3D head reconstruction, highlighting its practical applicability and robust performance on real-world images.
arXiv Detail & Related papers (2024-12-23T18:59:49Z)
- Trans4D: Realistic Geometry-Aware Transition for Compositional Text-to-4D Synthesis [60.853577108780414]
Existing 4D generation methods can generate high-quality 4D objects or scenes based on user-friendly conditions.
We propose Trans4D, a novel text-to-4D synthesis framework that enables realistic complex scene transitions.
In experiments, Trans4D consistently outperforms existing state-of-the-art methods in generating 4D scenes with accurate and high-quality transitions.
arXiv Detail & Related papers (2024-10-09T17:56:03Z)
- Portrait4D-v2: Pseudo Multi-View Data Creates Better 4D Head Synthesizer [13.969883154405995]
We propose a novel learning approach for feed-forward one-shot 4D head avatar synthesis.
We employ pseudo multi-view videos to learn a 4D head synthesizer in a data-driven manner.
arXiv Detail & Related papers (2024-03-20T13:09:54Z)
- Denoising Diffusion via Image-Based Rendering [54.20828696348574]
We introduce the first diffusion model able to perform fast, detailed reconstruction and generation of real-world 3D scenes.
First, we introduce a new neural scene representation, IB-planes, that can efficiently and accurately represent large 3D scenes.
Second, we propose a denoising-diffusion framework to learn a prior over this novel 3D scene representation, using only 2D images.
arXiv Detail & Related papers (2024-02-05T19:00:45Z)
- Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models [94.07744207257653]
We focus on the underexplored text-to-4D setting and synthesize dynamic, animated 3D objects.
We combine text-to-image, text-to-video, and 3D-aware multiview diffusion models to provide feedback during 4D object optimization.
arXiv Detail & Related papers (2023-12-21T11:41:02Z)
- 4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling [91.99172731031206]
Current text-to-4D methods face a three-way tradeoff between quality of scene appearance, 3D structure, and motion.
We introduce hybrid score distillation sampling, an alternating optimization procedure that blends supervision signals from multiple pre-trained diffusion models.
arXiv Detail & Related papers (2023-11-29T18:58:05Z)
- H4D: Human 4D Modeling by Learning Neural Compositional Representation [75.34798886466311]
This work presents a novel framework that can effectively learn a compact and compositional representation for dynamic humans.
A simple yet effective linear motion model is proposed to provide a rough and regularized motion estimation.
Experiments demonstrate that our method is not only effective in recovering dynamic humans with accurate motion and detailed geometry, but also amenable to various 4D human-related tasks.
arXiv Detail & Related papers (2022-03-02T17:10:49Z)
- Beyond Flatland: Pre-training with a Strong 3D Inductive Bias [5.577231009305908]
Kataoka et al., 2020 introduced a technique to eliminate the need for natural images in supervised deep learning.
We take inspiration from their work and build on this idea using 3D procedural object renders.
Similar to the previous work, our training corpus will be fully synthetic and derived from simple procedural strategies.
arXiv Detail & Related papers (2021-11-30T21:30:24Z)
- Learning Compositional Representation for 4D Captures with Neural ODE [72.56606274691033]
We introduce a compositional representation for 4D captures that disentangles shape, initial state, and motion.
To model the motion, a neural Ordinary Differential Equation (ODE) is trained to update the initial state conditioned on the learned motion code.
A decoder takes the shape code and the updated pose code to reconstruct 4D captures at each time stamp.
arXiv Detail & Related papers (2021-03-15T10:55:55Z)
- Learning to Generate Customized Dynamic 3D Facial Expressions [47.5220752079009]
We study 3D image-to-video translation with a particular focus on 4D facial expressions.
We employ a deep mesh-decoder-like architecture to synthesize realistic high-resolution facial expressions.
We trained our model using a high resolution dataset with 4D scans of six facial expressions from 180 subjects.
arXiv Detail & Related papers (2020-07-19T22:38:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.