Related papers: Coherent3D: Coherent 3D Portrait Video Reconstruction via Triplane Fusion

URL: http://arxiv.org/abs/2412.08684v1
Date: Wed, 11 Dec 2024 18:57:24 GMT
Title: Coherent3D: Coherent 3D Portrait Video Reconstruction via Triplane Fusion
Authors: Shengze Wang, Xueting Li, Chao Liu, Matthew Chan, Michael Stengel, Henry Fuchs, Shalini De Mello, Koki Nagano
Abstract summary: Single-image 3D portrait reconstruction has enabled telepresence systems to stream 3D portrait videos from a single camera in real-time. However, per-frame 3D reconstruction exhibits temporal inconsistency and forgets the user's appearance. We propose a new fusion-based method that takes the best of both worlds by fusing a canonical 3D prior from a reference view with dynamic appearance from per-frame input views.
Score: 22.185551913099598
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent breakthroughs in single-image 3D portrait reconstruction have enabled telepresence systems to stream 3D portrait videos from a single camera in real-time, democratizing telepresence. However, per-frame 3D reconstruction exhibits temporal inconsistency and forgets the user's appearance. On the other hand, self-reenactment methods can render coherent 3D portraits by driving a 3D avatar built from a single reference image, but fail to faithfully preserve the user's per-frame appearance (e.g., instantaneous facial expression and lighting). As a result, neither of these two frameworks is an ideal solution for democratized 3D telepresence. In this work, we address this dilemma and propose a novel solution that maintains both coherent identity and dynamic per-frame appearance to enable the best possible realism. To this end, we propose a new fusion-based method that takes the best of both worlds by fusing a canonical 3D prior from a reference view with dynamic appearance from per-frame input views, producing temporally stable 3D videos with faithful reconstruction of the user's per-frame appearance. Trained only using synthetic data produced by an expression-conditioned 3D GAN, our encoder-based method achieves both state-of-the-art 3D reconstruction and temporal consistency on in-studio and in-the-wild datasets. https://research.nvidia.com/labs/amri/projects/coherent3d

Related papers

Bridging Diffusion Models and 3D Representations: A 3D Consistent Super-Resolution Framework [53.251525710625096]
3D Super Resolution (3DSR) is a novel 3D Gaussian-splatting-based super-resolution framework. We evaluate 3DSR on MipNeRF360 and LLFF data.
arXiv Detail & Related papers (2025-08-06T05:12:02Z)
VoluMe -- Authentic 3D Video Calls from Live Gaussian Splat Prediction [9.570954192915005]
We present the first method to predict 3D Gaussian reconstructions in real time from a single 2D webcam feed. By conditioning the 3D representation on each video frame independently, our reconstruction faithfully recreates the input video from the captured viewpoint. We show that our method delivers state-of-the-art accuracy in visual quality and stability metrics compared to existing methods.
arXiv Detail & Related papers (2025-07-28T20:07:55Z)
Bolt3D: Generating 3D Scenes in Seconds [77.592919825037]
Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of up to 300 times.
arXiv Detail & Related papers (2025-03-18T17:24:19Z)
Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars [52.439807298140394]
We present Avat3r, which regresses a high-quality and animatable 3D head avatar from just a few input images. We make Large Reconstruction Models animatable and learn a powerful prior over 3D human heads from a large multi-view video dataset. We increase robustness by feeding input images with different expressions to our model during training, enabling the reconstruction of 3D head avatars from inconsistent inputs.
arXiv Detail & Related papers (2025-02-27T16:00:11Z)
3D$^2$-Actor: Learning Pose-Conditioned 3D-Aware Denoiser for Realistic Gaussian Avatar Modeling [37.11454674584874]
We introduce 3D$^2$-Actor, a pose-conditioned 3D-aware human modeling pipeline that integrates 2D denoising and 3D rectifying steps. Experimental results demonstrate that 3D$^2$-Actor excels in high-fidelity avatar modeling and robustly generalizes to novel poses.
arXiv Detail & Related papers (2024-12-16T09:37:52Z)
ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model [16.14713604672497]
ReconX is a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction challenge as a temporal generation task. The proposed ReconX first constructs a global point cloud and encodes it into a contextual space as the 3D structure condition. Guided by the condition, the video diffusion model then synthesizes video frames that are both detail-preserved and exhibit a high degree of 3D consistency.
arXiv Detail & Related papers (2024-08-29T17:59:40Z)
Coherent 3D Portrait Video Reconstruction via Triplane Fusion [21.381482393260406]
Per-frame 3D reconstruction exhibits temporal inconsistency and forgets the user's appearance. We propose a new fusion-based method that fuses a personalized 3D subject prior with per-frame information. Our method achieves both state-of-the-art 3D reconstruction accuracy and temporal consistency on in-studio and in-the-wild datasets.
arXiv Detail & Related papers (2024-05-01T18:08:51Z)
Denoising Diffusion via Image-Based Rendering [54.20828696348574]
We introduce the first diffusion model able to perform fast, detailed reconstruction and generation of real-world 3D scenes. First, we introduce a new neural scene representation, IB-planes, that can efficiently and accurately represent large 3D scenes. Second, we propose a denoising-diffusion framework to learn a prior over this novel 3D scene representation, using only 2D images.
arXiv Detail & Related papers (2024-02-05T19:00:45Z)
Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis [88.17520303867099]
One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio. We present Real3D-Portrait, a framework that improves the one-shot 3D reconstruction power with a large image-to-plane model. Experiments show that Real3D-Portrait generalizes well to unseen identities and generates more realistic talking portrait videos.
arXiv Detail & Related papers (2024-01-16T17:04:30Z)
WildFusion: Learning 3D-Aware Latent Diffusion Models in View Space [77.92350895927922]
We propose WildFusion, a new approach to 3D-aware image synthesis based on latent diffusion models (LDMs). Our 3D-aware LDM is trained without any direct supervision from multiview images or 3D geometry. This opens up promising research avenues for scalable 3D-aware image synthesis and 3D content creation from in-the-wild image data.
arXiv Detail & Related papers (2023-11-22T18:25:51Z)
Appearance-Preserving 3D Convolution for Video-based Person Re-identification [61.677153482995564]
We propose Appearance-Preserving 3D Convolution (AP3D), which is composed of two components: an Appearance-Preserving Module (APM) and a 3D convolution kernel. It is easy to combine AP3D with existing 3D ConvNets by simply replacing the original 3D convolution kernels with AP3Ds.
arXiv Detail & Related papers (2020-07-16T16:21:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.