GenFusion: Closing the Loop between Reconstruction and Generation via Videos
- URL: http://arxiv.org/abs/2503.21219v2
- Date: Sat, 29 Mar 2025 12:18:02 GMT
- Title: GenFusion: Closing the Loop between Reconstruction and Generation via Videos
- Authors: Sibo Wu, Congrong Xu, Binbin Huang, Andreas Geiger, Anpei Chen,
- Abstract summary: We propose a reconstruction-driven video diffusion model that learns to condition video frames on artifact-prone RGB-D renderings. We also propose a cyclical fusion pipeline that iteratively adds restoration frames from the generative model to the training set.
- Score: 24.195304481751602
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, 3D reconstruction and generation have demonstrated impressive novel view synthesis results, achieving high fidelity and efficiency. However, a notable conditioning gap can be observed between these two fields, e.g., scalable 3D scene reconstruction often requires densely captured views, whereas 3D generation typically relies on a single or no input view, which significantly limits their applications. We found that the source of this phenomenon lies in the misalignment between 3D constraints and generative priors. To address this problem, we propose a reconstruction-driven video diffusion model that learns to condition video frames on artifact-prone RGB-D renderings. Moreover, we propose a cyclical fusion pipeline that iteratively adds restoration frames from the generative model to the training set, enabling progressive expansion and addressing the viewpoint saturation limitations seen in previous reconstruction and generation pipelines. Our evaluation, including view synthesis from sparse views and masked inputs, validates the effectiveness of our approach. More details at https://genfusion.sibowu.com.
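The cyclical fusion pipeline from the abstract — fit a reconstruction, render artifact-prone RGB-D views beyond the captured viewpoints, restore them with the video diffusion model, and fold the restored frames back into the training set — can be sketched as below. This is a minimal illustration of the loop only; all function names (`fit_reconstruction`, `sample_expansion_views`, `render_rgbd`, `restore_frames`) are hypothetical placeholders, not GenFusion's actual API.

```python
def cyclical_fusion(train_frames, num_rounds, render_rgbd, restore_frames,
                    fit_reconstruction, sample_expansion_views):
    """Iteratively expand the training set with diffusion-restored frames.

    All callables are supplied by the caller:
      fit_reconstruction(frames)      -> scene model (e.g. a 3DGS scene)
      sample_expansion_views(model)   -> novel viewpoints beyond current coverage
      render_rgbd(model, view)        -> artifact-prone RGB-D rendering
      restore_frames(renders)         -> diffusion-restored frames, conditioned
                                         on the artifact-prone renderings
    """
    frames = list(train_frames)
    for _ in range(num_rounds):
        model = fit_reconstruction(frames)
        views = sample_expansion_views(model)
        renders = [render_rgbd(model, v) for v in views]
        restored = restore_frames(renders)
        frames.extend(restored)  # close the loop: restored frames become training data
    return frames
```

The key design point the abstract emphasizes is that each round conditions generation on the reconstruction's own (imperfect) renderings, so the generative prior stays aligned with the 3D constraints as the viewpoint coverage grows.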
Related papers
- FlowR: Flowing from Sparse to Dense 3D Reconstructions [60.6368083163258]
We propose a flow matching model that learns a flow to connect novel view renderings to renderings that we expect from dense reconstructions.
Our model is trained on a novel dataset of 3.6M image pairs and can process up to 45 views at 540x960 resolution (91K tokens) on one H100 GPU in a single forward pass.
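FlowR's stated idea — learning a flow that transports sparse-view renderings toward the renderings expected from dense reconstructions — follows the generic conditional flow-matching recipe. Below is a minimal sketch of one training target under a linear interpolation path; it illustrates the general technique only, and the names and the linear-path choice are assumptions, not FlowR's implementation.

```python
import numpy as np

def flow_matching_target(x_sparse, x_dense, rng):
    """One conditional flow-matching training target on a linear path.

    x_sparse : rendering from the sparse reconstruction (flow source)
    x_dense  : rendering expected from a dense reconstruction (flow target)
    Returns (t, x_t, v_target): the network is trained to predict v_target
    (the constant velocity x_dense - x_sparse) given x_t and time t.
    """
    t = rng.uniform()                          # random time in [0, 1)
    x_t = (1.0 - t) * x_sparse + t * x_dense   # point on the interpolation path
    v_target = x_dense - x_sparse              # velocity of the linear path
    return t, x_t, v_target
```

At inference, integrating the learned velocity field from a sparse-view rendering would move it toward a dense-reconstruction-quality rendering.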
arXiv Detail & Related papers (2025-04-02T11:57:01Z) - Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models [65.90387371072413]
We introduce Difix3D+, a novel pipeline designed to enhance 3D reconstruction and novel-view synthesis.
At the core of our approach is Difix, a single-step image diffusion model trained to enhance and remove artifacts in rendered novel views.
arXiv Detail & Related papers (2025-03-03T17:58:33Z) - Wonderland: Navigating 3D Scenes from a Single Image [43.99037613068823]
We introduce a large-scale reconstruction model that leverages latents from a video diffusion model to predict 3D Gaussian Splattings of scenes in a feed-forward manner.
We train the 3D reconstruction model to operate on the video latent space with a progressive learning strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes.
arXiv Detail & Related papers (2024-12-16T18:58:17Z) - Pragmatist: Multiview Conditional Diffusion Models for High-Fidelity 3D Reconstruction from Unposed Sparse Views [23.94629999419033]
Inferring 3D structures from sparse, unposed observations is challenging due to its unconstrained nature. Recent methods propose to predict implicit representations directly from unposed inputs in a data-driven manner, achieving promising results. We propose conditional novel view synthesis, aiming to generate complete observations from limited input views to facilitate reconstruction.
arXiv Detail & Related papers (2024-12-11T14:30:24Z) - 3DGS-Enhancer: Enhancing Unbounded 3D Gaussian Splatting with View-consistent 2D Diffusion Priors [13.191199172286508]
Novel-view synthesis aims to generate novel views of a scene from multiple input images or videos.
3DGS-Enhancer is a novel pipeline for enhancing the rendering quality of 3D Gaussian splatting (3DGS) representations.
arXiv Detail & Related papers (2024-10-21T17:59:09Z) - ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model [16.14713604672497]
ReconX is a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction challenge as a temporal generation task.
The proposed ReconX first constructs a global point cloud and encodes it into a contextual space as the 3D structure condition.
Guided by the condition, the video diffusion model then synthesizes video frames that are both detail-preserved and exhibit a high degree of 3D consistency.
arXiv Detail & Related papers (2024-08-29T17:59:40Z) - MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation [54.27399121779011]
We present MVD-Fusion: a method for single-view 3D inference via generative modeling of multi-view-consistent RGB-D images.
We show that our approach can yield more accurate synthesis compared to recent state-of-the-art, including distillation-based 3D inference and prior multi-view generation methods.
arXiv Detail & Related papers (2024-04-04T17:59:57Z) - SGD: Street View Synthesis with Gaussian Splatting and Diffusion Prior [53.52396082006044]
Current methods struggle to maintain rendering quality at viewpoints that deviate significantly from the training viewpoints.
This issue stems from the sparse training views captured by a fixed camera on a moving vehicle.
We propose a novel approach that enhances the capacity of 3DGS by leveraging prior from a Diffusion Model.
arXiv Detail & Related papers (2024-03-29T09:20:29Z) - Denoising Diffusion via Image-Based Rendering [54.20828696348574]
We introduce the first diffusion model able to perform fast, detailed reconstruction and generation of real-world 3D scenes.
First, we introduce a new neural scene representation, IB-planes, that can efficiently and accurately represent large 3D scenes.
Second, we propose a denoising-diffusion framework to learn a prior over this novel 3D scene representation, using only 2D images.
arXiv Detail & Related papers (2024-02-05T19:00:45Z) - UNeR3D: Versatile and Scalable 3D RGB Point Cloud Generation from 2D Images in Unsupervised Reconstruction [2.7848140839111903]
UNeR3D sets a new standard for generating detailed 3D reconstructions solely from 2D views.
Our model significantly cuts down the training costs tied to supervised approaches.
UNeR3D ensures seamless color transitions, enhancing visual fidelity.
arXiv Detail & Related papers (2023-12-10T15:18:55Z) - ReconFusion: 3D Reconstruction with Diffusion Priors [104.73604630145847]
We present ReconFusion to reconstruct real-world scenes using only a few photos.
Our approach leverages a diffusion prior for novel view synthesis, trained on synthetic and multiview datasets.
Our method synthesizes realistic geometry and texture in underconstrained regions while preserving the appearance of observed regions.
arXiv Detail & Related papers (2023-12-05T18:59:58Z) - High-fidelity 3D GAN Inversion by Pseudo-multi-view Optimization [51.878078860524795]
We present a high-fidelity 3D generative adversarial network (GAN) inversion framework that can synthesize photo-realistic novel views.
Our approach enables high-fidelity 3D rendering from a single image, which is promising for various applications of AI-generated 3D content.
arXiv Detail & Related papers (2022-11-28T18:59:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.