O$^2$-Recon: Completing 3D Reconstruction of Occluded Objects in the Scene with a Pre-trained 2D Diffusion Model
- URL: http://arxiv.org/abs/2308.09591v3
- Date: Tue, 19 Mar 2024 06:37:32 GMT
- Title: O$^2$-Recon: Completing 3D Reconstruction of Occluded Objects in the Scene with a Pre-trained 2D Diffusion Model
- Authors: Yubin Hu, Sheng Ye, Wang Zhao, Matthieu Lin, Yuze He, Yu-Hui Wen, Ying He, Yong-Jin Liu
- Abstract summary: Occlusion is a common issue in 3D reconstruction from RGB-D videos, often blocking the complete reconstruction of objects.
We propose a novel framework, empowered by a 2D diffusion-based in-painting model, to reconstruct complete surfaces for the hidden parts of objects.
- Score: 28.372289119872764
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Occlusion is a common issue in 3D reconstruction from RGB-D videos, often blocking the complete reconstruction of objects and presenting an ongoing problem. In this paper, we propose a novel framework, empowered by a 2D diffusion-based in-painting model, to reconstruct complete surfaces for the hidden parts of objects. Specifically, we utilize a pre-trained diffusion model to fill in the hidden areas of 2D images. Then we use these in-painted images to optimize a per-instance neural implicit surface representation for 3D reconstruction. Since creating the in-painting masks needed for this process is tricky, we adopt a human-in-the-loop strategy that involves very little human engagement to generate high-quality masks. Moreover, some parts of objects can be totally hidden because the videos are usually shot from limited perspectives. To ensure these invisible areas are recovered, we develop a cascaded network architecture for predicting the signed distance field, making use of different frequency bands of positional encoding and maintaining overall smoothness. Besides the commonly used rendering loss, Eikonal loss, and silhouette loss, we adopt a CLIP-based semantic consistency loss to guide the surface from unseen camera angles. Experiments on ScanNet scenes show that our proposed framework achieves state-of-the-art accuracy and completeness in object-level reconstruction from scene-level RGB-D videos. Code: https://github.com/THU-LYJ-Lab/O2-Recon.
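Because the abstract enumerates its optimization terms explicitly, a compact sketch is possible. The PyTorch fragment below shows one plausible way to combine the rendering, silhouette, Eikonal, and CLIP-based consistency losses; the loss weights, tensor interfaces, and CLIP feature inputs are illustrative assumptions, not the authors' released implementation (see the repository linked above for that).

```python
import torch
import torch.nn.functional as F

def eikonal_loss(sdf_net, pts):
    """Standard Eikonal regularizer: encourage |grad SDF(x)| = 1."""
    pts = pts.detach().requires_grad_(True)
    sdf = sdf_net(pts)
    (grad,) = torch.autograd.grad(sdf.sum(), pts, create_graph=True)
    return ((grad.norm(dim=-1) - 1.0) ** 2).mean()

def total_loss(pred_rgb, inpainted_rgb, pred_mask, instance_mask,
               sdf_net, sample_pts, clip_render, clip_reference,
               w_sil=0.5, w_eik=0.1, w_clip=0.1):
    # Rendering loss against the diffusion-in-painted supervision views.
    l_render = F.l1_loss(pred_rgb, inpainted_rgb)
    # Silhouette loss against the per-instance mask.
    l_sil = F.binary_cross_entropy(
        pred_mask.clamp(1e-5, 1 - 1e-5), instance_mask)
    # Eikonal term keeps the SDF well behaved off the surface.
    l_eik = eikonal_loss(sdf_net, sample_pts)
    # CLIP-based semantic consistency: renders from unseen viewpoints
    # should stay semantically close to features of reference views.
    l_clip = 1.0 - F.cosine_similarity(
        clip_render, clip_reference, dim=-1).mean()
    return l_render + w_sil * l_sil + w_eik * l_eik + w_clip * l_clip
```

In the paper's setting the in-painted images would supply `inpainted_rgb`, while `clip_reference` would come from views where the object is well observed; both names here are placeholders.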
Related papers
- Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction [51.3632308129838]
We present Total-Decom, a novel method for decomposed 3D reconstruction with minimal human interaction.
Our approach seamlessly integrates the Segment Anything Model (SAM) with hybrid implicit-explicit neural surface representations and a mesh-based region-growing technique for accurate 3D object decomposition.
We extensively evaluate our method on benchmark datasets and demonstrate its potential for downstream applications, such as animation and scene editing.
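Total-Decom's "minimal interaction" hinges on SAM-generated object masks. As a rough illustration only, here is how a single-click instance mask can be obtained with the public `segment-anything` API; the checkpoint path and click point are placeholders, and the paper's hybrid surface representation and region growing are not shown.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a pre-trained SAM backbone (checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def object_mask(image_rgb: np.ndarray, click_xy) -> np.ndarray:
    """Return a binary mask for the object under a single user click."""
    predictor.set_image(image_rgb)          # HxWx3 uint8 RGB frame
    masks, scores, _ = predictor.predict(
        point_coords=np.array([click_xy]),  # one positive click
        point_labels=np.array([1]),         # 1 = foreground
    )
    return masks[int(scores.argmax())]      # keep the highest-scoring mask
```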
arXiv Detail & Related papers (2024-03-28T11:12:33Z)
- Denoising Diffusion via Image-Based Rendering [54.20828696348574]
We introduce the first diffusion model able to perform fast, detailed reconstruction and generation of real-world 3D scenes.
First, we introduce a new neural scene representation, IB-planes, that can efficiently and accurately represent large 3D scenes.
Second, we propose a denoising-diffusion framework to learn a prior over this novel 3D scene representation, using only 2D images.
arXiv Detail & Related papers (2024-02-05T19:00:45Z)
- In-Hand 3D Object Reconstruction from a Monocular RGB Video [17.31419675163019]
Our work aims to reconstruct a 3D object that is held and rotated by a hand in front of a static RGB camera.
Previous methods that use implicit neural representations to recover the geometry of a generic hand-held object from multi-view images achieved compelling results in the visible part of the object.
arXiv Detail & Related papers (2023-12-27T06:19:25Z)
- NeRFiller: Completing Scenes via Generative 3D Inpainting [113.18181179986172]
We propose NeRFiller, an approach that completes missing portions of a 3D capture via generative 3D inpainting.
In contrast to related works, we focus on completing scenes rather than deleting foreground objects.
arXiv Detail & Related papers (2023-12-07T18:59:41Z)
- Inpaint3D: 3D Scene Content Generation using 2D Inpainting Diffusion [18.67196713834323]
This paper presents a novel approach to inpainting 3D regions of a scene, given masked multi-view images, by distilling a 2D diffusion model into a learned 3D scene representation (e.g., a NeRF).
We show that this 2D diffusion model can still serve as a generative prior in a 3D multi-view reconstruction problem where we optimize a NeRF using a combination of score distillation sampling and NeRF reconstruction losses.
Because our method can generate content to fill any 3D masked region, we additionally demonstrate 3D object completion, 3D object replacement, and 3D scene completion.
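Inpaint3D's key ingredient, score distillation sampling, has a well-known gradient form (popularized by DreamFusion). The sketch below reproduces that generic shape; the UNet call signature, the weighting choice, and the rendered image all stand in for Inpaint3D's actual components.

```python
import torch

def sds_grad(unet, rendered, cond_emb, alphas_cumprod, t):
    """Score-distillation gradient for a rendered image (generic form).

    rendered: (B, C, H, W) render from the scene representation.
    t: (B,) diffusion timesteps; alphas_cumprod: 1D noise schedule.
    """
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(rendered)
    noisy = a_t.sqrt() * rendered + (1.0 - a_t).sqrt() * noise
    with torch.no_grad():                 # frozen diffusion prior
        eps_pred = unet(noisy, t, cond_emb)
    w = 1.0 - a_t                         # a common weighting; an assumption
    return w * (eps_pred - noise)

# Usage alongside NeRF reconstruction losses on observed (unmasked) pixels:
#   rendered.backward(gradient=sds_grad(unet, rendered, emb, acp, t))
```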
arXiv Detail & Related papers (2023-12-06T19:30:04Z)
- Self-Supervised Geometry-Aware Encoder for Style-Based 3D GAN Inversion [115.82306502822412]
StyleGAN has achieved great progress in 2D face reconstruction and semantic editing via image inversion and latent editing.
A corresponding generic 3D GAN inversion framework is still missing, limiting the applications of 3D face reconstruction and semantic editing.
We study the challenging problem of 3D GAN inversion where a latent code is predicted given a single face image to faithfully recover its 3D shapes and detailed textures.
arXiv Detail & Related papers (2022-12-14T18:49:50Z)
- An Effective Loss Function for Generating 3D Models from Single 2D Image without Rendering [0.0]
Differentiable rendering is a very successful technique for single-view 3D reconstruction.
Current methods optimise the parameters of the 3D shape with pixel-wise losses between a rendered image of the reconstructed object and ground-truth images from matched viewpoints.
We propose a novel effective loss function that evaluates how well the projections of reconstructed 3D point clouds cover the ground truth object's silhouette.
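One plausible reading of this coverage loss, with hypothetical names and an arbitrary 2-pixel kernel bandwidth: project the reconstructed point cloud through a pinhole camera, soft-splat the projections, and penalize ground-truth silhouette pixels that no projected point covers.

```python
import torch

def coverage_loss(points, K, gt_sil, sigma=2.0):
    """points: (N, 3) cloud in camera coords; K: (3, 3) intrinsics;
    gt_sil: (H, W) float silhouette in {0, 1}."""
    H, W = gt_sil.shape
    proj = (K @ points.T).T                               # (N, 3)
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)        # pixel coordinates
    ys, xs = torch.meshgrid(
        torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float()
    # Soft coverage: a pixel counts as covered if some projected point
    # lands nearby, under a squared-distance kernel.
    d2 = torch.cdist(pix, uv).pow(2)                      # (H*W, N)
    cover = torch.exp(-d2 / (2 * sigma**2)).max(dim=1).values.reshape(H, W)
    return ((1.0 - cover) * gt_sil).mean()                # uncovered silhouette
```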
arXiv Detail & Related papers (2021-03-05T00:02:18Z)
- Weakly Supervised Learning of Multi-Object 3D Scene Decompositions Using Deep Shape Priors [69.02332607843569]
PriSMONet is a novel approach for learning multi-object 3D scene decomposition and representation from single images.
A recurrent encoder regresses a latent representation of 3D shape, pose and texture of each object from an input RGB image.
We evaluate the accuracy of our model in inferring 3D scene layout, demonstrate its generative capabilities, assess its generalization to real images, and point out benefits of the learned representation.
arXiv Detail & Related papers (2020-10-08T14:49:23Z)
- AutoSweep: Recovering 3D Editable Objects from a Single Photograph [54.701098964773756]
We aim to recover 3D objects with semantic parts that can be directly edited.
Our work attempts to recover two types of primitive-shaped objects: generalized cuboids and generalized cylinders.
Our algorithm recovers high-quality 3D models and outperforms existing methods in both instance segmentation and 3D reconstruction.
arXiv Detail & Related papers (2020-05-27T12:16:24Z)
- CoReNet: Coherent 3D scene reconstruction from a single RGB image [43.74240268086773]
We build on advances in deep learning to reconstruct the shape of a single object given only one RGB image as input.
We propose three extensions: (1) ray-traced skip connections that propagate local 2D information to the output 3D volume in a physically correct manner; (2) a hybrid 3D volume representation that enables building translation equivariant models; and (3) a reconstruction loss tailored to capture overall object geometry.
We reconstruct all objects jointly in one pass, producing a coherent reconstruction, where all objects live in a single consistent 3D coordinate frame relative to the camera and they do not intersect in 3D space.
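Extension (1) is concrete enough to sketch: each output voxel fetches the 2D feature at its projected pixel, so local image evidence reaches the 3D volume along camera rays. The names and tensor layouts below are assumptions, not CoReNet's code.

```python
import torch
import torch.nn.functional as F

def ray_traced_skip(feat2d, voxel_centers, K):
    """feat2d: (B, C, H, W) 2D encoder features; voxel_centers: (B, N, 3)
    voxel centers in camera coordinates; K: (B, 3, 3) intrinsics."""
    proj = torch.einsum("bij,bnj->bni", K, voxel_centers)   # (B, N, 3)
    uv = proj[..., :2] / proj[..., 2:].clamp(min=1e-6)      # pixel coords
    H, W = feat2d.shape[-2:]
    # Normalize to [-1, 1] and bilinearly sample one feature per voxel.
    grid = torch.stack([2 * uv[..., 0] / (W - 1) - 1,
                        2 * uv[..., 1] / (H - 1) - 1], dim=-1)
    sampled = F.grid_sample(feat2d, grid.unsqueeze(2),
                            align_corners=True)             # (B, C, N, 1)
    return sampled.squeeze(-1)                              # per-voxel features
```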
arXiv Detail & Related papers (2020-04-27T17:53:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.