Neural Groundplans: Persistent Neural Scene Representations from a
Single Image
- URL: http://arxiv.org/abs/2207.11232v2
- Date: Mon, 10 Apr 2023 00:49:55 GMT
- Title: Neural Groundplans: Persistent Neural Scene Representations from a
Single Image
- Authors: Prafull Sharma, Ayush Tewari, Yilun Du, Sergey Zakharov, Rares Ambrus,
Adrien Gaidon, William T. Freeman, Fredo Durand, Joshua B. Tenenbaum, Vincent
Sitzmann
- Abstract summary: We present a method to map 2D image observations of a scene to a persistent 3D scene representation.
We propose conditional neural groundplans as persistent and memory-efficient scene representations.
- Score: 90.04272671464238
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a method to map 2D image observations of a scene to a persistent
3D scene representation, enabling novel view synthesis and disentangled
representation of the movable and immovable components of the scene. Motivated
by the bird's-eye-view (BEV) representation commonly used in vision and
robotics, we propose conditional neural groundplans, ground-aligned 2D feature
grids, as persistent and memory-efficient scene representations. Our method is
trained self-supervised from unlabeled multi-view observations using
differentiable rendering, and learns to complete geometry and appearance of
occluded regions. In addition, we show that we can leverage multi-view videos
at training time to learn to separately reconstruct static and movable
components of the scene from a single image at test time. The ability to
separately reconstruct movable objects enables a variety of downstream tasks
using simple heuristics, such as extraction of object-centric 3D
representations, novel view synthesis, instance-level segmentation, 3D bounding
box prediction, and scene editing. This highlights the value of neural
groundplans as a backbone for efficient 3D scene understanding models.
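To make the groundplan idea above concrete, below is a minimal sketch (not the authors' code) of how a ground-aligned 2D feature grid can be queried with 3D sample points for volume rendering. It assumes a PyTorch-style implementation; the class name GroundplanField, the grid resolution, feature dimension, and MLP shape are illustrative assumptions, and in the actual method the grid would be predicted from the input image by an encoder rather than stored as a learnable parameter.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundplanField(nn.Module):
    """Minimal sketch: decode density and color from a ground-aligned 2D feature grid."""

    def __init__(self, feat_dim=64, grid_res=128, extent=10.0):
        super().__init__()
        # Persistent scene representation: a 2D feature grid aligned with the
        # ground plane (x-z), covering [-extent, extent] on each axis.
        # (In the paper this grid is predicted from the input image; here it is
        # a learnable parameter purely for illustration.)
        self.grid = nn.Parameter(torch.zeros(1, feat_dim, grid_res, grid_res))
        self.extent = extent
        # Small MLP decodes a sampled groundplan feature plus the height y of
        # the query point into a density and an RGB color.
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 1, 128), nn.ReLU(),
            nn.Linear(128, 4),  # (density, r, g, b)
        )

    def forward(self, xyz):
        # xyz: (N, 3) world-space sample points along camera rays.
        x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
        # Project each point onto the ground plane and normalize to [-1, 1]
        # so the grid can be sampled bilinearly.
        uv = torch.stack([x, z], dim=-1) / self.extent          # (N, 2)
        feats = F.grid_sample(self.grid, uv.view(1, -1, 1, 2),
                              align_corners=True)               # (1, C, N, 1)
        feats = feats.view(self.grid.shape[1], -1).t()          # (N, C)
        out = self.mlp(torch.cat([feats, y.unsqueeze(-1)], dim=-1))
        density = F.softplus(out[:, :1])     # non-negative volume density
        color = torch.sigmoid(out[:, 1:])    # RGB in [0, 1]
        return density, color
```
In such a sketch, the per-point densities and colors would then be composited along each camera ray with standard volume rendering to produce pixels, which is what allows the representation to be trained self-supervised from multi-view observations via differentiable rendering.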
Related papers
- DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features [65.8738034806085]
DistillNeRF is a self-supervised learning framework for understanding 3D environments in autonomous driving scenes.
Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs.
arXiv Detail & Related papers (2024-06-17T21:15:13Z)
- SceneDreamer: Unbounded 3D Scene Generation from 2D Image Collections [49.802462165826554]
We present SceneDreamer, an unconditional generative model for unbounded 3D scenes.
Our framework is learned from in-the-wild 2D image collections only, without any 3D annotations.
arXiv Detail & Related papers (2023-02-02T18:59:16Z)
- Panoptic Lifting for 3D Scene Understanding with Neural Fields [32.59498558663363]
We propose a novel approach for learning panoptic 3D representations from images of in-the-wild scenes.
Our method requires only machine-generated 2D panoptic segmentation masks inferred from a pre-trained network.
Experimental results validate our approach on the challenging Hypersim, Replica, and ScanNet datasets.
arXiv Detail & Related papers (2022-12-19T19:15:36Z)
- Neural Feature Fusion Fields: 3D Distillation of Self-Supervised 2D Image Representations [92.88108411154255]
We present a method that improves dense 2D image feature extractors when they are applied to the analysis of multiple images that can be reconstructed as a 3D scene.
We show that our method not only enables semantic understanding in the context of scene-specific neural fields without the use of manual labels, but also consistently improves over the self-supervised 2D baselines.
arXiv Detail & Related papers (2022-09-07T23:24:09Z)
- Vision Transformer for NeRF-Based View Synthesis from a Single Input Image [49.956005709863355]
We propose to leverage both the global and local features to form an expressive 3D representation.
To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering.
Our method can render novel views from only a single input image and generalize across multiple object categories using a single model.
arXiv Detail & Related papers (2022-07-12T17:52:04Z)
- STaR: Self-supervised Tracking and Reconstruction of Rigid Objects in Motion with Neural Rendering [9.600908665766465]
We present STaR, a novel method that performs Self-supervised Tracking and Reconstruction of dynamic scenes with rigid motion from multi-view RGB videos without any manual annotation.
We show that our method can render photorealistic novel views, where novelty is measured on both spatial and temporal axes.
arXiv Detail & Related papers (2020-12-22T23:45:28Z)
- Weakly Supervised Learning of Multi-Object 3D Scene Decompositions Using Deep Shape Priors [69.02332607843569]
PriSMONet is a novel approach for learning Multi-Object 3D scene decomposition and representations from single images.
A recurrent encoder regresses a latent representation of 3D shape, pose and texture of each object from an input RGB image.
We evaluate the accuracy of our model in inferring 3D scene layout, demonstrate its generative capabilities, assess its generalization to real images, and point out benefits of the learned representation.
arXiv Detail & Related papers (2020-10-08T14:49:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.