Pix2Shape: Towards Unsupervised Learning of 3D Scenes from Images using
a View-based Representation
- URL: http://arxiv.org/abs/2003.14166v2
- Date: Fri, 17 Apr 2020 13:22:58 GMT
- Title: Pix2Shape: Towards Unsupervised Learning of 3D Scenes from Images using
a View-based Representation
- Authors: Sai Rajeswar, Fahim Mannan, Florian Golemo, Jérôme Parent-Lévesque, David Vazquez, Derek Nowrouzezahrai, Aaron Courville
- Abstract summary: Pix2Shape generates 3D scenes from a single input image without supervision.
We show that Pix2Shape learns a consistent scene representation in its encoded latent space.
We evaluate Pix2Shape with experiments on the ShapeNet dataset.
- Score: 20.788952043643906
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We infer and generate three-dimensional (3D) scene information from a single
input image and without supervision. This problem is under-explored, with most
prior work relying on supervision from, e.g., 3D ground-truth, multiple images
of a scene, image silhouettes or key-points. We propose Pix2Shape, an approach
to solve this problem with four components: (i) an encoder that infers the
latent 3D representation from an image, (ii) a decoder that generates an
explicit 2.5D surfel-based reconstruction of a scene from the latent code, (iii)
a differentiable renderer that synthesizes a 2D image from the surfel
representation, and (iv) a critic network trained to discriminate between
images generated by the decoder-renderer and those from a training
distribution. Pix2Shape can generate complex 3D scenes that scale with the
view-dependent on-screen resolution, unlike representations that capture
world-space resolution, i.e., voxels or meshes. We show that Pix2Shape learns a
consistent scene representation in its encoded latent space and that the
decoder can then be applied to this latent representation in order to
synthesize the scene from a novel viewpoint. We evaluate Pix2Shape with
experiments on the ShapeNet dataset as well as on a novel benchmark we
developed, called 3D-IQTT, to evaluate models based on their ability to enable
3D spatial reasoning. Qualitative and quantitative evaluations demonstrate
Pix2Shape's ability to solve scene reconstruction, generation, and
understanding tasks.
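The four components listed in the abstract map naturally onto a small set of modules and an adversarial objective. Below is a minimal, hypothetical PyTorch sketch of such a pipeline; the layer sizes, the per-pixel (depth, normal, albedo) surfel parameterization, the Lambertian stand-in for the differentiable surfel renderer, and the non-saturating GAN loss are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (PyTorch assumed) of the four-component pipeline named in
# the abstract. Layer sizes, the per-pixel (depth, normal, albedo) surfel
# parameterization, and the Lambertian "renderer" are illustrative stand-ins,
# not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """(i) Infers a latent scene code from a single RGB image."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(256, latent_dim)

    def forward(self, image):
        return self.fc(self.conv(image).flatten(1))


class SurfelDecoder(nn.Module):
    """(ii) Decodes the latent code into a view-dependent 2.5D surfel map:
    per-pixel depth, normal, and albedo."""
    def __init__(self, latent_dim=256, res=64):
        super().__init__()
        self.res = res
        self.fc = nn.Linear(latent_dim, 7 * res * res)  # 1 depth + 3 normal + 3 albedo

    def forward(self, z):
        out = self.fc(z).view(-1, 7, self.res, self.res)
        depth = F.softplus(out[:, :1])
        normal = F.normalize(out[:, 1:4], dim=1)
        albedo = torch.sigmoid(out[:, 4:7])
        return depth, normal, albedo


def render(depth, normal, albedo, light_dir):
    """(iii) Schematic differentiable shading standing in for the surfel
    renderer: Lambertian shading only, ignoring visibility and depth."""
    shading = (normal * light_dir.view(1, 3, 1, 1)).sum(1, keepdim=True).clamp(min=0.0)
    return albedo * shading


class Critic(nn.Module):
    """(iv) Scores whether an image looks rendered or comes from the data."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 1),
        )

    def forward(self, image):
        return self.net(image)


if __name__ == "__main__":
    enc, dec, critic = Encoder(), SurfelDecoder(), Critic()
    real = torch.rand(2, 3, 64, 64)          # stand-in for training images
    light = torch.tensor([0.0, 0.0, 1.0])    # assumed head-on light direction
    fake = render(*dec(enc(real)), light)
    # Standard non-saturating GAN losses as a placeholder adversarial objective.
    ones, zeros = torch.ones(2, 1), torch.zeros(2, 1)
    d_loss = (F.binary_cross_entropy_with_logits(critic(real), ones)
              + F.binary_cross_entropy_with_logits(critic(fake.detach()), zeros))
    g_loss = F.binary_cross_entropy_with_logits(critic(fake), ones)
    print(fake.shape, d_loss.item(), g_loss.item())
```

Because the decoder in this sketch emits surfel attributes per on-screen pixel, the representation's detail grows with the rendered resolution rather than with a fixed world-space grid, which is the scaling property the abstract contrasts against voxels and meshes.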
Related papers
- Denoising Diffusion via Image-Based Rendering [54.20828696348574]
We introduce the first diffusion model able to perform fast, detailed reconstruction and generation of real-world 3D scenes.
First, we introduce a new neural scene representation, IB-planes, that can efficiently and accurately represent large 3D scenes.
Second, we propose a denoising-diffusion framework to learn a prior over this novel 3D scene representation, using only 2D images.
arXiv Detail & Related papers (2024-02-05T19:00:45Z)
- Blocks2World: Controlling Realistic Scenes with Editable Primitives [5.541644538483947]
We present Blocks2World, a novel method for 3D scene rendering and editing.
Our technique begins by extracting 3D parallelepipeds from various objects in a given scene using convex decomposition.
The next stage involves training a conditioned model that learns to generate images from the 2D-rendered convex primitives.
arXiv Detail & Related papers (2023-07-07T21:38:50Z)
- SceneDreamer: Unbounded 3D Scene Generation from 2D Image Collections [49.802462165826554]
We present SceneDreamer, an unconditional generative model for unbounded 3D scenes.
Our framework is learned from in-the-wild 2D image collections only, without any 3D annotations.
arXiv Detail & Related papers (2023-02-02T18:59:16Z)
- Panoptic Lifting for 3D Scene Understanding with Neural Fields [32.59498558663363]
We propose a novel approach for learning panoptic 3D representations from images of in-the-wild scenes.
Our method requires only machine-generated 2D panoptic segmentation masks inferred from a pre-trained network.
Experimental results validate our approach on the challenging Hypersim, Replica, and ScanNet datasets.
arXiv Detail & Related papers (2022-12-19T19:15:36Z)
- 3inGAN: Learning a 3D Generative Model from Images of a Self-similar Scene [34.2144933185175]
3inGAN is an unconditional 3D generative model trained from 2D images of a single self-similar 3D scene.
We show results on semi-stochastic scenes of varying scale and complexity, obtained from real and synthetic sources.
arXiv Detail & Related papers (2022-11-27T18:03:21Z)
- CompNVS: Novel View Synthesis with Scene Completion [83.19663671794596]
We propose a generative pipeline performing on a sparse grid-based neural scene representation to complete unobserved scene parts.
We process encoded image features in 3D space with a geometry completion network and a subsequent texture inpainting network to extrapolate the missing area.
Photorealistic image sequences can then be obtained via consistency-relevant differentiable rendering.
arXiv Detail & Related papers (2022-07-23T09:03:13Z)
- Neural Groundplans: Persistent Neural Scene Representations from a Single Image [90.04272671464238]
We present a method to map 2D image observations of a scene to a persistent 3D scene representation.
We propose conditional neural groundplans as persistent and memory-efficient scene representations.
arXiv Detail & Related papers (2022-07-22T17:41:24Z)
- DRaCoN -- Differentiable Rasterization Conditioned Neural Radiance Fields for Articulated Avatars [92.37436369781692]
We present DRaCoN, a framework for learning full-body volumetric avatars.
It exploits the advantages of both the 2D and 3D neural rendering techniques.
Experiments on the challenging ZJU-MoCap and Human3.6M datasets indicate that DRaCoN outperforms state-of-the-art methods.
arXiv Detail & Related papers (2022-03-29T17:59:15Z)
- 3DP3: 3D Scene Perception via Probabilistic Programming [28.491817202574932]
3DP3 is a framework for inverse graphics that uses inference in a structured generative model of objects, scenes, and images.
Our results demonstrate that 3DP3 is more accurate at 6DoF object pose estimation from real images than deep learning baselines.
arXiv Detail & Related papers (2021-10-30T19:10:34Z)
- Weakly Supervised Learning of Multi-Object 3D Scene Decompositions Using Deep Shape Priors [69.02332607843569]
PriSMONet is a novel approach for learning Multi-Object 3D scene decomposition and representations from single images.
A recurrent encoder regresses a latent representation of 3D shape, pose and texture of each object from an input RGB image.
We evaluate the accuracy of our model in inferring 3D scene layout, demonstrate its generative capabilities, assess its generalization to real images, and point out benefits of the learned representation.
arXiv Detail & Related papers (2020-10-08T14:49:23Z)