3DP3: 3D Scene Perception via Probabilistic Programming
- URL: http://arxiv.org/abs/2111.00312v1
- Date: Sat, 30 Oct 2021 19:10:34 GMT
- Title: 3DP3: 3D Scene Perception via Probabilistic Programming
- Authors: Nishad Gothoskar, Marco Cusumano-Towner, Ben Zinberg, Matin
Ghavamizadeh, Falk Pollok, Austin Garrett, Joshua B. Tenenbaum, Dan
Gutfreund, Vikash K. Mansinghka
- Abstract summary: 3DP3 is a framework for inverse graphics that uses inference in a structured generative model of objects, scenes, and images.
Our results demonstrate that 3DP3 is more accurate at 6DoF object pose estimation from real images than deep learning baselines.
- Score: 28.491817202574932
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present 3DP3, a framework for inverse graphics that uses inference in a
structured generative model of objects, scenes, and images. 3DP3 uses (i) voxel
models to represent the 3D shape of objects, (ii) hierarchical scene graphs to
decompose scenes into objects and the contacts between them, and (iii) depth
image likelihoods based on real-time graphics. Given an observed RGB-D image,
3DP3's inference algorithm infers the underlying latent 3D scene, including the
object poses and a parsimonious joint parametrization of these poses, using
fast bottom-up pose proposals, novel involutive MCMC updates of the scene graph
structure, and, optionally, neural object detectors and pose estimators. We
show that 3DP3 enables scene understanding that is aware of 3D shape,
occlusion, and contact structure. Our results demonstrate that 3DP3 is more
accurate at 6DoF object pose estimation from real images than deep learning
baselines and shows better generalization to challenging scenes with novel
viewpoints, contact, and partial observability.
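
To make the recipe in the abstract concrete, the sketch below is a minimal, self-contained Python analogue of "inference in a structured generative model": a prior over object poses with an optional table-contact constraint (a tiny stand-in for the hierarchical scene graph), a toy depth renderer with a per-pixel Gaussian likelihood, and a random-walk Metropolis loop over poses. This is not the authors' implementation; 3DP3 couples a probabilistic programming language with voxel shape models, real-time graphics, bottom-up pose proposals, and involutive MCMC over scene-graph structure, none of which this toy reproduces. All function and parameter names here are illustrative.

```python
# Schematic sketch only: a structured scene prior, a stand-in depth renderer,
# and Metropolis-Hastings inference over object poses. Not the 3DP3 codebase.
import numpy as np

rng = np.random.default_rng(0)

def sample_scene(num_objects=2):
    """Prior: each object either rests on the table (z fixed by the contact)
    or floats freely; positions are sampled uniformly over a small workspace."""
    scene = []
    for _ in range(num_objects):
        on_table = rng.random() < 0.5          # simplest possible scene-graph edge
        x, y = rng.uniform(-0.5, 0.5, size=2)
        z = 0.0 if on_table else rng.uniform(0.0, 0.5)
        scene.append({"xyz": np.array([x, y, z]), "on_table": on_table})
    return scene

def render_depth(scene, size=32):
    """Stand-in renderer: splat each object as a small blob in a depth image.
    A real system would rasterize voxel shape models with a graphics engine."""
    depth = np.full((size, size), 2.0)
    for obj in scene:
        u = int(np.clip((obj["xyz"][0] + 0.5) * (size - 1), 0, size - 1))
        v = int(np.clip((obj["xyz"][1] + 0.5) * (size - 1), 0, size - 1))
        depth[max(v - 2, 0):v + 3, max(u - 2, 0):u + 3] = 1.0 + obj["xyz"][2]
    return depth

def log_likelihood(observed, scene, noise=0.05):
    """Per-pixel Gaussian depth-image likelihood."""
    diff = observed - render_depth(scene)
    return -0.5 * np.sum((diff / noise) ** 2)

def mcmc(observed, steps=500):
    """Random-walk Metropolis over object positions; structure moves such as
    the paper's involutive scene-graph updates are omitted in this sketch."""
    scene = sample_scene()
    ll = log_likelihood(observed, scene)
    for _ in range(steps):
        proposal = [dict(o, xyz=o["xyz"] + rng.normal(0, 0.02, 3)) for o in scene]
        for o in proposal:
            if o["on_table"]:
                o["xyz"][2] = 0.0              # contact constraint ties z to the table
        ll_new = log_likelihood(observed, proposal)
        if np.log(rng.random()) < ll_new - ll:
            scene, ll = proposal, ll_new
    return scene

observed = render_depth(sample_scene())        # synthetic "observed" depth image
print(mcmc(observed)[0]["xyz"])
```

The `on_table` flag plays the role of a contact edge: when it is set, the object's height is determined by the contact rather than sampled freely, which is the kind of parsimonious joint parametrization of poses the abstract refers to.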
Related papers
- Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-labeling [9.440800948514449]
We propose a weakly-supervised 3D scene graph generation method via Visual-Linguistic Assisted Pseudo-labeling.
Our 3D-VLAP exploits the superior ability of current large-scale visual-linguistic models to align the semantics between texts and 2D images.
We design an edge self-attention based graph neural network to generate scene graphs of 3D point cloud scenes.
arXiv Detail & Related papers (2024-04-03T07:30:09Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- SceneWiz3D: Towards Text-guided 3D Scene Composition [134.71933134180782]
Existing approaches either leverage large text-to-image models to optimize a 3D representation or train 3D generators on object-centric datasets.
We introduce SceneWiz3D, a novel approach to synthesize high-fidelity 3D scenes from text.
arXiv Detail & Related papers (2023-12-13T18:59:30Z)
- 3D Neural Embedding Likelihood: Probabilistic Inverse Graphics for Robust 6D Pose Estimation [50.15926681475939]
Inverse graphics aims to infer the 3D scene structure from 2D images.
We introduce probabilistic modeling to quantify uncertainty and achieve robustness in 6D pose estimation tasks.
3DNEL effectively combines learned neural embeddings from RGB with depth information to improve robustness in sim-to-real 6D object pose estimation from RGB-D images.
arXiv Detail & Related papers (2023-02-07T20:48:35Z)
- Neural Groundplans: Persistent Neural Scene Representations from a Single Image [90.04272671464238]
We present a method to map 2D image observations of a scene to a persistent 3D scene representation.
We propose conditional neural groundplans as persistent and memory-efficient scene representations.
arXiv Detail & Related papers (2022-07-22T17:41:24Z)
- CoCoNets: Continuous Contrastive 3D Scene Representations [21.906643302668716]
This paper explores self-supervised learning of amodal 3D feature representations from RGB and RGB-D posed images and videos.
We show the resulting 3D visual feature representations effectively scale across objects and scenes, imagine information occluded or missing from the input viewpoints, track objects over time, align semantically related objects in 3D, and improve 3D object detection.
arXiv Detail & Related papers (2021-04-08T15:50:47Z)
- Weakly Supervised Learning of Multi-Object 3D Scene Decompositions Using Deep Shape Priors [69.02332607843569]
PriSMONet is a novel approach for learning Multi-Object 3D scene decomposition and representations from single images.
A recurrent encoder regresses a latent representation of 3D shape, pose and texture of each object from an input RGB image.
We evaluate the accuracy of our model in inferring 3D scene layout, demonstrate its generative capabilities, assess its generalization to real images, and point out benefits of the learned representation.
arXiv Detail & Related papers (2020-10-08T14:49:23Z)
- Equivariant Neural Rendering [22.95150913645939]
We propose a framework for learning neural scene representations directly from images, without 3D supervision.
Our key insight is that 3D structure can be imposed by ensuring that the learned representation transforms like a real 3D scene.
Our formulation allows us to infer and render scenes in real time while achieving comparable results to models requiring minutes for inference.
arXiv Detail & Related papers (2020-06-13T12:25:07Z)
- Learning 3D Semantic Scene Graphs from 3D Indoor Reconstructions [94.17683799712397]
We focus on scene graphs, a data structure that organizes the entities of a scene in a graph.
We propose a learned method that regresses a scene graph from the point cloud of a scene.
We show the application of our method in a domain-agnostic retrieval task, where graphs serve as an intermediate representation for 3D-3D and 2D-3D matching.
arXiv Detail & Related papers (2020-04-08T12:25:25Z)
- Pix2Shape: Towards Unsupervised Learning of 3D Scenes from Images using a View-based Representation [20.788952043643906]
Pix2Shape generates 3D scenes from a single input image without supervision.
We show that Pix2Shape learns a consistent scene representation in its encoded latent space.
We evaluate Pix2Shape with experiments on the ShapeNet dataset.
arXiv Detail & Related papers (2020-03-23T03:01:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.