ObPose: Leveraging Pose for Object-Centric Scene Inference and
Generation in 3D
- URL: http://arxiv.org/abs/2206.03591v3
- Date: Fri, 9 Jun 2023 20:18:14 GMT
- Title: ObPose: Leveraging Pose for Object-Centric Scene Inference and
Generation in 3D
- Authors: Yizhe Wu, Oiwi Parker Jones, Ingmar Posner
- Abstract summary: ObPose is an unsupervised object-centric inference and generation model.
It learns 3D-structured latent representations from RGB-D scenes.
ObPose is evaluated quantitatively on the YCB, MultiShapeNet, and CLEVR datasets.
- Score: 21.700203922407496
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present ObPose, an unsupervised object-centric inference and generation
model which learns 3D-structured latent representations from RGB-D scenes.
Inspired by prior art in 2D representation learning, ObPose considers a
factorised latent space, separately encoding object location (where) and
appearance (what). ObPose further leverages an object's pose (i.e. location and
orientation), defined via a minimum volume principle, as a novel inductive bias
for learning the where component. To achieve this, we propose an efficient,
voxelised approximation approach to recover the object shape directly from a
neural radiance field (NeRF). As a consequence, ObPose models each scene as a
composition of NeRFs, richly representing individual objects. To evaluate the
quality of the learned representations, ObPose is evaluated quantitatively on
the YCB, MultiShapeNet, and CLEVR datasets for unsupervised scene
segmentation, outperforming the current state-of-the-art in 3D scene inference
(ObSuRF) by a significant margin. Generative results provide qualitative
demonstration that the same ObPose model can both generate novel scenes and
flexibly edit the objects in them. These capacities again reflect the quality
of the learned latents and the benefits of disentangling the where and what
components of a scene. Key design choices made in the ObPose encoder are
validated with ablations.
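To make the where component more concrete, the sketch below shows one way the voxelised shape recovery and minimum-volume pose described above could look in code. It is a minimal illustration and not ObPose's implementation: the function names, grid resolution, density threshold, and the use of a PCA-aligned bounding box as a stand-in for a true minimum-volume box are all assumptions.

import numpy as np

def voxelised_occupancy(density_fn, centre, extent, resolution=32, threshold=0.5):
    # Query the NeRF density on a regular grid inside an axis-aligned region
    # and threshold it into a coarse set of occupied voxel centres.
    axes = [np.linspace(c - e / 2.0, c + e / 2.0, resolution)
            for c, e in zip(centre, extent)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)  # (R, R, R, 3)
    points = grid.reshape(-1, 3)
    sigma = density_fn(points)              # assumed: (N, 3) points -> (N,) densities
    step = extent[0] / resolution           # crude fixed ray-step size
    alpha = 1.0 - np.exp(-sigma * step)     # density -> per-voxel opacity
    return points[alpha > threshold]        # occupied voxel centres

def minimum_volume_pose(occupied):
    # Fit an oriented bounding box to the occupied voxels; its centre and axes
    # act as the object's pose. PCA is a simple stand-in for a true
    # minimum-volume box search.
    centre = occupied.mean(axis=0)
    centred = occupied - centre
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    rotation = vt.T                         # columns are candidate box axes
    if np.linalg.det(rotation) < 0:
        rotation[:, -1] *= -1               # keep a right-handed frame
    extents = np.ptp(centred @ rotation, axis=0)  # side lengths along each axis
    return rotation, centre, extents

In ObPose such shape and pose estimates supply the where component, while per-object NeRFs carry the what component; the sketch above only conveys the flavour of the voxelised approximation.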
Related papers
- LocaliseBot: Multi-view 3D object localisation with differentiable
rendering for robot grasping [9.690844449175948]
We focus on object pose estimation.
Our approach relies on three pieces of information: multiple views of the object, the camera's parameters at those viewpoints, and 3D CAD models of objects.
We show that the estimated object pose results in 99.65% grasp accuracy with the ground truth grasp candidates.
arXiv Detail & Related papers (2023-11-14T14:27:53Z)
- Anything-3D: Towards Single-view Anything Reconstruction in the Wild [61.090129285205805]
We introduce Anything-3D, a methodical framework that ingeniously combines a series of visual-language models and the Segment-Anything object segmentation model.
Our approach employs a BLIP model to generate textual descriptions, utilizes the Segment-Anything model for the effective extraction of objects of interest, and leverages a text-to-image diffusion model to lift the object into a neural radiance field.
arXiv Detail & Related papers (2023-04-19T16:39:51Z)
- MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare [84.80956484848505]
MegaPose is a method to estimate the 6D pose of novel objects, that is, objects unseen during training.
First, we present a 6D pose refiner based on a render&compare strategy which can be applied to novel objects (see the render-and-compare sketch after this list).
Second, we introduce a novel approach for coarse pose estimation which leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner.
arXiv Detail & Related papers (2022-12-13T19:30:03Z)
- Neural Groundplans: Persistent Neural Scene Representations from a Single Image [90.04272671464238]
We present a method to map 2D image observations of a scene to a persistent 3D scene representation.
We propose conditional neural groundplans as persistent and memory-efficient scene representations.
arXiv Detail & Related papers (2022-07-22T17:41:24Z)
- Towards High-Fidelity Single-view Holistic Reconstruction of Indoor Scenes [50.317223783035075]
We present a new framework to reconstruct holistic 3D indoor scenes from single-view images.
We propose an instance-aligned implicit function (InstPIFu) for detailed object reconstruction.
Our code and model will be made publicly available.
arXiv Detail & Related papers (2022-07-18T14:54:57Z)
- 3DP3: 3D Scene Perception via Probabilistic Programming [28.491817202574932]
3DP3 is a framework for inverse graphics that uses inference in a structured generative model of objects, scenes, and images.
Our results demonstrate that 3DP3 is more accurate at 6DoF object pose estimation from real images than deep learning baselines.
arXiv Detail & Related papers (2021-10-30T19:10:34Z)
- Object Wake-up: 3-D Object Reconstruction, Animation, and in-situ Rendering from a Single Image [58.69732754597448]
Given a picture of a chair, could we extract the 3-D shape of the chair, animate its plausible articulations and motions, and render it in-situ in its original image space?
We devise an automated approach to extract and manipulate articulated objects in single images.
arXiv Detail & Related papers (2021-08-05T16:20:12Z)
- DSC-PoseNet: Learning 6DoF Object Pose Estimation via Dual-scale Consistency [43.09728251735362]
We present a two-step pose estimation framework to attain 6DoF object poses from 2D object bounding-boxes.
In the first step, the framework learns to segment objects from real and synthetic data.
In the second step, we design a dual-scale pose estimation network, namely DSC-PoseNet.
Our method outperforms state-of-the-art models trained on synthetic data by a large margin.
arXiv Detail & Related papers (2021-04-08T10:19:35Z)
- Weakly Supervised Learning of Multi-Object 3D Scene Decompositions Using Deep Shape Priors [69.02332607843569]
PriSMONet is a novel approach for learning Multi-Object 3D scene decomposition and representations from single images.
A recurrent encoder regresses a latent representation of 3D shape, pose and texture of each object from an input RGB image.
We evaluate the accuracy of our model in inferring 3D scene layout, demonstrate its generative capabilities, assess its generalization to real images, and point out benefits of the learned representation.
arXiv Detail & Related papers (2020-10-08T14:49:23Z)
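For the render&compare idea referenced in the MegaPose entry above, the sketch below shows a generic, search-based version of the loop. MegaPose itself uses learned coarse and refiner networks, so the random perturbations, the negative-MSE score, and the render_at_pose callable here are purely illustrative placeholders.

import numpy as np

def render_and_compare(observed, render_at_pose, initial_pose,
                       n_iters=10, n_candidates=32, step=0.05, seed=0):
    # Generic render-and-compare refinement: perturb the current pose,
    # render each candidate, and keep the candidate whose rendering best
    # matches the observed image (scored here by negative MSE).
    rng = np.random.default_rng(seed)
    pose = np.asarray(initial_pose, dtype=float)   # e.g. [x, y, z, rx, ry, rz]
    best = -np.mean((render_at_pose(pose) - observed) ** 2)
    for _ in range(n_iters):
        candidates = pose + rng.normal(scale=step, size=(n_candidates, pose.size))
        for cand in candidates:
            score = -np.mean((render_at_pose(cand) - observed) ** 2)
            if score > best:                       # higher (less error) is better
                best, pose = score, cand
    return pose

A learned refiner replaces this brute-force search in practice, but the render/score/update loop captures the core idea of matching a rendering to the observation.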
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.