One-Shot Neural Fields for 3D Object Understanding
- URL: http://arxiv.org/abs/2210.12126v3
- Date: Wed, 9 Aug 2023 00:30:08 GMT
- Title: One-Shot Neural Fields for 3D Object Understanding
- Authors: Valts Blukis, Taeyeop Lee, Jonathan Tremblay, Bowen Wen, In So Kweon,
Kuk-Jin Yoon, Dieter Fox, Stan Birchfield
- Abstract summary: We present a unified and compact scene representation for robotics.
Each object in the scene is depicted by a latent code capturing geometry and appearance.
This representation can be decoded for various tasks such as novel view rendering, 3D reconstruction, and stable grasp prediction.
- Score: 112.32255680399399
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a unified and compact scene representation for robotics, where
each object in the scene is depicted by a latent code capturing geometry and
appearance. This representation can be decoded for various tasks such as novel
view rendering, 3D reconstruction (e.g. recovering depth, point clouds, or
voxel maps), collision checking, and stable grasp prediction. We build our
representation from a single RGB input image at test time by leveraging recent
advances in Neural Radiance Fields (NeRF) that learn category-level priors on
large multiview datasets, then fine-tune on novel objects from one or few
views. We expand the NeRF model for additional grasp outputs and explore ways
to leverage this representation for robotics. At test-time, we build the
representation from a single RGB input image observing the scene from only one
viewpoint. We find that the recovered representation allows rendering from
novel views, including occluded object parts, as well as predicting successful
stable grasps. Grasp poses can be directly decoded from our latent
representation with an implicit grasp decoder. We experimented in both
simulation and the real world, demonstrating robust robotic grasping with this
compact representation. Website:
https://nerfgrasp.github.io
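The abstract describes decoding both appearance (color and density) and grasp outputs from a per-object latent code. The following is a minimal PyTorch-style sketch of that idea only; the module names, layer sizes, and the per-point grasp head are illustrative assumptions, not the architecture from the paper.

```python
import torch
import torch.nn as nn

class LatentObjectField(nn.Module):
    """Toy NeRF-style field conditioned on a per-object latent code.

    Hypothetical sketch: the layer sizes and the extra grasp head are
    assumptions for illustration, not the authors' actual model.
    """

    def __init__(self, latent_dim: int = 256, hidden: int = 128):
        super().__init__()
        # Shared trunk: 3D query point + latent code -> hidden features.
        self.trunk = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Standard NeRF outputs: volume density and RGB color.
        self.density_head = nn.Linear(hidden, 1)
        self.color_head = nn.Linear(hidden, 3)
        # Extra grasp output, e.g. a per-point grasp-success score that a
        # downstream planner could decode into grasp poses.
        self.grasp_head = nn.Linear(hidden, 1)

    def forward(self, points: torch.Tensor, latent: torch.Tensor):
        # points: (N, 3) query locations; latent: (latent_dim,) object code.
        z = latent.expand(points.shape[0], -1)
        h = self.trunk(torch.cat([points, z], dim=-1))
        density = torch.relu(self.density_head(h))        # (N, 1)
        color = torch.sigmoid(self.color_head(h))         # (N, 3)
        grasp_score = torch.sigmoid(self.grasp_head(h))   # (N, 1)
        return density, color, grasp_score

# Usage: decode one object's latent code at sampled 3D points.
field = LatentObjectField()
latent_code = torch.randn(256)            # would come from priors + fine-tuning
query_points = torch.rand(1024, 3) * 2 - 1
sigma, rgb, grasp = field(query_points, latent_code)
```

In this reading, the latent code is the compact scene representation; all task outputs (rendering, reconstruction, grasping) are obtained by querying decoders conditioned on it rather than by storing explicit geometry.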
Related papers
- DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features [65.8738034806085]
DistillNeRF is a self-supervised learning framework for understanding 3D environments in autonomous driving scenes.
Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs.
arXiv Detail & Related papers (2024-06-17T21:15:13Z)
- Neural Groundplans: Persistent Neural Scene Representations from a Single Image [90.04272671464238]
We present a method to map 2D image observations of a scene to a persistent 3D scene representation.
We propose conditional neural groundplans as persistent and memory-efficient scene representations.
arXiv Detail & Related papers (2022-07-22T17:41:24Z)
- Vision Transformer for NeRF-Based View Synthesis from a Single Input Image [49.956005709863355]
We propose to leverage both the global and local features to form an expressive 3D representation.
To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering (a generic volume-rendering sketch appears after this list).
Our method can render novel views from only a single input image and generalize across multiple object categories using a single model.
arXiv Detail & Related papers (2022-07-12T17:52:04Z)
- Decomposing 3D Scenes into Objects via Unsupervised Volume Segmentation [26.868351498722884]
We present ObSuRF, a method which turns a single image of a scene into a 3D model represented as a set of Neural Radiance Fields (NeRFs).
We make learning more computationally efficient by deriving a novel loss, which allows training NeRFs on RGB-D inputs without explicit ray marching.
arXiv Detail & Related papers (2021-04-02T16:59:29Z)
- Neural Body: Implicit Neural Representations with Structured Latent Codes for Novel View Synthesis of Dynamic Humans [56.63912568777483]
This paper addresses the challenge of novel view synthesis for a human performer from a very sparse set of camera views.
We propose Neural Body, a new human body representation which assumes that the learned neural representations at different frames share the same set of latent codes anchored to a deformable mesh.
Experiments on ZJU-MoCap show that our approach outperforms prior works by a large margin in terms of novel view synthesis quality.
arXiv Detail & Related papers (2020-12-31T18:55:38Z)
- pixelNeRF: Neural Radiance Fields from One or Few Images [20.607712035278315]
pixelNeRF is a learning framework that predicts a continuous neural scene representation conditioned on one or few input images.
We conduct experiments on ShapeNet benchmarks for single image novel view synthesis tasks with held-out objects.
In all cases, pixelNeRF outperforms current state-of-the-art baselines for novel view synthesis and single image 3D reconstruction.
arXiv Detail & Related papers (2020-12-03T18:59:54Z)
- Weakly Supervised Learning of Multi-Object 3D Scene Decompositions Using Deep Shape Priors [69.02332607843569]
PriSMONet is a novel approach for learning Multi-Object 3D scene decomposition and representations from single images.
A recurrent encoder regresses a latent representation of 3D shape, pose and texture of each object from an input RGB image.
We evaluate the accuracy of our model in inferring 3D scene layout, demonstrate its generative capabilities, assess its generalization to real images, and point out benefits of the learned representation.
arXiv Detail & Related papers (2020-10-08T14:49:23Z)
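Several of the NeRF-based entries above (including the main paper, Vision Transformer for NeRF-Based View Synthesis, and pixelNeRF) synthesize novel views by volume-rendering a learned field along camera rays. Below is a minimal NumPy sketch of the standard discrete volume-rendering step only; the toy density profile and sampling scheme are assumptions for illustration, not taken from any of these papers.

```python
import numpy as np

def volume_render(densities, colors, t_vals):
    """Standard discrete volume rendering along a single ray.

    densities: (S,)   non-negative sigma at S samples along the ray
    colors:    (S, 3) RGB predicted at the same samples
    t_vals:    (S,)   increasing sample depths along the ray
    Returns the composited RGB value for that ray.
    """
    # Distances between consecutive samples (last interval padded).
    deltas = np.append(np.diff(t_vals), 1e10)
    # Per-sample opacity and accumulated transmittance.
    alphas = 1.0 - np.exp(-densities * deltas)
    transmittance = np.cumprod(np.append(1.0, 1.0 - alphas[:-1] + 1e-10))
    # Compositing weights: w_i = T_i * alpha_i.
    weights = transmittance * alphas
    return (weights[:, None] * colors).sum(axis=0)

# Toy usage with a hypothetical field queried at samples along one ray.
t_vals = np.linspace(0.0, 2.0, 64)
densities = np.exp(-((t_vals - 1.0) ** 2) / 0.01)     # thin "surface" at t = 1
colors = np.tile(np.array([0.8, 0.3, 0.2]), (64, 1))  # constant color
pixel_rgb = volume_render(densities, colors, t_vals)
```

In methods of this family, the per-sample densities and colors come from an MLP conditioned on image features or a latent code, and the same compositing step produces each pixel of a novel view.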