Learning Multi-Object Dynamics with Compositional Neural Radiance Fields
- URL: http://arxiv.org/abs/2202.11855v1
- Date: Thu, 24 Feb 2022 01:31:29 GMT
- Title: Learning Multi-Object Dynamics with Compositional Neural Radiance Fields
- Authors: Danny Driess, Zhiao Huang, Yunzhu Li, Russ Tedrake, Marc Toussaint
- Abstract summary: We present a method to learn compositional predictive models from image observations based on implicit object encoders, Neural Radiance Fields (NeRFs), and graph neural networks.
NeRFs have become a popular choice for representing scenes due to their strong 3D prior.
For planning, we utilize RRTs in the learned latent space, where we can exploit our model and the implicit object encoder to make sampling the latent space informative and more efficient.
- Score: 63.424469458529906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a method to learn compositional predictive models from image
observations based on implicit object encoders, Neural Radiance Fields (NeRFs),
and graph neural networks. A central question in learning dynamic models from
sensor observations is on which representations predictions should be
performed. NeRFs have become a popular choice for representing scenes due to
their strong 3D prior. However, most NeRF approaches are trained on a single
scene, representing the whole scene with a global model, making generalization
to novel scenes, containing different numbers of objects, challenging. Instead,
we present a compositional, object-centric auto-encoder framework that maps
multiple views of the scene to a \emph{set} of latent vectors representing each
object separately. The latent vectors parameterize individual NeRF models from
which the scene can be reconstructed and rendered from novel viewpoints. We
train a graph neural network dynamics model in the latent space to achieve
compositionality for dynamics prediction. A key feature of our approach is that
the learned 3D information of the scene through the NeRF model enables us to
incorporate structural priors in learning the dynamics models, making long-term
predictions more stable. The model can further be used to synthesize new scenes
from individual object observations. For planning, we utilize RRTs in the
learned latent space, where we can exploit our model and the implicit object
encoder to make sampling the latent space informative and more efficient. In
the experiments, we show that the model outperforms several baselines on a
pushing task containing many objects. Video:
https://dannydriess.github.io/compnerfdyn/
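As a rough illustration of the compositional latent dynamics described in the abstract, the sketch below rolls a set of per-object latent vectors forward with a fully connected graph message-passing step. This is not the authors' implementation; the module names (LatentGNNDynamics, edge_mlp, node_mlp), layer sizes, and the way the action is fed to every node are assumptions made only for illustration.

```python
# Minimal sketch (assumed architecture, not the paper's code): a set of
# per-object latent vectors is advanced one step by a GNN over the fully
# connected object graph, conditioned on an action vector.
import torch
import torch.nn as nn

LATENT_DIM, ACTION_DIM, HIDDEN = 64, 4, 128  # illustrative sizes

class LatentGNNDynamics(nn.Module):
    def __init__(self):
        super().__init__()
        # edge model: a message for every ordered pair of object latents
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * LATENT_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN))
        # node model: update each latent from aggregated messages and the action
        self.node_mlp = nn.Sequential(
            nn.Linear(LATENT_DIM + HIDDEN + ACTION_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, LATENT_DIM))

    def forward(self, z, action):
        # z: (N, LATENT_DIM) set of object latents; action: (ACTION_DIM,)
        n = z.shape[0]
        send = z.unsqueeze(1).expand(n, n, -1)   # sender latent z_i at [i, j]
        recv = z.unsqueeze(0).expand(n, n, -1)   # receiver latent z_j at [i, j]
        msgs = self.edge_mlp(torch.cat([send, recv], dim=-1)).sum(dim=0)  # aggregate per receiver
        act = action.unsqueeze(0).expand(n, -1)
        return z + self.node_mlp(torch.cat([z, msgs, act], dim=-1))  # residual update

# Example: roll three object latents forward 10 steps under a constant action.
model = LatentGNNDynamics()
z = torch.randn(3, LATENT_DIM)
for _ in range(10):
    z = model(z, torch.zeros(ACTION_DIM))
```

Because the prediction is a set-to-set map over object latents, the same model applies to scenes with different numbers of objects, which is the compositionality the abstract emphasizes.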
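The planning component can likewise be pictured as a standard RRT whose nodes are sets of object latents and whose steering function is the learned dynamics model; per the abstract, the implicit object encoder is used to make sampling the latent space informative. The helpers below (sample_latent, sample_action, is_goal) are hypothetical placeholders, not the paper's API.

```python
# Minimal sketch (assumed interface): an RRT over the learned latent space.
# sample_latent() is assumed to draw samples through the object encoder so
# that tree expansion stays on the manifold of plausible scenes.
import torch

def latent_rrt(z_start, sample_latent, dynamics, sample_action, is_goal,
               n_iters=1000):
    """Return a list of actions reaching the goal, or None if none is found."""
    tree = [(z_start, None, None)]  # (latent set, parent index, action)
    for _ in range(n_iters):
        z_rand = sample_latent()
        # nearest tree node under Euclidean distance between latent sets
        dists = [torch.norm(z - z_rand).item() for z, _, _ in tree]
        i_near = min(range(len(tree)), key=dists.__getitem__)
        a = sample_action()
        z_new = dynamics(tree[i_near][0], a)  # steer with the learned model
        tree.append((z_new, i_near, a))
        if is_goal(z_new):
            # backtrack the action sequence from the goal node to the root
            plan, i = [], len(tree) - 1
            while tree[i][1] is not None:
                plan.append(tree[i][2])
                i = tree[i][1]
            return list(reversed(plan))
    return None
```

Sampling latents through the encoder, rather than uniformly in the raw latent space, is one plausible reading of how the abstract's "informative and more efficient" sampling could be realized.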
Related papers
- DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features [65.8738034806085]
DistillNeRF is a self-supervised learning framework for understanding 3D environments in autonomous driving scenes.
Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs.
arXiv Detail & Related papers (2024-06-17T21:15:13Z)
- NSLF-OL: Online Learning of Neural Surface Light Fields alongside Real-time Incremental 3D Reconstruction [0.76146285961466]
The paper proposes a novel Neural Surface Light Fields model that copes with a small range of observed view directions while producing good results in unseen directions.
Our model learns Neural Surface Light Fields (NSLF) online alongside real-time 3D reconstruction, with a sequential data stream as the shared input.
In addition to online training, our model also provides real-time rendering after completing the data stream for visualization.
arXiv Detail & Related papers (2023-04-29T15:41:15Z)
- 3D-IntPhys: Towards More Generalized 3D-grounded Visual Intuitive Physics under Challenging Scenes [68.66237114509264]
We present a framework capable of learning 3D-grounded visual intuitive physics models from videos of complex scenes with fluids.
We show our model can make long-horizon future predictions by learning from raw images and significantly outperforms models that do not employ an explicit 3D representation space.
arXiv Detail & Related papers (2023-04-22T19:28:49Z)
- Neural Groundplans: Persistent Neural Scene Representations from a Single Image [90.04272671464238]
We present a method to map 2D image observations of a scene to a persistent 3D scene representation.
We propose conditional neural groundplans as persistent and memory-efficient scene representations.
arXiv Detail & Related papers (2022-07-22T17:41:24Z)
- Learning Dynamic View Synthesis With Few RGBD Cameras [60.36357774688289]
We propose to utilize RGBD cameras to synthesize free-viewpoint videos of dynamic indoor scenes.
We generate point clouds from RGBD frames and then render them into free-viewpoint videos via a neural feature.
We introduce a simple Regional Depth-Inpainting module that adaptively inpaints missing depth values to render complete novel views.
arXiv Detail & Related papers (2022-04-22T03:17:35Z)
- 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z)
- pixelNeRF: Neural Radiance Fields from One or Few Images [20.607712035278315]
pixelNeRF is a learning framework that predicts a continuous neural scene representation conditioned on one or few input images.
We conduct experiments on ShapeNet benchmarks for single image novel view synthesis tasks with held-out objects.
In all cases, pixelNeRF outperforms current state-of-the-art baselines for novel view synthesis and single image 3D reconstruction.
arXiv Detail & Related papers (2020-12-03T18:59:54Z)
- 3D-OES: Viewpoint-Invariant Object-Factorized Environment Simulators [24.181604511269096]
We propose an action-conditioned dynamics model that predicts scene changes caused by object and agent interactions in a viewpoint-invariant 3D neural scene representation space.
In this space, objects do not interfere with one another and their appearance persists over time and across viewpoints.
We show our model generalizes its predictions well across varying numbers and appearances of interacting objects, as well as across camera viewpoints.
arXiv Detail & Related papers (2020-11-12T16:15:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.