3D-IntPhys: Towards More Generalized 3D-grounded Visual Intuitive
Physics under Challenging Scenes
- URL: http://arxiv.org/abs/2304.11470v1
- Date: Sat, 22 Apr 2023 19:28:49 GMT
- Title: 3D-IntPhys: Towards More Generalized 3D-grounded Visual Intuitive
Physics under Challenging Scenes
- Authors: Haotian Xue, Antonio Torralba, Joshua B. Tenenbaum, Daniel LK Yamins,
Yunzhu Li, Hsiao-Yu Tung
- Abstract summary: We present a framework capable of learning 3D-grounded visual intuitive physics models from videos of complex scenes with fluids.
We show our model can make long-horizon future predictions by learning from raw images and significantly outperforms models that do not employ an explicit 3D representation space.
- Score: 68.66237114509264
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given a visual scene, humans have strong intuitions about how a scene can
evolve over time under given actions. The intuition, often termed visual
intuitive physics, is a critical ability that allows us to make effective plans
to manipulate the scene to achieve desired outcomes without relying on
extensive trial and error. In this paper, we present a framework capable of
learning 3D-grounded visual intuitive physics models from videos of complex
scenes with fluids. Our method is composed of a conditional Neural Radiance
Field (NeRF)-style visual frontend and a 3D point-based dynamics prediction
backend, with which we can impose strong relational and structural inductive
biases to capture the structure of the underlying environment. Unlike existing
point-based intuitive physics works that rely on supervision from dense
point trajectories produced by simulators, we relax the requirements and only assume
access to multi-view RGB images and (imperfect) instance masks acquired using
a color prior. This enables the proposed model to handle scenarios where accurate
point estimation and tracking are hard or impossible. We generate datasets
including three challenging scenarios involving fluid, granular materials, and
rigid objects in simulation. The datasets do not include any dense particle
information, so most previous 3D-based intuitive physics pipelines struggle to
handle them. We show our model can make long-horizon future predictions by
learning from raw images and significantly outperforms models that do not
employ an explicit 3D representation space. We also show that once trained, our
model can achieve strong generalization in complex scenarios under extrapolation
settings.
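To make the two-stage design concrete, below is a minimal, hypothetical sketch (in PyTorch) of the kind of pipeline the abstract describes: a conditional NeRF-style frontend that turns image conditioning into a 3D point set via a learned density field, followed by a point-based graph dynamics model rolled out autoregressively for long-horizon prediction. All class and function names (ConditionalNeRFFrontend, extract_particles, PointDynamicsGNN, rollout) and hyperparameters are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of the two-stage pipeline described above: a NeRF-style
# visual frontend that yields a 3D point set, and a graph-based point dynamics
# model rolled out over time. Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class ConditionalNeRFFrontend(nn.Module):
    """Maps 3D query points plus image conditioning to density (occupancy) logits."""

    def __init__(self, feat_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # per-point density logit
        )

    def forward(self, query_xyz: torch.Tensor, scene_feat: torch.Tensor) -> torch.Tensor:
        # query_xyz: (N, 3) world-space points; scene_feat: (feat_dim,) image conditioning
        feat = scene_feat.expand(query_xyz.shape[0], -1)
        return self.mlp(torch.cat([query_xyz, feat], dim=-1)).squeeze(-1)


def extract_particles(frontend, scene_feat, grid, density_thresh=0.5):
    """Keep grid points whose predicted density exceeds a threshold -> particle set."""
    with torch.no_grad():
        density = torch.sigmoid(frontend(grid, scene_feat))
    return grid[density > density_thresh]


class PointDynamicsGNN(nn.Module):
    """Message passing over a k-nearest-neighbour particle graph; predicts per-particle motion."""

    def __init__(self, hidden: int = 128, k: int = 8):
        super().__init__()
        self.k = k
        self.edge_mlp = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.node_mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) particle positions; returns predicted positions at the next step
        dist = torch.cdist(xyz, xyz)                                # (N, N) pairwise distances
        knn = dist.topk(self.k + 1, largest=False).indices[:, 1:]   # k nearest neighbours, drop self
        rel = xyz[knn] - xyz[:, None, :]                            # (N, k, 3) relative offsets
        msg = self.edge_mlp(rel).sum(dim=1)                         # aggregate neighbour messages
        return xyz + self.node_mlp(msg)                             # residual position update


def rollout(dynamics, xyz, steps=50):
    """Long-horizon prediction by feeding predictions back in autoregressively."""
    traj = [xyz]
    for _ in range(steps):
        traj.append(dynamics(traj[-1]))
    return torch.stack(traj)  # (steps + 1, N, 3)

In the paper, the frontend is trained from multi-view RGB with NeRF-style volume rendering and the dynamics model operates on the reconstructed point set; the sketch only illustrates how such a frontend-plus-dynamics interface can be composed for autoregressive rollout.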
Related papers
- Latent Intuitive Physics: Learning to Transfer Hidden Physics from A 3D Video [58.043569985784806]
We introduce latent intuitive physics, a transfer learning framework for physics simulation.
It can infer hidden properties of fluids from a single 3D video and simulate the observed fluid in novel scenes.
We validate our model in three ways: (i) novel scene simulation with the learned visual-world physics, (ii) future prediction of the observed fluid dynamics, and (iii) supervised particle simulation.
arXiv Detail & Related papers (2024-06-18T16:37:44Z) - Learning 3D Particle-based Simulators from RGB-D Videos [15.683877597215494]
We propose a method for learning simulators directly from observations.
Visual Particle Dynamics (VPD) jointly learns a latent particle-based representation of 3D scenes.
Unlike existing 2D video prediction models, VPD's 3D structure enables scene editing and long-term predictions.
arXiv Detail & Related papers (2023-12-08T20:45:34Z) - 3D Neural Embedding Likelihood: Probabilistic Inverse Graphics for
Robust 6D Pose Estimation [50.15926681475939]
Inverse graphics aims to infer the 3D scene structure from 2D images.
We introduce probabilistic modeling to quantify uncertainty and achieve robustness in 6D pose estimation tasks.
3DNEL effectively combines learned neural embeddings from RGB with depth information to improve robustness in sim-to-real 6D object pose estimation from RGB-D images.
arXiv Detail & Related papers (2023-02-07T20:48:35Z) - Learning Multi-Object Dynamics with Compositional Neural Radiance Fields [63.424469458529906]
We present a method to learn compositional predictive models from image observations based on implicit object encoders, Neural Radiance Fields (NeRFs), and graph neural networks.
NeRFs have become a popular choice for representing scenes due to their strong 3D prior.
For planning, we utilize RRTs in the learned latent space, where we can exploit our model and the implicit object encoder to make sampling the latent space informative and more efficient.
arXiv Detail & Related papers (2022-02-24T01:31:29Z) - 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z) - 3D-OES: Viewpoint-Invariant Object-Factorized Environment Simulators [24.181604511269096]
We propose an action-conditioned dynamics model that predicts scene changes caused by object and agent interactions in a viewpoint-invariant 3D neural scene representation space.
In this space, objects do not interfere with one another and their appearance persists over time and across viewpoints.
We show our model generalizes its predictions well across varying numbers and appearances of interacting objects, as well as across camera viewpoints.
arXiv Detail & Related papers (2020-11-12T16:15:52Z) - Occlusion resistant learning of intuitive physics from videos [52.25308231683798]
A key ability for artificial systems is to understand physical interactions between objects and to predict the future outcome of a situation.
This ability, often referred to as intuitive physics, has recently received attention, and several methods have been proposed to learn these physical rules from video sequences.
arXiv Detail & Related papers (2020-04-30T19:35:54Z)