Towards an Interpretable Latent Space in Structured Models for Video
Prediction
- URL: http://arxiv.org/abs/2107.07713v1
- Date: Fri, 16 Jul 2021 05:37:16 GMT
- Title: Towards an Interpretable Latent Space in Structured Models for Video
Prediction
- Authors: Rushil Gupta, Vishal Sharma, Yash Jain, Yitao Liang, Guy Van den
Broeck and Parag Singla
- Abstract summary: We focus on the task of future frame prediction in video governed by underlying physical dynamics.
We work with models which are object-centric, i.e., explicitly work with object representations, and propagate a loss in the latent space.
- Score: 30.080907495461876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We focus on the task of future frame prediction in video governed by
underlying physical dynamics. We work with models which are object-centric,
i.e., explicitly work with object representations, and propagate a loss in the
latent space. Specifically, our research builds on recent work by Kipf et al.
\cite{kipf&al20}, which predicts the next state via contrastive learning of
object interactions in a latent space using a Graph Neural Network. We argue
that injecting explicit inductive bias into the model, in the form of general
physical laws, can help not only make the model more interpretable, but also
improve its overall predictions. As a natural by-product, our model can
learn feature maps which closely resemble actual object positions in the image,
without having any explicit supervision about the object positions at the
training time. In comparison with earlier works \cite{jaques&al20}, which
assume complete knowledge of the dynamics governing the motion in the form of
a physics engine, we rely only on the knowledge of general physical laws, such
as that the world consists of objects, which have position and velocity. We
propose an additional decoder-based loss in the pixel space, imposed in a
curriculum manner, to further refine the latent-space predictions. Experiments in multiple
settings demonstrate that while the model of Kipf et al. is effective at
capturing object interactions, our model can be significantly more effective at
localising objects, resulting in improved performance in 3 out of 4 domains
that we experiment with. Additionally, our model can learn highly interpretable
feature maps, resembling actual object positions.
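The contrastive latent-space objective that this line of work builds on (Kipf et al.'s contrastively trained structured world models) can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the hinge formulation follows the C-SWM recipe, and `delta` stands in for the per-object state change that a GNN transition model would predict.

```python
import numpy as np


def contrastive_transition_loss(z_t, z_next, z_neg, delta, margin=1.0):
    """Hinge-based contrastive loss in the latent space (C-SWM style sketch).

    z_t, z_next: (num_objects, dim) latent object states at times t and t+1.
    z_neg:       (num_objects, dim) negative states drawn from another sequence.
    delta:       (num_objects, dim) predicted per-object state change; in the
                 actual model this would come from a GNN over object interactions.
    """
    # Positive term: the predicted next state z_t + delta should land close
    # to the encoded true next state z_next.
    d_pos = np.sum((z_t + delta - z_next) ** 2, axis=-1).mean()
    # Negative term: unrelated states should stay at least `margin` away,
    # which prevents the encoder from collapsing everything to one point.
    d_neg = np.sum((z_neg - z_next) ** 2, axis=-1).mean()
    return d_pos + max(0.0, margin - d_neg)
```

With a perfect transition prediction and well-separated negatives, the loss is zero; gradients otherwise pull the predicted transition toward the true next state and push negatives apart, all without any pixel-space reconstruction.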
Related papers
- Learning Disentangled Representation in Object-Centric Models for Visual Dynamics Prediction via Transformers [11.155818952879146]
Recent work has shown that object-centric representations can greatly help improve the accuracy of learning dynamics.
Can learning disentangled representation further improve the accuracy of visual dynamics prediction in object-centric models?
We try to learn such disentangled representations for the case of static images \citep{nsb}, without making any specific assumptions about the kind of attributes that an object might have.
arXiv Detail & Related papers (2024-07-03T15:43:54Z) - 3D-IntPhys: Towards More Generalized 3D-grounded Visual Intuitive
Physics under Challenging Scenes [68.66237114509264]
We present a framework capable of learning 3D-grounded visual intuitive physics models from videos of complex scenes with fluids.
We show our model can make long-horizon future predictions by learning from raw images and significantly outperforms models that do not employ an explicit 3D representation space.
arXiv Detail & Related papers (2023-04-22T19:28:49Z) - Learning Multi-Object Dynamics with Compositional Neural Radiance Fields [63.424469458529906]
We present a method to learn compositional predictive models from image observations based on implicit object encoders, Neural Radiance Fields (NeRFs), and graph neural networks.
NeRFs have become a popular choice for representing scenes due to their strong 3D prior.
For planning, we utilize RRTs in the learned latent space, where we can exploit our model and the implicit object encoder to make sampling the latent space informative and more efficient.
arXiv Detail & Related papers (2022-02-24T01:31:29Z) - KINet: Unsupervised Forward Models for Robotic Pushing Manipulation [8.572983995175909]
We introduce KINet -- an unsupervised framework to reason about object interactions based on a keypoint representation.
Our model learns to associate objects with keypoint coordinates and discovers a graph representation of the system.
By learning to perform physical reasoning in the keypoint space, our model automatically generalizes to scenarios with a different number of objects.
arXiv Detail & Related papers (2022-02-18T03:32:08Z) - 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z) - Hindsight for Foresight: Unsupervised Structured Dynamics Models from
Physical Interaction [24.72947291987545]
A key challenge for an agent learning to interact with the world is to reason about the physical properties of objects.
We propose a novel approach for modeling the dynamics of a robot's interactions directly from unlabeled 3D point clouds and images.
arXiv Detail & Related papers (2020-08-02T11:04:49Z) - Occlusion resistant learning of intuitive physics from videos [52.25308231683798]
A key ability for artificial systems is to understand physical interactions between objects and predict the future outcome of a situation.
This ability, often referred to as intuitive physics, has recently received attention, and several methods have been proposed to learn these physical rules from video sequences.
arXiv Detail & Related papers (2020-04-30T19:35:54Z) - Visual Grounding of Learned Physical Models [66.04898704928517]
Humans intuitively recognize objects' physical properties and predict their motion, even when the objects are engaged in complicated interactions.
We present a neural model that simultaneously reasons about physics and makes future predictions based on visual and dynamics priors.
Experiments show that our model can infer the physical properties within a few observations, which allows the model to quickly adapt to unseen scenarios and make accurate predictions into the future.
arXiv Detail & Related papers (2020-04-28T17:06:38Z) - Predicting the Physical Dynamics of Unseen 3D Objects [65.49291702488436]
We focus on predicting the dynamics of 3D objects on a plane that have just been subjected to an impulsive force.
Our approach can generalize to object shapes and initial conditions that were unseen during training.
Our model can support training with data from either a physics engine or the real world.
arXiv Detail & Related papers (2020-01-16T06:27:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.