SlotGNN: Unsupervised Discovery of Multi-Object Representations and
Visual Dynamics
- URL: http://arxiv.org/abs/2310.04617v1
- Date: Fri, 6 Oct 2023 22:37:34 GMT
- Title: SlotGNN: Unsupervised Discovery of Multi-Object Representations and
Visual Dynamics
- Authors: Alireza Rezazadeh, Athreyi Badithela, Karthik Desingh, Changhyun Choi
- Abstract summary: This paper presents a novel framework for learning multi-object dynamics from visual data using unsupervised techniques.
Two new architectures: SlotTransport for discovering object representations from RGB images and SlotGNN for predicting their collective dynamics from RGB images and robot interactions.
With only minimal additional data, our framework robustly predicts slots and their corresponding dynamics in real-world control tasks.
- Score: 15.705023986053575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning multi-object dynamics from visual data using unsupervised techniques
is challenging due to the need for robust object representations that can be
learned through robot interactions. This paper presents a novel framework with
two new architectures: SlotTransport for discovering object representations
from RGB images and SlotGNN for predicting their collective dynamics from RGB
images and robot interactions. Our SlotTransport architecture is based on slot
attention for unsupervised object discovery and uses a feature transport
mechanism to maintain temporal alignment in object-centric representations.
This enables the discovery of slots that consistently reflect the composition
of multi-object scenes. These slots robustly bind to distinct objects, even
under heavy occlusion or absence. Our SlotGNN, a novel unsupervised graph-based
dynamics model, predicts the future state of multi-object scenes. SlotGNN
learns a graph representation of the scene using the discovered slots from
SlotTransport and performs relational and spatial reasoning to predict the
future appearance of each slot conditioned on robot actions. We demonstrate the
effectiveness of SlotTransport in learning object-centric features that
accurately encode both visual and positional information. Further, we highlight
the accuracy of SlotGNN in downstream robotic tasks, including challenging
multi-object rearrangement and long-horizon prediction. Finally, our
unsupervised approach proves effective in the real world. With only minimal
additional data, our framework robustly predicts slots and their corresponding
dynamics in real-world control tasks.
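The abstract describes SlotGNN as a graph over discovered slots that performs relational and spatial reasoning, conditioned on robot actions, to predict each slot's future state. The following is a minimal illustrative sketch of that idea, not the authors' implementation: slot vectors, pairwise message passing, and an action-conditioned update are all stand-ins with random (untrained) parameters, and every name and dimension here is a hypothetical choice.

```python
import numpy as np

# Hypothetical sketch of a SlotGNN-style dynamics step (not the paper's code).
# Each slot is a graph node; messages flow between all slot pairs (relational
# reasoning), and each slot's next state is predicted from its own state, the
# aggregated messages, and the robot action. In the real model these maps are
# learned neural networks; here they are random linear layers with tanh.

rng = np.random.default_rng(0)
N_SLOTS, D_SLOT, D_ACTION = 4, 8, 2  # illustrative sizes, not from the paper

W_msg = rng.normal(size=(2 * D_SLOT, D_SLOT)) * 0.1           # edge function
W_upd = rng.normal(size=(2 * D_SLOT + D_ACTION, D_SLOT)) * 0.1  # node update

def slot_gnn_step(slots, action):
    """One message-passing step: slots (N, D_SLOT), action (D_ACTION,)."""
    msgs = np.zeros_like(slots)
    for i in range(len(slots)):
        for j in range(len(slots)):
            if i != j:  # message from slot j to slot i
                pair = np.concatenate([slots[i], slots[j]])
                msgs[i] += np.tanh(pair @ W_msg)
    # Node update conditioned on the robot action, broadcast to every slot.
    act = np.tile(action, (len(slots), 1))
    inp = np.concatenate([slots, msgs, act], axis=1)
    return slots + np.tanh(inp @ W_upd)  # residual update -> next slot states

slots = rng.normal(size=(N_SLOTS, D_SLOT))
next_slots = slot_gnn_step(slots, np.array([1.0, 0.0]))
print(next_slots.shape)  # (4, 8)
```

Rolling this step forward autoregressively (feeding `next_slots` back in with the next action) gives the kind of long-horizon prediction the abstract evaluates, though the real model decodes slots back to RGB appearance as well.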
Related papers
- MMRDN: Consistent Representation for Multi-View Manipulation
Relationship Detection in Object-Stacked Scenes [62.20046129613934]
We propose a novel multi-view fusion framework, namely the multi-view MRD network (MMRDN).
We project the 2D data from different views into a common hidden space and fit the embeddings with a set of Von-Mises-Fisher distributions.
We select a set of $K$ Maximum Vertical Neighbors (KMVN) points from the point cloud of each object pair, which encodes the relative position of these two objects.
arXiv Detail & Related papers (2023-04-25T05:55:29Z) - Invariant Slot Attention: Object Discovery with Slot-Centric Reference
Frames [18.84636947819183]
Slot-based neural networks that learn about objects in a self-supervised manner have made exciting progress.
We present a simple yet highly effective method for incorporating spatial symmetries via slot-centric reference frames.
We evaluate our method on a range of synthetic object discovery benchmarks, namely CLEVR, Tetrominoes, Objects Room, and MultiShapeNet.
arXiv Detail & Related papers (2023-02-09T23:25:28Z) - SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric
Models [30.313085784715575]
We introduce SlotFormer -- a Transformer-based autoregressive model on learned object-temporal representations.
In this paper, we successfully apply SlotFormer to perform prediction on datasets with complex object interactions.
We also show its ability to serve as a world model for model-based planning, which is competitive with methods designed specifically for such tasks.
arXiv Detail & Related papers (2022-10-12T01:53:58Z) - Learn to Predict How Humans Manipulate Large-sized Objects from
Interactive Motions [82.90906153293585]
We propose a graph neural network, HO-GCN, to fuse motion data and dynamic descriptors for the prediction task.
We show the proposed network that consumes dynamic descriptors can achieve state-of-the-art prediction results and help the network better generalize to unseen objects.
arXiv Detail & Related papers (2022-06-25T09:55:39Z) - MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic
Grasping via Physics-based Metaverse Synthesis [78.26022688167133]
We present a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis.
The proposed dataset contains 100,000 images and 25 different object types.
We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance.
arXiv Detail & Related papers (2021-12-29T17:23:24Z) - Lifelong 3D Object Recognition and Grasp Synthesis Using Dual Memory
Recurrent Self-Organization Networks [0.0]
Humans learn to recognize and manipulate new objects in lifelong settings without forgetting the previously gained knowledge.
In most conventional deep neural networks, this is not possible due to the problem of catastrophic forgetting.
We propose a hybrid model architecture consisting of a dual-memory recurrent neural network and an autoencoder to tackle object recognition and grasping simultaneously.
arXiv Detail & Related papers (2021-09-23T11:14:13Z) - 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z) - TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z) - DS-Net: Dynamic Spatiotemporal Network for Video Salient Object
Detection [78.04869214450963]
We propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of temporal and spatial information.
We show that the proposed method achieves superior performance than state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z) - Learning Object-Based State Estimators for Household Robots [11.055133590909097]
We build object-based memory systems that operate on high-dimensional observations and hypotheses.
We demonstrate the system's effectiveness in maintaining memory of dynamically changing objects in both simulated environment and real images.
arXiv Detail & Related papers (2020-11-06T04:18:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.