Related papers: Slot Structured World Models

Slot Structured World Models

URL: http://arxiv.org/abs/2402.03326v1
Date: Mon, 8 Jan 2024 21:19:30 GMT
Title: Slot Structured World Models
Authors: Jonathan Collu, Riccardo Majellaro, Aske Plaat, Thomas M. Moerland
Abstract summary: State-of-the-art approaches use a feedforward encoder to extract object embeddings and a latent graph neural network to model the interaction between these object embeddings. We introduce it Slot Structured World Models (SSWM), a class of world models that combines an it object-centric encoder with a latent graph-based dynamics model.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The ability to perceive and reason about individual objects and their interactions is a goal to be achieved for building intelligent artificial systems. State-of-the-art approaches use a feedforward encoder to extract object embeddings and a latent graph neural network to model the interaction between these object embeddings. However, the feedforward encoder can not extract {\it object-centric} representations, nor can it disentangle multiple objects with similar appearance. To solve these issues, we introduce {\it Slot Structured World Models} (SSWM), a class of world models that combines an {\it object-centric} encoder (based on Slot Attention) with a latent graph-based dynamics model. We evaluate our method in the Spriteworld benchmark with simple rules of physical interaction, where Slot Structured World Models consistently outperform baselines on a range of (multi-step) prediction tasks with action-conditional object interactions. All code to reproduce paper experiments is available from \url{https://github.com/JonathanCollu/Slot-Structured-World-Models}.

Related papers

Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment [6.614005142754584]
Universal Sparse Autoencoders (USAEs) are a framework for uncovering and aligning interpretable concepts spanning multiple deep neural networks. USAEs learn a universal concept space that can reconstruct and interpret the internal activations of multiple models at once.
arXiv Detail & Related papers (2025-02-06T02:06:16Z)
One-Shot Real-to-Sim via End-to-End Differentiable Simulation and Rendering [20.919046758279205]
We introduce a novel rigid object representation that allows the joint identification of properties.<n>Our method employs a novel differentiable point-based geometry representation coupled with a grid-based appearance field.<n>We show that our method can learn both simulation- and rendering-ready world models from only one robot action sequence.
arXiv Detail & Related papers (2024-11-29T21:02:02Z)
Keypoint Abstraction using Large Models for Object-Relative Imitation Learning [78.92043196054071]
Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics. Keypoint-based representations have been proven effective as a succinct representation for essential object capturing features. We propose KALM, a framework that leverages large pre-trained vision-language models to automatically generate task-relevant and cross-instance consistent keypoints.
arXiv Detail & Related papers (2024-10-30T17:37:31Z)
SOLD: Reinforcement Learning with Slot Object-Centric Latent Dynamics [16.020835290802548]
Slot-Attention for Object-centric Latent Dynamics is a novel algorithm that learns object-centric dynamics models from pixel inputs. We demonstrate that the structured latent space not only improves model interpretability but also provides a valuable input space for behavior models to reason over. Our results show that SOLD outperforms DreamerV3, a state-of-the-art model-based RL algorithm, across a range of benchmark robotic environments.
arXiv Detail & Related papers (2024-10-11T14:03:31Z)
SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models [30.313085784715575]
We introduce SlotFormer -- a Transformer-based autoregressive model on learned object-temporal representations. In this paper, we successfully apply SlotFormer to perform prediction on datasets with complex object interactions. We also show its ability to serve as a world model for model-based planning, which is competitive with methods designed specifically for such tasks.
arXiv Detail & Related papers (2022-10-12T01:53:58Z)
Complex-Valued Autoencoders for Object Discovery [62.26260974933819]
We propose a distributed approach to object-centric representations: the Complex AutoEncoder. We show that this simple and efficient approach achieves better reconstruction performance than an equivalent real-valued autoencoder on simple multi-object datasets. We also show that it achieves competitive unsupervised object discovery performance to a SlotAttention model on two datasets, and manages to disentangle objects in a third dataset where SlotAttention fails - all while being 7-70 times faster to train.
arXiv Detail & Related papers (2022-04-05T09:25:28Z)
Discovering Objects that Can Move [55.743225595012966]
We study the problem of object discovery -- separating objects from the background without manual labels. Existing approaches utilize appearance cues, such as color, texture, and location, to group pixels into object-like regions. We choose to focus on dynamic objects -- entities that can move independently in the world.
arXiv Detail & Related papers (2022-03-18T21:13:56Z)
Learning Multi-Object Dynamics with Compositional Neural Radiance Fields [63.424469458529906]
We present a method to learn compositional predictive models from image observations based on implicit object encoders, Neural Radiance Fields (NeRFs), and graph neural networks. NeRFs have become a popular choice for representing scenes due to their strong 3D prior. For planning, we utilize RRTs in the learned latent space, where we can exploit our model and the implicit object encoder to make sampling the latent space informative and more efficient.
arXiv Detail & Related papers (2022-02-24T01:31:29Z)
KINet: Unsupervised Forward Models for Robotic Pushing Manipulation [8.572983995175909]
We introduce KINet -- an unsupervised framework to reason about object interactions based on a keypoint representation. Our model learns to associate objects with keypoint coordinates and discovers a graph representation of the system. By learning to perform physical reasoning in the keypoint space, our model automatically generalizes to scenarios with a different number of objects.
arXiv Detail & Related papers (2022-02-18T03:32:08Z)
Model-Based Visual Planning with Self-Supervised Functional Distances [104.83979811803466]
We present a self-supervised method for model-based visual goal reaching. Our approach learns entirely using offline, unlabeled data. We find that this approach substantially outperforms both model-free and model-based prior methods.
arXiv Detail & Related papers (2020-12-30T23:59:09Z)
Look-into-Object: Self-supervised Structure Modeling for Object Recognition [71.68524003173219]
We propose to "look into object" (explicitly yet intrinsically model the object structure) through incorporating self-supervisions. We show the recognition backbone can be substantially enhanced for more robust representation learning. Our approach achieves large performance gain on a number of benchmarks, including generic object recognition (ImageNet) and fine-grained object recognition tasks (CUB, Cars, Aircraft)
arXiv Detail & Related papers (2020-03-31T12:22:51Z)
CAZSL: Zero-Shot Regression for Pushing Models by Generalizing Through Context [13.217582954907234]
We study the problem of designing deep learning agents which can generalize their models of the physical world by building context-aware models. We present context-aware zero shot learning (CAZSL, pronounced as casual) models, an approach utilizing a Siamese network, embedding space and regularization based on context variables. We test our proposed learning algorithm on the recently released Omnipush datatset that allows testing of meta-learning capabilities.
arXiv Detail & Related papers (2020-03-26T01:21:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.