SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models
- URL: http://arxiv.org/abs/2210.05861v1
- Date: Wed, 12 Oct 2022 01:53:58 GMT
- Title: SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models
- Authors: Ziyi Wu, Nikita Dvornik, Klaus Greff, Thomas Kipf, Animesh Garg
- Abstract summary: We introduce SlotFormer -- a Transformer-based autoregressive model on learned object-centric representations.
In this paper, we successfully apply SlotFormer to perform video prediction on datasets with complex object interactions.
We also show its ability to serve as a world model for model-based planning, which is competitive with methods designed specifically for such tasks.
- Score: 30.313085784715575
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding dynamics from visual observations is a challenging problem that
requires disentangling individual objects from the scene and learning their
interactions. While recent object-centric models can successfully decompose a
scene into objects, modeling their dynamics effectively still remains a
challenge. We address this problem by introducing SlotFormer -- a
Transformer-based autoregressive model operating on learned object-centric
representations. Given a video clip, our approach reasons over object features
to model spatio-temporal relationships and predicts accurate future object
states. In this paper, we successfully apply SlotFormer to perform video
prediction on datasets with complex object interactions. Moreover, the
unsupervised SlotFormer's dynamics model can be used to improve the performance
on supervised downstream tasks, such as Visual Question Answering (VQA), and
goal-conditioned planning. Compared to past works on dynamics modeling, our
method achieves significantly better long-term synthesis of object dynamics,
while retaining high quality visual generation. Besides, SlotFormer enables VQA
models to reason about the future without object-level labels, even
outperforming counterparts that use ground-truth annotations. Finally, we show
its ability to serve as a world model for model-based planning, which is
competitive with methods designed specifically for such tasks.
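The core mechanism the abstract describes is autoregressive rollout: given slot representations for a few burn-in frames, the model predicts the next slot state, feeds that prediction back into its context, and repeats. The sketch below shows only this rollout loop; `toy_dynamics` is a hypothetical stand-in for SlotFormer's learned Transformer, and all names and shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rollout_slots(slots, dynamics, burn_in, horizon):
    """Autoregressively roll out future slot states.

    slots:    (T, N, D) observed slot representations (T frames, N slots).
    dynamics: callable mapping a (T, N, D) history window to the next
              (N, D) slot state -- stand-in for the learned Transformer.
    Returns (horizon, N, D) predicted slot states.
    """
    history = list(slots[:burn_in])
    preds = []
    for _ in range(horizon):
        nxt = dynamics(np.stack(history))
        preds.append(nxt)
        history.append(nxt)   # feed the prediction back as new context
        history.pop(0)        # keep a fixed-size context window
    return np.stack(preds)

# Toy "dynamics": mean of the history plus a constant drift.
def toy_dynamics(hist):
    return hist.mean(axis=0) + 0.1

obs = np.zeros((4, 3, 8))     # 4 burn-in frames, 3 slots, 8-dim features
future = rollout_slots(obs, toy_dynamics, burn_in=4, horizon=6)
print(future.shape)           # (6, 3, 8)
```

In the actual model, the predicted slots would additionally be decoded back to pixels for video prediction, or consumed by downstream VQA and planning modules.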
Related papers
- Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases [69.46487306858789]
Conditional Autoregressive Slot Attention (CA-SA) is a framework that enhances the temporal consistency of extracted object-centric representations in video-centric vision tasks.
We present qualitative and quantitative results showing that our proposed method outperforms the considered baselines on downstream tasks.
arXiv Detail & Related papers (2024-10-21T07:44:44Z)
- SOLD: Reinforcement Learning with Slot Object-Centric Latent Dynamics [16.020835290802548]
Slot-Attention for Object-centric Latent Dynamics is a novel algorithm that learns object-centric dynamics models from pixel inputs.
We demonstrate that the structured latent space not only improves model interpretability but also provides a valuable input space for behavior models to reason over.
Our results show that SOLD outperforms DreamerV3, a state-of-the-art model-based RL algorithm, across a range of benchmark robotic environments.
arXiv Detail & Related papers (2024-10-11T14:03:31Z)
- Unsupervised Dynamics Prediction with Object-Centric Kinematics [22.119612406160073]
We propose Object-Centric Kinematics (OCK), a framework for dynamics prediction leveraging object-centric representations.
OCK incorporates low-level structured states capturing each object's position, velocity, and acceleration.
Our model demonstrates superior performance when handling objects and backgrounds in complex scenes characterized by a wide range of object attributes and dynamic movements.
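The low-level states mentioned above can be derived from position tracks by finite differences. The sketch below is one hedged reading of such a kinematic state; the function name, shapes, and the toy constant-velocity trajectory are illustrative assumptions, not OCK's actual feature extractor.

```python
import numpy as np

def kinematic_state(positions, dt=1.0):
    """Stack position, velocity, and acceleration per object.

    positions: (T, N, 2) object centers over T frames.
    Returns (T-2, N, 6): [position, velocity, acceleration], where
    velocity and acceleration come from finite differences.
    """
    vel = np.diff(positions, axis=0) / dt   # (T-1, N, 2)
    acc = np.diff(vel, axis=0) / dt         # (T-2, N, 2)
    # Trim positions/velocities so all three align on the same frames.
    return np.concatenate([positions[2:], vel[1:], acc], axis=-1)

# Toy constant-velocity motion: each frame moves every object by +1.
traj = np.cumsum(np.ones((5, 2, 2)), axis=0)
state = kinematic_state(traj)
print(state.shape)   # (3, 2, 6)
```

On this toy trajectory the velocity channels are constant and the acceleration channels are zero, as expected for uniform motion.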
arXiv Detail & Related papers (2024-04-29T04:47:23Z)
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models [47.986381326169166]
We introduce SlotDiffusion -- an object-centric Latent Diffusion Model (LDM) designed for both image and video data.
Thanks to the powerful modeling capacity of LDMs, SlotDiffusion surpasses previous slot models in unsupervised object segmentation and visual generation.
Our learned object features can be utilized by existing object-centric dynamics models, improving video prediction quality and downstream temporal reasoning tasks.
arXiv Detail & Related papers (2023-05-18T19:56:20Z)
- Model-Based Visual Planning with Self-Supervised Functional Distances [104.83979811803466]
We present a self-supervised method for model-based visual goal reaching.
Our approach learns entirely using offline, unlabeled data.
We find that this approach substantially outperforms both model-free and model-based prior methods.
arXiv Detail & Related papers (2020-12-30T23:59:09Z)
- Planning from Pixels using Inverse Dynamics Models [44.16528631970381]
We propose a novel way to learn latent world models by learning to predict sequences of future actions conditioned on task completion.
We evaluate our method on challenging visual goal completion tasks and show a substantial increase in performance compared to prior model-free approaches.
arXiv Detail & Related papers (2020-12-04T06:07:36Z)
- Goal-Aware Prediction: Learning to Model What Matters [105.43098326577434]
One of the fundamental challenges in using a learned forward dynamics model is the mismatch between the objective of the learned model and that of the downstream planner or policy.
We propose to direct prediction towards task relevant information, enabling the model to be aware of the current task and encouraging it to only model relevant quantities of the state space.
We find that our method more effectively models the relevant parts of the scene conditioned on the goal, and as a result outperforms standard task-agnostic dynamics models and model-free reinforcement learning.
arXiv Detail & Related papers (2020-07-14T16:42:59Z)
- Learning Predictive Representations for Deformable Objects Using Contrastive Estimation [83.16948429592621]
We propose a new learning framework that jointly optimizes both the visual representation model and the dynamics model.
We show substantial improvements over standard model-based learning techniques across our rope and cloth manipulation suite.
arXiv Detail & Related papers (2020-03-11T17:55:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.