Reasoning-Enhanced Object-Centric Learning for Videos
- URL: http://arxiv.org/abs/2403.15245v1
- Date: Fri, 22 Mar 2024 14:41:55 GMT
- Title: Reasoning-Enhanced Object-Centric Learning for Videos
- Authors: Jian Li, Pu Ren, Yang Liu, Hao Sun
- Abstract summary: We develop a Slot-based Time-Space Transformer with Memory buffer (STATM) to enhance the model's perception ability in complex scenes.
Our experimental results on various datasets show that STATM can significantly enhance the object-centric learning capabilities of slot-based video models.
- Score: 15.554898985821302
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Object-centric learning aims to break down complex visual scenes into more manageable object representations, enhancing the understanding and reasoning abilities of machine learning systems toward the physical world. Recently, slot-based video models have demonstrated remarkable proficiency in segmenting and tracking objects, but they overlook the importance of an effective reasoning module. In the real world, reasoning and predictive abilities play a crucial role in human perception and object tracking; in particular, these abilities are closely related to human intuitive physics. Inspired by this, we designed a novel reasoning module called the Slot-based Time-Space Transformer with Memory buffer (STATM) to enhance the model's perception ability in complex scenes. The memory buffer primarily serves as storage for slot information from upstream modules, while the Slot-based Time-Space Transformer makes predictions through slot-based spatiotemporal attention computations and fusion. Our experimental results on various datasets show that STATM can significantly enhance the object-centric learning capabilities of slot-based video models.
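The abstract names two components but the listing carries no code; the following is a minimal PyTorch sketch of one plausible reading: a memory buffer storing slot states from upstream frames, a temporal attention pass where each slot consults its own history, a spatial attention pass where current slots interact, and a fusion step. All module names, dimensions, and the concatenate-then-project fusion are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; not the authors' code. Assumes slots of shape
# (batch, num_slots, dim) produced by an upstream module such as Slot Attention.
from collections import deque

import torch
import torch.nn as nn


class STATMSketch(nn.Module):
    """Memory buffer plus slot-based time-space attention (hypothetical layout)."""

    def __init__(self, dim: int = 64, heads: int = 4, buffer_len: int = 6):
        super().__init__()
        self.buffer = deque(maxlen=buffer_len)       # stores past slot tensors
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)          # assumed fusion: concat + linear

    def forward(self, slots: torch.Tensor) -> torch.Tensor:
        # slots: (B, K, D) for the current frame.
        self.buffer.append(slots.detach())
        B, K, D = slots.shape
        # Temporal attention: each slot queries its own history in the buffer.
        hist = torch.stack(list(self.buffer), dim=2)           # (B, K, T, D)
        q = slots.reshape(B * K, 1, D)
        kv = hist.reshape(B * K, -1, D)
        t_out, _ = self.time_attn(q, kv, kv)
        t_out = t_out.reshape(B, K, D)
        # Spatial attention: slots of the current frame attend to each other.
        s_out, _ = self.space_attn(slots, slots, slots)
        # Fuse temporal and spatial features into the predicted slots.
        return self.fuse(torch.cat([t_out, s_out], dim=-1))


slots = torch.randn(2, 5, 64)                    # 2 videos, 5 slots each
pred = STATMSketch()(slots)
print(pred.shape)                                # torch.Size([2, 5, 64])
```

Here the temporal pass supplies the history-based reasoning signal while the spatial pass models interactions among objects in the current frame, which is one way to realize "slot-based spatiotemporal attention computations and fusion".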
Related papers
- SOLD: Slot Object-Centric Latent Dynamics Models for Relational Manipulation Learning from Pixels [16.020835290802548]
Slot-Attention for Object-centric Latent Dynamics (SOLD) is a novel model-based reinforcement learning algorithm.
It learns object-centric dynamics models in an unsupervised manner from pixel inputs.
We demonstrate that the structured latent space not only improves model interpretability but also provides a valuable input space for behavior models to reason over.
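As a rough illustration of the idea, not the SOLD implementation, the sketch below rolls out a dynamics model in a slot-structured latent space conditioned on actions; the action-as-token conditioning and the small transformer core are assumptions.

```python
# Hypothetical sketch of object-centric latent dynamics; not the SOLD code.
import torch
import torch.nn as nn


class SlotDynamicsSketch(nn.Module):
    """Predicts next-step slots from current slots and an action token."""

    def __init__(self, dim: int = 64, heads: int = 4, action_dim: int = 4):
        super().__init__()
        self.action_embed = nn.Linear(action_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.core = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, slots, action):
        # slots: (B, K, D); action: (B, action_dim).
        a = self.action_embed(action).unsqueeze(1)      # action as an extra token
        tokens = torch.cat([a, slots], dim=1)           # (B, K + 1, D)
        out = self.core(tokens)
        return out[:, 1:]                               # next-step slots; drop action token


model = SlotDynamicsSketch()
slots = torch.randn(2, 6, 64)
for t in range(3):                                      # imagined rollout for planning
    slots = model(slots, torch.randn(2, 4))
```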
arXiv Detail & Related papers (2024-10-11T14:03:31Z) - E-Motion: Future Motion Simulation via Event Sequence Diffusion [86.80533612211502]
Event-based sensors may offer a unique opportunity to predict future motion with a level of detail and precision previously unachievable.
We propose to integrate the strong learning capacity of the video diffusion model with the rich motion information of an event camera as a motion simulation framework.
Our findings suggest a promising direction for future research in enhancing the interpretative power and predictive accuracy of computer vision systems.
arXiv Detail & Related papers (2024-10-11T09:19:23Z) - SlotGNN: Unsupervised Discovery of Multi-Object Representations and Visual Dynamics [15.705023986053575]
This paper presents a novel framework for learning multi-object dynamics from visual data using unsupervised techniques.
It introduces two new architectures: SlotTransport, which discovers object representations from RGB images, and SlotGNN, which predicts their collective dynamics from RGB images and robot interactions.
With only minimal additional data, our framework robustly predicts slots and their corresponding dynamics in real-world control tasks.
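A minimal, fully connected message-passing step in that spirit is sketched below; the edge/node MLP layout and the action conditioning are assumptions about the architecture, not the paper's code.

```python
# Illustrative message passing over slots; not the SlotGNN implementation.
import torch
import torch.nn as nn


class SlotMessagePassing(nn.Module):
    """One GNN step: edge messages between all slot pairs, then node updates."""

    def __init__(self, dim: int = 64, action_dim: int = 4):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                      nn.Linear(dim, dim))
        self.node_mlp = nn.Sequential(nn.Linear(2 * dim + action_dim, dim), nn.ReLU(),
                                      nn.Linear(dim, dim))

    def forward(self, slots, action):
        # slots: (B, K, D); action: (B, action_dim), e.g. a pushing command.
        B, K, D = slots.shape
        src = slots.unsqueeze(2).expand(B, K, K, D)      # sender features
        dst = slots.unsqueeze(1).expand(B, K, K, D)      # receiver features
        # Aggregate pairwise messages over senders for each receiver slot.
        messages = self.edge_mlp(torch.cat([src, dst], dim=-1)).sum(dim=1)
        act = action.unsqueeze(1).expand(B, K, action.shape[-1])
        # Residual node update conditioned on the robot action.
        return slots + self.node_mlp(torch.cat([slots, messages, act], dim=-1))


slots = torch.randn(2, 5, 64)
next_slots = SlotMessagePassing()(slots, torch.randn(2, 4))
print(next_slots.shape)                                  # torch.Size([2, 5, 64])
```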
arXiv Detail & Related papers (2023-10-06T22:37:34Z) - How Physics and Background Attributes Impact Video Transformers in Robotic Manipulation: A Case Study on Planar Pushing [8.435401907462245]
We investigate how physics attributes and scene background characteristics influence the performance of Video Transformers.
We present CloudGripper-Push-1K, a large real-world vision-based robot pushing dataset.
We also propose Video Occlusion Transformer (VOT), a generic modular video-transformer-based trajectory prediction framework.
arXiv Detail & Related papers (2023-10-03T13:35:49Z) - Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-centric video model, improving performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z) - SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models [47.986381326169166]
We introduce SlotDiffusion -- an object-centric Latent Diffusion Model (LDM) designed for both image and video data.
Thanks to the powerful modeling capacity of LDMs, SlotDiffusion surpasses previous slot models in unsupervised object segmentation and visual generation.
Our learned object features can be utilized by existing object-centric dynamics models, improving video prediction quality and downstream temporal reasoning tasks.
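The core idea, conditioning a latent diffusion decoder on slots, can be sketched as follows; the tiny cross-attention denoiser and the linear noise schedule are placeholder assumptions, not SlotDiffusion's actual U-Net or schedule.

```python
# Toy slot-conditioned denoising step; not the SlotDiffusion implementation.
import torch
import torch.nn as nn


class SlotConditionedDenoiser(nn.Module):
    """Predicts the noise added to a latent, attending to slot embeddings."""

    def __init__(self, latent_dim: int = 32, slot_dim: int = 64, heads: int = 4):
        super().__init__()
        self.to_latent = nn.Linear(latent_dim + 1, slot_dim)   # +1 for the timestep
        self.cross_attn = nn.MultiheadAttention(slot_dim, heads, batch_first=True)
        self.out = nn.Linear(slot_dim, latent_dim)

    def forward(self, noisy_latent, t, slots):
        # noisy_latent: (B, N, latent_dim) patch latents; t: (B,); slots: (B, K, slot_dim).
        t_feat = t.float().view(-1, 1, 1).expand(-1, noisy_latent.shape[1], 1)
        h = self.to_latent(torch.cat([noisy_latent, t_feat], dim=-1))
        h, _ = self.cross_attn(h, slots, slots)       # condition on slots
        return self.out(h)                            # predicted noise


# One DDPM-style training step with a crude stand-in noise schedule (assumed).
denoiser = SlotConditionedDenoiser()
latent, slots = torch.randn(2, 16, 32), torch.randn(2, 5, 64)
t = torch.randint(1, 1000, (2,))
alpha_bar = (1.0 - t.float() / 1000).view(-1, 1, 1)
noise = torch.randn_like(latent)
noisy = alpha_bar.sqrt() * latent + (1 - alpha_bar).sqrt() * noise
loss = ((denoiser(noisy, t, slots) - noise) ** 2).mean()
loss.backward()
```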
arXiv Detail & Related papers (2023-05-18T19:56:20Z) - Solving Reasoning Tasks with a Slot Transformer [7.966351917016229]
We present the Slot Transformer, an architecture that leverages slot attention, transformers and iterative variational inference on video scene data to infer representations.
We evaluate the effectiveness of key components of the architecture, the model's representational capacity and its ability to predict from incomplete input.
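The slot-attention component the paper builds on is well documented (Locatello et al., 2020); a compact version of its iterative update is sketched below with arbitrary dimensions. This is only the attention module, not the full Slot Transformer, which additionally applies a transformer and iterative variational inference.

```python
# Minimal Slot Attention (after Locatello et al., 2020); sampling of initial
# slots and the post-GRU MLP are omitted for brevity.
import torch
import torch.nn as nn


class SlotAttentionSketch(nn.Module):
    def __init__(self, num_slots: int = 5, dim: int = 64, iters: int = 3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, num_slots, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, inputs):
        # inputs: (B, N, D) encoded image/video features.
        B = inputs.shape[0]
        inputs = self.norm_in(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        slots = self.slots_mu.expand(B, -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            # Softmax over slots makes them compete for input features.
            attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=1)
            attn = attn / attn.sum(dim=-1, keepdim=True)   # weighted mean per slot
            updates = attn @ v                             # (B, K, D)
            slots = self.gru(updates.reshape(-1, slots.shape[-1]),
                             slots.reshape(-1, slots.shape[-1])
                             ).reshape(B, self.num_slots, -1)
        return slots


feats = torch.randn(2, 196, 64)             # e.g. a 14x14 feature map, flattened
print(SlotAttentionSketch()(feats).shape)   # torch.Size([2, 5, 64])
```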
arXiv Detail & Related papers (2022-10-20T16:40:30Z) - SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization [59.732036564862796]
We propose the Structure Information Modeling Transformer (SIM-Trans) to incorporate object structure information into the transformer and enhance discriminative representation learning.
The two proposed modules are lightweight, can be plugged into any transformer network, and are easily trained end-to-end.
Experiments and analyses demonstrate that the proposed SIM-Trans achieves state-of-the-art performance on fine-grained visual categorization benchmarks.
arXiv Detail & Related papers (2022-08-31T03:00:07Z) - Learning Multi-Object Dynamics with Compositional Neural Radiance Fields [63.424469458529906]
We present a method to learn compositional predictive models from image observations based on implicit object encoders, Neural Radiance Fields (NeRFs), and graph neural networks.
NeRFs have become a popular choice for representing scenes due to their strong 3D prior.
For planning, we utilize rapidly-exploring random trees (RRTs) in the learned latent space, where we can exploit our model and the implicit object encoder to make sampling the latent space informative and more efficient.
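Mechanically, planning with an RRT in a learned latent space is a standard RRT whose nodes are latent vectors and whose edges come from the learned dynamics. The sketch below uses a toy dynamics function and goal-biased random shooting as stand-ins for the paper's learned NeRF/GNN model and sampling strategy.

```python
# Generic RRT over latent vectors; everything model-specific is a placeholder.
import numpy as np


def dynamics(z, a):
    """Toy latent dynamics; stands in for the learned compositional model."""
    return z + 0.1 * a


def rrt_latent(z_start, z_goal, n_iter=500, n_actions=16, tol=0.2, seed=0):
    rng = np.random.default_rng(seed)
    nodes, parents = [z_start], [None]
    for _ in range(n_iter):
        # Goal-biased sampling of the latent space.
        z_rand = z_goal if rng.random() < 0.2 else rng.normal(size=z_start.shape)
        near = min(range(len(nodes)),
                   key=lambda i: np.linalg.norm(nodes[i] - z_rand))
        # Steer by random shooting: keep the sampled action that gets closest.
        actions = rng.normal(size=(n_actions,) + z_start.shape)
        z_new = min((dynamics(nodes[near], a) for a in actions),
                    key=lambda z: np.linalg.norm(z - z_rand))
        nodes.append(z_new)
        parents.append(near)
        if np.linalg.norm(z_new - z_goal) < tol:   # close enough: backtrack path
            path, i = [], len(nodes) - 1
            while i is not None:
                path.append(nodes[i])
                i = parents[i]
            return path[::-1]
    return None


path = rrt_latent(np.zeros(2), np.array([1.0, 1.0]))
print(None if path is None else len(path))
```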
arXiv Detail & Related papers (2022-02-24T01:31:29Z) - Multi-Object Tracking with Deep Learning Ensemble for Unmanned Aerial System Applications [0.0]
Multi-object tracking (MOT) is a crucial component of situational awareness in military defense applications.
We present a robust object tracking architecture designed to accommodate the noise encountered in real-time situations.
We propose a kinematic prediction model, called Deep Extended Kalman Filter (DeepEKF), in which a sequence-to-sequence architecture is used to predict entity trajectories in latent space.
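A bare sequence-to-sequence trajectory predictor in the spirit of DeepEKF's seq2seq component is sketched below; the GRU encoder/decoder choice, layer sizes, and autoregressive decoding are assumptions, and the Kalman-filter integration is omitted.

```python
# Hypothetical seq2seq trajectory predictor; not the DeepEKF implementation.
import torch
import torch.nn as nn


class Seq2SeqTrajectory(nn.Module):
    def __init__(self, obs_dim: int = 2, hidden: int = 64):
        super().__init__()
        self.encoder = nn.GRU(obs_dim, hidden, batch_first=True)
        self.decoder = nn.GRUCell(obs_dim, hidden)
        self.head = nn.Linear(hidden, obs_dim)

    def forward(self, past, horizon: int):
        # past: (B, T, obs_dim) observed positions; predict `horizon` future steps.
        _, h = self.encoder(past)            # h: (1, B, hidden) summary of the past
        h, x = h.squeeze(0), past[:, -1]     # start decoding from the last observation
        preds = []
        for _ in range(horizon):
            h = self.decoder(x, h)
            x = self.head(h)                 # autoregressive: feed prediction back in
            preds.append(x)
        return torch.stack(preds, dim=1)     # (B, horizon, obs_dim)


model = Seq2SeqTrajectory()
future = model(torch.randn(4, 10, 2), horizon=5)
print(future.shape)                          # torch.Size([4, 5, 2])
```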
arXiv Detail & Related papers (2021-10-05T13:50:38Z) - Physics-Integrated Variational Autoencoders for Robust and Interpretable Generative Modeling [86.9726984929758]
We focus on the integration of incomplete physics models into deep generative models.
We propose a VAE architecture in which a part of the latent space is grounded by physics.
We demonstrate generative performance improvements over a set of synthetic and real-world datasets.
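The sketch below shows the general pattern of grounding part of a VAE latent space in physics: a crude pendulum simulator stands in for the "incomplete physics model", and the encoder, residual decoder, and loss weighting are all illustrative assumptions rather than the paper's architecture.

```python
# Schematic physics-grounded VAE; all components are illustrative stand-ins.
import torch
import torch.nn as nn


def pendulum(z_phys, steps=20, dt=0.1):
    # z_phys: (B, 2) initial angle and angular velocity; crude Euler rollout.
    theta, omega = z_phys[:, :1], z_phys[:, 1:]
    traj = []
    for _ in range(steps):
        omega = omega - dt * torch.sin(theta)
        theta = theta + dt * omega
        traj.append(theta)
    return torch.cat(traj, dim=1)            # (B, steps) angle trajectory


class PhysVAESketch(nn.Module):
    def __init__(self, obs_dim: int = 20, z_aux: int = 4):
        super().__init__()
        self.enc = nn.Linear(obs_dim, 2 * (2 + z_aux))   # means and log-variances
        self.correct = nn.Linear(z_aux, obs_dim)         # learned residual decoder

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
        z_phys, z_res = z[:, :2], z[:, 2:]               # physics vs. residual latent
        recon = pendulum(z_phys) + self.correct(z_res)   # physics + neural correction
        kl = 0.5 * (mu ** 2 + logvar.exp() - 1 - logvar).sum(-1).mean()
        return ((recon - x) ** 2).mean() + 1e-3 * kl     # ELBO-style loss


loss = PhysVAESketch()(torch.randn(8, 20))
loss.backward()
```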
arXiv Detail & Related papers (2021-02-25T20:28:52Z)