Reasoning-Enhanced Object-Centric Learning for Videos
- URL: http://arxiv.org/abs/2403.15245v1
- Date: Fri, 22 Mar 2024 14:41:55 GMT
- Title: Reasoning-Enhanced Object-Centric Learning for Videos
- Authors: Jian Li, Pu Ren, Yang Liu, Hao Sun
- Abstract summary: We develop a Slot-based Time-Space Transformer with Memory buffer (STATM) to enhance the model's perception ability in complex scenes.
Our experimental results on various datasets show that STATM significantly enhances the object-centric learning capabilities of slot-based video models.
- Score: 15.554898985821302
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Object-centric learning aims to break down complex visual scenes into more manageable object representations, enhancing the understanding and reasoning abilities of machine learning systems toward the physical world. Recently, slot-based video models have demonstrated remarkable proficiency in segmenting and tracking objects, but they overlook the importance of an effective reasoning module. In the real world, reasoning and predictive abilities play a crucial role in human perception and object tracking; in particular, these abilities are closely related to human intuitive physics. Inspired by this, we designed a novel reasoning module called the Slot-based Time-Space Transformer with Memory buffer (STATM) to enhance the model's perception ability in complex scenes. The memory buffer primarily serves as storage for slot information from upstream modules, while the Slot-based Time-Space Transformer makes predictions through slot-based spatiotemporal attention computation and fusion. Our experimental results on various datasets show that STATM significantly enhances the object-centric learning capabilities of slot-based video models.
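The abstract describes two interacting pieces: a memory buffer that stores slot states produced by upstream modules, and temporal/spatial attention over those slots whose outputs are fused. The paper's actual architecture is not reproduced here; the following is a minimal NumPy sketch of that general idea, in which all names (`STATMSketch`), the averaging fusion, and the buffer size are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: q is (n_q, d), k and v are (n_k, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

class STATMSketch:
    """Toy sketch of a memory buffer over slot states plus temporal and
    spatial attention whose outputs are fused by simple averaging."""

    def __init__(self, buffer_size=8):
        self.memory = []              # list of (num_slots, dim) arrays
        self.buffer_size = buffer_size

    def step(self, slots):
        # slots: (num_slots, dim) from an upstream slot encoder.
        if self.memory:
            hist = np.stack(self.memory)          # (T, num_slots, dim)
            # Temporal attention: each slot attends to its own history.
            temporal = np.stack([
                attention(slots[i:i + 1], hist[:, i], hist[:, i])[0]
                for i in range(slots.shape[0])
            ])
        else:
            temporal = slots
        # Spatial attention: slots attend to each other in the current frame.
        spatial = attention(slots, slots, slots)
        fused = 0.5 * (temporal + spatial)        # illustrative fusion choice
        # Update the bounded memory buffer.
        self.memory.append(slots)
        if len(self.memory) > self.buffer_size:
            self.memory.pop(0)
        return fused
```

The bounded buffer mirrors the abstract's description of slot storage; the real model's prediction head, projections, and fusion scheme would differ.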
Related papers
- SOLD: Reinforcement Learning with Slot Object-Centric Latent Dynamics [16.020835290802548]
Slot-Attention for Object-centric Latent Dynamics (SOLD) is a novel algorithm that learns object-centric dynamics models from pixel inputs.
We demonstrate that the structured latent space not only improves model interpretability but also provides a valuable input space for behavior models to reason over.
Our results show that SOLD outperforms DreamerV3, a state-of-the-art model-based RL algorithm, across a range of benchmark robotic environments.
arXiv Detail & Related papers (2024-10-11T14:03:31Z)
- E-Motion: Future Motion Simulation via Event Sequence Diffusion [86.80533612211502]
Event-based sensors may potentially offer a unique opportunity to predict future motion with a level of detail and precision previously unachievable.
We propose to integrate the strong learning capacity of the video diffusion model with the rich motion information of an event camera as a motion simulation framework.
Our findings suggest a promising direction for future research in enhancing the interpretative power and predictive accuracy of computer vision systems.
arXiv Detail & Related papers (2024-10-11T09:19:23Z)
- SlotGNN: Unsupervised Discovery of Multi-Object Representations and Visual Dynamics [15.705023986053575]
This paper presents a novel framework for learning multi-object dynamics from visual data using unsupervised techniques.
The paper introduces two new architectures: SlotTransport, for discovering object representations from RGB images, and SlotGNN, for predicting their collective dynamics from RGB images and robot interactions.
With only minimal additional data, our framework robustly predicts slots and their corresponding dynamics in real-world control tasks.
arXiv Detail & Related papers (2023-10-06T22:37:34Z)
- Spotlight Attention: Robust Object-Centric Learning With a Spatial Locality Prior [88.9319150230121]
Object-centric vision aims to construct an explicit representation of the objects in a scene.
We incorporate a spatial-locality prior into state-of-the-art object-centric vision models.
We obtain significant improvements in segmenting objects in both synthetic and real-world datasets.
arXiv Detail & Related papers (2023-05-31T04:35:50Z)
- SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models [47.986381326169166]
We introduce SlotDiffusion -- an object-centric Latent Diffusion Model (LDM) designed for both image and video data.
Thanks to the powerful modeling capacity of LDMs, SlotDiffusion surpasses previous slot models in unsupervised object segmentation and visual generation.
Our learned object features can be utilized by existing object-centric dynamics models, improving video prediction quality and downstream temporal reasoning tasks.
arXiv Detail & Related papers (2023-05-18T19:56:20Z)
- Solving Reasoning Tasks with a Slot Transformer [7.966351917016229]
We present the Slot Transformer, an architecture that leverages slot attention, transformers and iterative variational inference on video scene data to infer representations.
We evaluate the effectiveness of key components of the architecture, the model's representational capacity and its ability to predict from incomplete input.
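Slot attention (Locatello et al., 2020) is the shared building block behind several papers in this list. As a rough illustration, here is a stripped-down NumPy version of its core iteration; the key detail is that the softmax is taken over the slot axis, so slots compete for inputs. The learned projections, layer norms, and GRU update of the published algorithm are omitted, so treat this as a sketch, not the Slot Transformer's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, slots, n_iters=3):
    """Simplified slot-attention iteration.
    inputs: (n_inputs, d) feature vectors; slots: (n_slots, d) initial slots."""
    d = inputs.shape[-1]
    for _ in range(n_iters):
        # Softmax over the *slot* axis: slots compete for each input.
        attn = softmax(slots @ inputs.T / np.sqrt(d), axis=0)  # (n_slots, n_inputs)
        # Update each slot as the weighted mean of the inputs assigned to it.
        attn = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)
        slots = attn @ inputs
    return slots
```

Because the competition is over slots rather than inputs, each input's attention mass is divided among the slots, which is what encourages a soft partition of the scene into objects.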
arXiv Detail & Related papers (2022-10-20T16:40:30Z)
- Robust and Controllable Object-Centric Learning through Energy-based Models [95.68748828339059]
The proposed method is a conceptually simple and general approach to learning object-centric representations through an energy-based model.
We show that it can be easily integrated into existing architectures and can effectively extract high-quality object-centric representations.
arXiv Detail & Related papers (2022-10-11T15:11:15Z)
- Entropy-driven Unsupervised Keypoint Representation Learning in Videos [7.940371647421243]
We present a novel approach for unsupervised learning of meaningful representations from videos.
We argue that local entropy of pixel neighborhoods and their temporal evolution create valuable intrinsic supervisory signals for learning prominent features.
Our empirical results show superior performance for our information-driven keypoints, which resolve challenges such as attending to both static and dynamic objects and handling objects that abruptly enter and leave the scene.
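The "local entropy of pixel neighborhoods" used as a supervisory signal here can be made concrete with a small sketch: the Shannon entropy of the intensity histogram in a sliding window. The window and bin sizes below are illustrative choices, not the paper's settings.

```python
import numpy as np

def local_entropy(img, win=3, bins=8):
    """Shannon entropy of the intensity histogram in each win x win
    neighborhood; img is a 2-D array of values in [0, 1]."""
    h, w = img.shape
    pad = win // 2
    padded = np.pad(img, pad, mode='reflect')
    out = np.zeros_like(img, dtype=float)
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + win, j:j + win]
            hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
            p = hist / hist.sum()
            p = p[p > 0]                       # 0 * log(0) := 0
            out[i, j] = -(p * np.log2(p)).sum()
    return out
```

Uniform regions score zero entropy while textured or changing regions score high, which is the kind of intrinsic signal the paper exploits; its actual formulation (and its temporal component) may differ.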
arXiv Detail & Related papers (2022-09-30T12:03:52Z)
- SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization [59.732036564862796]
We propose the Structure Information Modeling Transformer (SIM-Trans), which incorporates object structure information into the transformer to enhance discriminative representation learning.
The proposed two modules are light-weighted and can be plugged into any transformer network and trained end-to-end easily.
Experiments and analyses demonstrate that the proposed SIM-Trans achieves state-of-the-art performance on fine-grained visual categorization benchmarks.
arXiv Detail & Related papers (2022-08-31T03:00:07Z) - Dynamic Visual Reasoning by Learning Differentiable Physics Models from
Video and Language [92.7638697243969]
We propose a unified framework that can jointly learn visual concepts and infer physics models of objects from videos and language.
This is achieved by seamlessly integrating three components: a visual perception module, a concept learner, and a differentiable physics engine.
arXiv Detail & Related papers (2021-10-28T17:59:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.