Unsupervised Object-Centric Learning with Bi-Level Optimized Query Slot
Attention
- URL: http://arxiv.org/abs/2210.08990v1
- Date: Mon, 17 Oct 2022 12:14:59 GMT
- Title: Unsupervised Object-Centric Learning with Bi-Level Optimized Query Slot
Attention
- Authors: Baoxiong Jia, Yu Liu, Siyuan Huang
- Abstract summary: Slot-Attention module has played an important role with its simple yet effective design and fostered many powerful variants.
We propose to address these issues by initializing Slot-Attention modules with learnable queries and (2) optimizing the model with bi-level optimization.
Our model achieves state-of-the-art results on both synthetic and complex real-world datasets in unsupervised image segmentation and reconstruction.
- Score: 26.25900877220557
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The ability to decompose complex natural scenes into meaningful
object-centric abstractions lies at the core of human perception and reasoning.
In the recent culmination of unsupervised object-centric learning, the
Slot-Attention module has played an important role with its simple yet
effective design and fostered many powerful variants. These methods, however,
have been exceedingly difficult to train without supervision and are ambiguous
in the notion of object, especially for complex natural scenes. In this paper,
we propose to address these issues by (1) initializing Slot-Attention modules
with learnable queries and (2) optimizing the model with bi-level optimization.
With simple code adjustments on the vanilla Slot-Attention, our model, Bi-level
Optimized Query Slot Attention, achieves state-of-the-art results on both
synthetic and complex real-world datasets in unsupervised image segmentation
and reconstruction, outperforming previous baselines by a large margin (~10%).
We provide thorough ablative studies to validate the necessity and
effectiveness of our design. Additionally, our model exhibits excellent
potential for concept binding and zero-shot learning. We hope our effort could
provide a single home for the design and learning of slot-based models and pave
the way for more challenging tasks in object-centric learning. Our
implementation is publicly available at
https://github.com/Wall-Facer-liuyu/BO-QSA.
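For readers who want the mechanics, below is a minimal PyTorch sketch of the two ideas the abstract names. It is an illustrative simplification under our own assumptions (slot count, feature dimension, a truncated-backprop/straight-through scheme standing in for bi-level optimization, and no post-GRU MLP), not the authors' implementation; the repository above contains the real one.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuerySlotAttention(nn.Module):
    """Minimal sketch of (1) learnable-query initialization and (2) a
    straight-through approximation of bi-level optimization. Dimensions,
    iteration count, and the omission of the usual post-GRU MLP are
    simplifying assumptions, not the authors' choices."""

    def __init__(self, num_slots: int = 7, dim: int = 64, iters: int = 3):
        super().__init__()
        self.iters = iters
        self.scale = dim ** -0.5
        # (1) Learnable queries replace the random Gaussian initialization
        # of vanilla Slot-Attention.
        self.slots_init = nn.Parameter(torch.randn(1, num_slots, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)

    def step(self, slots, k, v):
        # One Slot-Attention iteration: slots compete for input features
        # (softmax over the slot axis), then take a weighted mean of values.
        q = self.to_q(self.norm_slots(slots))
        attn = F.softmax(torch.einsum('bnd,bmd->bnm', q, k) * self.scale, dim=1)
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)
        updates = torch.einsum('bnm,bmd->bnd', attn, v)
        return self.gru(updates.flatten(0, 1), slots.flatten(0, 1)).view_as(slots)

    def forward(self, inputs):                     # inputs: (B, N, dim)
        inputs = self.norm_in(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        init = self.slots_init.expand(inputs.size(0), -1, -1)
        # (2) Run the inner iterations without tracking gradients ...
        slots = init
        with torch.no_grad():
            for _ in range(self.iters):
                slots = self.step(slots, k, v)
        # ... then re-attach straight-through so the learnable queries
        # still receive gradient, and take one differentiable step.
        slots = init + (slots - init).detach()
        return self.step(slots, k, v)
```
Detaching the inner iterations keeps the backward pass short and stable, while the straight-through re-attachment lets gradients still reach the learnable queries; for example, `QuerySlotAttention()(torch.randn(4, 1024, 64))` returns a `(4, 7, 64)` tensor of slots.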
Related papers
- ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models [39.520825264698374]
Vision Transformers (ViTs) have delivered remarkable progress through global self-attention, yet their quadratic complexity can become prohibitive for high-resolution inputs.
We present ViT-Linearizer, a cross-architecture distillation framework that transfers rich ViT representations into a linear-time, recurrent-style model.
Our results underscore the strong potential of RNN-based solutions for large-scale visual tasks, bridging the gap between theoretical efficiency and real-world practice.
arXiv Detail & Related papers (2025-03-30T15:35:24Z) - A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning [67.72413262980272]
Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear.
We develop SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck.
Our approach achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations.
arXiv Detail & Related papers (2025-03-10T06:18:31Z) - Efficient Exploration and Discriminative World Model Learning with an Object-Centric Abstraction [19.59151245929067]
We study whether giving an agent an object-centric mapping (describing a set of items and their attributes) allows for more efficient learning.
We find this problem is best solved hierarchically, by modelling items at a higher level of state abstraction than pixels.
We make use of this to propose a fully model-based algorithm that learns a discriminative world model.
arXiv Detail & Related papers (2024-08-21T17:59:31Z) - On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and Reasoning [33.89483627891117]
Recent language and vision assistants have showcased impressive capabilities but suffer from a lack of transparency.
Open-source models handle general image tasks effectively, but face challenges with the high computational demands of complex visually-situated text understanding.
This study aims to redefine the design of vision-language models by identifying key components and creating efficient models with constrained inference costs.
arXiv Detail & Related papers (2024-06-17T17:57:30Z) - Emergence and Function of Abstract Representations in Self-Supervised
Transformers [0.0]
We study the inner workings of small-scale transformers trained to reconstruct partially masked visual scenes.
We show that the network develops intermediate abstract representations, or abstractions, that encode all semantic features of the dataset.
Using precise manipulation experiments, we demonstrate that abstractions are central to the network's decision-making process.
arXiv Detail & Related papers (2023-12-08T20:47:15Z) - Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to localize objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z) - Cycle Consistency Driven Object Discovery [75.60399804639403]
We introduce a method that explicitly optimizes the constraint that each object in a scene should be associated with a distinct slot.
By integrating these consistency objectives into various existing slot-based object-centric methods, we showcase substantial improvements in object-discovery performance.
Our results suggest that the proposed approach not only improves object discovery, but also provides richer features for downstream tasks.
arXiv Detail & Related papers (2023-06-03T21:49:06Z) - Robust and Controllable Object-Centric Learning through Energy-based
Models [95.68748828339059]
Ours is a conceptually simple and general approach to learning object-centric representations through an energy-based model.
We show that ours can be easily integrated into existing architectures and can effectively extract high-quality object-centric representations.
arXiv Detail & Related papers (2022-10-11T15:11:15Z) - Sim2Real Object-Centric Keypoint Detection and Description [40.58367357980036]
Keypoint detection and description play a central role in computer vision.
We propose the object-centric formulation, which requires further identifying which object each interest point belongs to.
We develop a sim2real contrastive learning mechanism that can generalize the model trained in simulation to real-world applications.
arXiv Detail & Related papers (2022-02-01T15:00:20Z) - Reinforcement Learning for Sparse-Reward Object-Interaction Tasks in a
First-person Simulated 3D Environment [73.9469267445146]
First-person object-interaction tasks in high-fidelity, simulated 3D environments such as AI2Thor pose significant sample-efficiency challenges for reinforcement learning agents.
We show that one can learn object-interaction tasks from scratch without supervision by learning an attentive object-model as an auxiliary task.
arXiv Detail & Related papers (2020-10-28T19:27:26Z) - Goal-Aware Prediction: Learning to Model What Matters [105.43098326577434]
One of the fundamental challenges in using a learned forward dynamics model is the mismatch between the objective of the learned model and that of the downstream planner or policy.
We propose to direct prediction towards task relevant information, enabling the model to be aware of the current task and encouraging it to only model relevant quantities of the state space.
We find that our method more effectively models the relevant parts of the scene conditioned on the goal, and as a result outperforms standard task-agnostic dynamics models and model-free reinforcement learning.
arXiv Detail & Related papers (2020-07-14T16:42:59Z) - Look-into-Object: Self-supervised Structure Modeling for Object
Recognition [71.68524003173219]
We propose to "look into object" (explicitly yet intrinsically model the object structure) through incorporating self-supervisions.
We show the recognition backbone can be substantially enhanced for more robust representation learning.
Our approach achieves large performance gains on a number of benchmarks, including generic object recognition (ImageNet) and fine-grained object recognition tasks (CUB, Cars, Aircraft).
arXiv Detail & Related papers (2020-03-31T12:22:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.