Are We Done with Object-Centric Learning?
- URL: http://arxiv.org/abs/2504.07092v2
- Date: Thu, 10 Apr 2025 21:45:00 GMT
- Title: Are We Done with Object-Centric Learning?
- Authors: Alexander Rubinstein, Ameya Prabhu, Matthias Bethge, Seong Joon Oh,
- Abstract summary: Object-centric learning (OCL) seeks to learn representations that only encode an object, isolated from other objects or background cues in a scene.<n>With recent sample-efficient segmentation models, we can separate objects in the pixel space and encode them independently.<n>We address the OOD generalization challenge caused by spurious background cues through the lens of OCL.
- Score: 65.67948794110212
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Object-centric learning (OCL) seeks to learn representations that only encode an object, isolated from other objects or background cues in a scene. This approach underpins various aims, including out-of-distribution (OOD) generalization, sample-efficient composition, and modeling of structured environments. Most research has focused on developing unsupervised mechanisms that separate objects into discrete slots in the representation space, evaluated using unsupervised object discovery. However, with recent sample-efficient segmentation models, we can separate objects in the pixel space and encode them independently. This achieves remarkable zero-shot performance on OOD object discovery benchmarks, is scalable to foundation models, and can handle a variable number of slots out-of-the-box. Hence, the goal of OCL methods to obtain object-centric representations has been largely achieved. Despite this progress, a key question remains: How does the ability to separate objects within a scene contribute to broader OCL objectives, such as OOD generalization? We address this by investigating the OOD generalization challenge caused by spurious background cues through the lens of OCL. We propose a novel, training-free probe called Object-Centric Classification with Applied Masks (OCCAM), demonstrating that segmentation-based encoding of individual objects significantly outperforms slot-based OCL methods. However, challenges in real-world applications remain. We provide the toolbox for the OCL community to use scalable object-centric representations, and focus on practical applications and fundamental questions, such as understanding object perception in human cognition. Our code is available here: https://github.com/AlexanderRubinstein/OCCAM.
Related papers
- Deep Reinforcement Learning via Object-Centric Attention [17.623937562865617]
We introduce Object-Centric Attention via Masking (OCCAM), which selectively preserves task-relevant entities while filtering out irrelevant visual information.<n>OCCAM significantly improves to novel perturbations and reduces sample complexity while showing similar or improved performance compared to conventional pixel-based RL.
arXiv Detail & Related papers (2025-04-03T20:48:27Z) - Object-Aware DINO (Oh-A-Dino): Enhancing Self-Supervised Representations for Multi-Object Instance Retrieval [1.4272411349249627]
Self-supervised vision models like DINO have shown emergent object understanding.
DINO representations excel at capturing global object attributes but struggle with object-level details like colour.
We propose a method that combines global and local features by augmenting DINO representations with object-centric latent vectors.
arXiv Detail & Related papers (2025-03-12T21:57:41Z) - Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z) - Cycle Consistency Driven Object Discovery [75.60399804639403]
We introduce a method that explicitly optimize the constraint that each object in a scene should be associated with a distinct slot.
By integrating these consistency objectives into various existing slot-based object-centric methods, we showcase substantial improvements in object-discovery performance.
Our results suggest that the proposed approach not only improves object discovery, but also provides richer features for downstream tasks.
arXiv Detail & Related papers (2023-06-03T21:49:06Z) - SupeRGB-D: Zero-shot Instance Segmentation in Cluttered Indoor
Environments [67.34330257205525]
In this work, we explore zero-shot instance segmentation (ZSIS) from RGB-D data to identify unseen objects in a semantic category-agnostic manner.
We present a method that uses annotated objects to learn the objectness'' of pixels and generalize to unseen object categories in cluttered indoor environments.
arXiv Detail & Related papers (2022-12-22T17:59:48Z) - Learning Dense Object Descriptors from Multiple Views for Low-shot
Category Generalization [27.583517870047487]
We propose Deep Object Patch rimis (DOPE), which can be trained from multiple views of object instances without any category or semantic object part labels.
To train DOPE, we assume access to sparse depths, foreground masks and known cameras, to obtain pixel-level correspondences between views of an object.
We find that DOPE can directly be used for low-shot classification of novel categories using local-part matching, and is competitive with and outperforms supervised and self-supervised learning baselines.
arXiv Detail & Related papers (2022-11-28T04:31:53Z) - Complex-Valued Autoencoders for Object Discovery [62.26260974933819]
We propose a distributed approach to object-centric representations: the Complex AutoEncoder.
We show that this simple and efficient approach achieves better reconstruction performance than an equivalent real-valued autoencoder on simple multi-object datasets.
We also show that it achieves competitive unsupervised object discovery performance to a SlotAttention model on two datasets, and manages to disentangle objects in a third dataset where SlotAttention fails - all while being 7-70 times faster to train.
arXiv Detail & Related papers (2022-04-05T09:25:28Z) - Learning Open-World Object Proposals without Learning to Classify [110.30191531975804]
We propose a classification-free Object Localization Network (OLN) which estimates the objectness of each region purely by how well the location and shape of a region overlaps with any ground-truth object.
This simple strategy learns generalizable objectness and outperforms existing proposals on cross-category generalization.
arXiv Detail & Related papers (2021-08-15T14:36:02Z) - Look-into-Object: Self-supervised Structure Modeling for Object
Recognition [71.68524003173219]
We propose to "look into object" (explicitly yet intrinsically model the object structure) through incorporating self-supervisions.
We show the recognition backbone can be substantially enhanced for more robust representation learning.
Our approach achieves large performance gain on a number of benchmarks, including generic object recognition (ImageNet) and fine-grained object recognition tasks (CUB, Cars, Aircraft)
arXiv Detail & Related papers (2020-03-31T12:22:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.