Object-Aware DINO (Oh-A-Dino): Enhancing Self-Supervised Representations for Multi-Object Instance Retrieval
- URL: http://arxiv.org/abs/2503.09867v2
- Date: Sun, 16 Mar 2025 01:00:45 GMT
- Title: Object-Aware DINO (Oh-A-Dino): Enhancing Self-Supervised Representations for Multi-Object Instance Retrieval
- Authors: Stefan Sylvius Wagner, Stefan Harmeling,
- Abstract summary: Self-supervised vision models like DINO have shown emergent object understanding.<n>DINO representations excel at capturing global object attributes but struggle with object-level details like colour.<n>We propose a method that combines global and local features by augmenting DINO representations with object-centric latent vectors.
- Score: 1.4272411349249627
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Object-centric learning is fundamental to human vision and crucial for models requiring complex reasoning. Traditional approaches rely on slot-based bottlenecks to learn object properties explicitly, while recent self-supervised vision models like DINO have shown emergent object understanding. However, DINO representations primarily capture global scene features, often confounding individual object attributes. We investigate the effectiveness of DINO representations and slot-based methods for multi-object instance retrieval. Our findings reveal that DINO representations excel at capturing global object attributes such as object shape and size, but struggle with object-level details like colour, whereas slot-based representations struggle at both global and object-level understanding. To address this, we propose a method that combines global and local features by augmenting DINO representations with object-centric latent vectors from a Variational Autoencoder trained on segmented image patches that are extracted from the DINO features. This approach improves multi-object instance retrieval performance, bridging the gap between global scene understanding and fine-grained object representation without requiring full model retraining.
Related papers
- Are We Done with Object-Centric Learning? [65.67948794110212]
Object-centric learning (OCL) seeks to learn representations that only encode an object, isolated from other objects or background cues in a scene.
With recent sample-efficient segmentation models, we can separate objects in the pixel space and encode them independently.
We address the OOD generalization challenge caused by spurious background cues through the lens of OCL.
arXiv Detail & Related papers (2025-04-09T17:59:05Z) - CTRL-O: Language-Controllable Object-Centric Visual Representation Learning [30.218743514199016]
Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files"
Current object-centric models learn representations based on their preconceived understanding of objects, without allowing user input to guide which objects are represented.
We propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions.
arXiv Detail & Related papers (2025-03-27T17:53:50Z) - Bootstrapping Top-down Information for Self-modulating Slot Attention [29.82550058869251]
We propose a novel OCL framework incorporating a top-down pathway.
This pathway bootstraps the semantics of individual objects and then modulates the model to prioritize features relevant to these semantics.
Our framework achieves state-of-the-art performance across multiple synthetic and real-world object-discovery benchmarks.
arXiv Detail & Related papers (2024-11-04T05:00:49Z) - Learning Global Object-Centric Representations via Disentangled Slot Attention [38.78205074748021]
This paper introduces a novel object-centric learning method to empower AI systems with human-like capabilities to identify objects across scenes and generate diverse scenes containing specific objects by learning a set of global object-centric representations.
Experimental results substantiate the efficacy of the proposed method, demonstrating remarkable proficiency in global object-centric representation learning, object identification, scene generation with specific objects and scene decomposition.
arXiv Detail & Related papers (2024-10-24T14:57:00Z) - Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z) - Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z) - SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric
Action Recognition [35.4163266882568]
We introduce Self-Supervised Learning Over Sets (SOS) to pre-train a generic Objects In Contact (OIC) representation model.
Our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
arXiv Detail & Related papers (2022-04-10T23:27:19Z) - Complex-Valued Autoencoders for Object Discovery [62.26260974933819]
We propose a distributed approach to object-centric representations: the Complex AutoEncoder.
We show that this simple and efficient approach achieves better reconstruction performance than an equivalent real-valued autoencoder on simple multi-object datasets.
We also show that it achieves competitive unsupervised object discovery performance to a SlotAttention model on two datasets, and manages to disentangle objects in a third dataset where SlotAttention fails - all while being 7-70 times faster to train.
arXiv Detail & Related papers (2022-04-05T09:25:28Z) - Learning Open-World Object Proposals without Learning to Classify [110.30191531975804]
We propose a classification-free Object Localization Network (OLN) which estimates the objectness of each region purely by how well the location and shape of a region overlaps with any ground-truth object.
This simple strategy learns generalizable objectness and outperforms existing proposals on cross-category generalization.
arXiv Detail & Related papers (2021-08-15T14:36:02Z) - Look-into-Object: Self-supervised Structure Modeling for Object
Recognition [71.68524003173219]
We propose to "look into object" (explicitly yet intrinsically model the object structure) through incorporating self-supervisions.
We show the recognition backbone can be substantially enhanced for more robust representation learning.
Our approach achieves large performance gain on a number of benchmarks, including generic object recognition (ImageNet) and fine-grained object recognition tasks (CUB, Cars, Aircraft)
arXiv Detail & Related papers (2020-03-31T12:22:51Z) - Global-Local Bidirectional Reasoning for Unsupervised Representation
Learning of 3D Point Clouds [109.0016923028653]
We learn point cloud representation by bidirectional reasoning between the local structures and the global shape without human supervision.
We show that our unsupervised model surpasses the state-of-the-art supervised methods on both synthetic and real-world 3D object classification datasets.
arXiv Detail & Related papers (2020-03-29T08:26:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.