Related papers: Oh-A-DINO: Understanding and Enhancing Attribute-Level Information in Self-Supervised Object-Centric Representations

URL: http://arxiv.org/abs/2503.09867v3
Date: Wed, 01 Oct 2025 19:39:01 GMT
Title: Oh-A-DINO: Understanding and Enhancing Attribute-Level Information in Self-Supervised Object-Centric Representations
Authors: Stefan Sylvius Wagner, Stefan Harmeling
Abstract summary: Self-supervised vision models and slot-based representations excel at identifying edge-derived geometry but fail to preserve non-geometric surface-level cues. We show that learning an auxiliary latent space over segmented patches, where VAE regularisation enforces compact, disentangled object-centric representations, recovers these missing attributes.
Score: 9.949149600332836
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Object-centric understanding is fundamental to human vision and required for complex reasoning. Traditional methods define slot-based bottlenecks to learn object properties explicitly, while recent self-supervised vision models like DINO have shown emergent object understanding. We investigate the effectiveness of self-supervised representations from models such as CLIP, DINOv2 and DINOv3, as well as slot-based approaches, for multi-object instance retrieval, where specific objects must be faithfully identified in a scene. This scenario is increasingly relevant as pre-trained representations are deployed in downstream tasks, e.g., retrieval, manipulation, and goal-conditioned policies that demand fine-grained object understanding. Our findings reveal that self-supervised vision models and slot-based representations excel at identifying edge-derived geometry (shape, size) but fail to preserve non-geometric surface-level cues (colour, material, texture), which are critical for disambiguating objects when reasoning about or selecting them in such tasks. We show that learning an auxiliary latent space over segmented patches, where VAE regularisation enforces compact, disentangled object-centric representations, recovers these missing attributes. Augmenting the self-supervised methods with such latents improves retrieval across all attributes, suggesting a promising direction for making self-supervised representations more reliable in downstream tasks that require precise object-level reasoning.

Related papers

Are We Done with Object-Centric Learning? [65.67948794110212]
Object-centric learning (OCL) seeks to learn representations that only encode an object, isolated from other objects or background cues in a scene. With recent sample-efficient segmentation models, we can separate objects in the pixel space and encode them independently. We address the OOD generalization challenge caused by spurious background cues through the lens of OCL.
arXiv Detail & Related papers (2025-04-09T17:59:05Z)
CTRL-O: Language-Controllable Object-Centric Visual Representation Learning [30.218743514199016]
Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files". Current object-centric models learn representations based on their preconceived understanding of objects, without allowing user input to guide which objects are represented. We propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions.
arXiv Detail & Related papers (2025-03-27T17:53:50Z)
Bootstrapping Top-down Information for Self-modulating Slot Attention [29.82550058869251]
We propose a novel OCL framework incorporating a top-down pathway. This pathway bootstraps the semantics of individual objects and then modulates the model to prioritize features relevant to these semantics. Our framework achieves state-of-the-art performance across multiple synthetic and real-world object-discovery benchmarks.
arXiv Detail & Related papers (2024-11-04T05:00:49Z)
Learning Global Object-Centric Representations via Disentangled Slot Attention [38.78205074748021]
This paper introduces a novel object-centric learning method to empower AI systems with human-like capabilities to identify objects across scenes and generate diverse scenes containing specific objects by learning a set of global object-centric representations. Experimental results substantiate the efficacy of the proposed method, demonstrating remarkable proficiency in global object-centric representation learning, object identification, scene generation with specific objects and scene decomposition.
arXiv Detail & Related papers (2024-10-24T14:57:00Z)
Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization. We introduce a benchmark comprising eight different synthetic and real-world datasets. We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
PEEKABOO: Hiding parts of an image for unsupervised object localization [7.161489957025654]
Localizing objects in an unsupervised manner poses significant challenges due to the absence of key visual information. We propose a single-stage learning framework, dubbed PEEKABOO, for unsupervised object localization. The key idea is to selectively hide parts of an image and leverage the remaining image information to infer the location of objects without explicit supervision.
arXiv Detail & Related papers (2024-07-24T20:35:20Z)
Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner. We design a semantic-guided self-supervised learning model to extract high-level semantic features from images. We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
Cycle Consistency Driven Object Discovery [75.60399804639403]
We introduce a method that explicitly optimizes the constraint that each object in a scene should be associated with a distinct slot. By integrating these consistency objectives into various existing slot-based object-centric methods, we showcase substantial improvements in object-discovery performance. Our results suggest that the proposed approach not only improves object discovery, but also provides richer features for downstream tasks.
arXiv Detail & Related papers (2023-06-03T21:49:06Z)
SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action Recognition [35.4163266882568]
We introduce Self-Supervised Learning Over Sets (SOS) to pre-train a generic Objects In Contact (OIC) representation model. Our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
arXiv Detail & Related papers (2022-04-10T23:27:19Z)
Complex-Valued Autoencoders for Object Discovery [62.26260974933819]
We propose a distributed approach to object-centric representations: the Complex AutoEncoder. We show that this simple and efficient approach achieves better reconstruction performance than an equivalent real-valued autoencoder on simple multi-object datasets. We also show that its unsupervised object discovery performance is competitive with a SlotAttention model on two datasets, and that it manages to disentangle objects in a third dataset where SlotAttention fails, all while being 7-70 times faster to train.
arXiv Detail & Related papers (2022-04-05T09:25:28Z)
Information-Theoretic Odometry Learning [83.36195426897768]
We propose a unified information theoretic framework for learning-motivated methods aimed at odometry estimation. The proposed framework provides an elegant tool for performance evaluation and understanding in information-theoretic language.
arXiv Detail & Related papers (2022-03-11T02:37:35Z)
Object Pursuit: Building a Space of Objects via Discriminative Weight Generation [23.85039747700698]
We propose a framework to continuously learn object-centric representations for visual learning and understanding. We leverage interactions to sample diverse variations of an object and the corresponding training signals while learning the object-centric representations. We perform an extensive study of the key features of the proposed framework and analyze the characteristics of the learned representations.
arXiv Detail & Related papers (2021-12-15T08:25:30Z)
Learning Open-World Object Proposals without Learning to Classify [110.30191531975804]
We propose a classification-free Object Localization Network (OLN) which estimates the objectness of each region purely by how well the location and shape of a region overlaps with any ground-truth object. This simple strategy learns generalizable objectness and outperforms existing proposals on cross-category generalization.
arXiv Detail & Related papers (2021-08-15T14:36:02Z)
Look-into-Object: Self-supervised Structure Modeling for Object Recognition [71.68524003173219]
We propose to "look into object" (explicitly yet intrinsically model the object structure) through incorporating self-supervisions. We show the recognition backbone can be substantially enhanced for more robust representation learning. Our approach achieves large performance gains on a number of benchmarks, including generic object recognition (ImageNet) and fine-grained object recognition tasks (CUB, Cars, Aircraft).
arXiv Detail & Related papers (2020-03-31T12:22:51Z)
Global-Local Bidirectional Reasoning for Unsupervised Representation Learning of 3D Point Clouds [109.0016923028653]
We learn point cloud representation by bidirectional reasoning between the local structures and the global shape without human supervision. We show that our unsupervised model surpasses the state-of-the-art supervised methods on both synthetic and real-world 3D object classification datasets.
arXiv Detail & Related papers (2020-03-29T08:26:08Z)
Relevance-Guided Modeling of Object Dynamics for Reinforcement Learning [0.0951828574518325]
Current deep reinforcement learning (RL) approaches incorporate minimal prior knowledge about the environment. We propose a framework for reasoning about object dynamics and behavior to rapidly determine minimal and task-specific object representations. We also highlight the potential of this framework on several Atari games, using our object representation and standard RL and planning algorithms to learn dramatically faster than existing deep RL algorithms.
arXiv Detail & Related papers (2020-03-03T08:18:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.