Slot Attention with Re-Initialization and Self-Distillation
- URL: http://arxiv.org/abs/2507.23755v1
- Date: Thu, 31 Jul 2025 17:41:18 GMT
- Title: Slot Attention with Re-Initialization and Self-Distillation
- Authors: Rongzhen Zhao, Yi Zhao, Juho Kannala, Joni Pajarinen
- Abstract summary: We propose Slot Attention with re-Initialization and self-Distillation (DIAS) for object discovery and recognition. DIAS achieves state-of-the-art performance on OCL tasks such as object discovery and recognition, while also improving advanced visual prediction and reasoning.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unlike popular solutions based on dense feature maps, Object-Centric Learning (OCL) represents visual scenes as sub-symbolic object-level feature vectors, termed slots, which are highly versatile for tasks involving visual modalities. OCL typically aggregates object superpixels into slots by iteratively applying competitive cross attention, known as Slot Attention, with the slots as the query. However, once initialized, these slots are reused naively, causing redundant slots to compete with informative ones for representing objects. This often results in objects being erroneously segmented into parts. Additionally, mainstream methods derive supervision signals solely from decoding slots into the input's reconstruction, overlooking potential supervision based on internal information. To address these issues, we propose Slot Attention with re-Initialization and self-Distillation (DIAS): $\emph{i)}$ We reduce redundancy in the aggregated slots and re-initialize extra aggregation to update the remaining slots; $\emph{ii)}$ We drive the bad attention map at the first aggregation iteration to approximate the good one at the last iteration to enable self-distillation. Experiments demonstrate that DIAS achieves state-of-the-art performance on OCL tasks such as object discovery and recognition, while also improving advanced visual prediction and reasoning. Our code is available at https://github.com/Genera1Z/DIAS.
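For concreteness, the sketch below pairs a standard Slot Attention loop (the iterative competitive cross attention the abstract describes) with an illustrative self-distillation loss that drives the first iteration's attention map toward the detached last one. All names, dimensions, and the KL formulation are assumptions for illustration, not the authors' implementation; DIAS's redundancy reduction and re-initialization step is only indicated by a comment. See the linked repository for the real code.

```python
# Minimal Slot Attention loop plus an illustrative DIAS-style
# self-distillation term. Hyperparameters and the KL form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotAttention(nn.Module):
    def __init__(self, num_slots=7, dim=64, iters=3, eps=1e-8):
        super().__init__()
        self.num_slots, self.iters, self.eps = num_slots, iters, eps
        self.scale = dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_sigma = nn.Parameter(torch.rand(1, 1, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (batch, n_inputs, dim)
        b, n, d = x.shape
        x = self.norm_in(x)
        k, v = self.to_k(x), self.to_v(x)
        slots = self.slots_mu + self.slots_sigma * torch.randn(
            b, self.num_slots, d, device=x.device)
        attn_maps = []
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            logits = torch.einsum('bnd,bkd->bnk', k, q) * self.scale
            attn = logits.softmax(dim=-1)       # slots compete per location
            attn_maps.append(attn)
            w = attn / (attn.sum(dim=1, keepdim=True) + self.eps)
            updates = torch.einsum('bnk,bnd->bkd', w, v)
            slots = self.gru(updates.reshape(-1, d),
                             slots.reshape(-1, d)).reshape(b, -1, d)
            # DIAS would prune redundant slots here and re-initialize an
            # extra aggregation round to update the survivors (not shown).
        return slots, attn_maps

def self_distill_loss(attn_maps):
    """Pull the first (coarse) attention map toward the last (refined) one."""
    log_first = attn_maps[0].clamp_min(1e-8).log()
    target = attn_maps[-1].detach()             # stop-gradient on the target
    return F.kl_div(log_first, target, reduction='batchmean')
```

Detaching the last-iteration map makes it a fixed teacher, so the distillation term only shapes the earlier, noisier iterations; this is one reading of the internal supervision signal the abstract refers to.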
Related papers
- MetaSlot: Break Through the Fixed Number of Slots in Object-Centric Learning
We introduce MetaSlot, a plug-and-play Slot Attention variant that adapts to variable object counts.
We show that MetaSlot achieves significant performance gains and markedly more interpretable slot representations compared with existing Slot Attention variants.
arXiv Detail & Related papers (2025-05-27T06:23:03Z)
- Are We Done with Object-Centric Learning?
Object-centric learning (OCL) seeks to learn representations that only encode an object, isolated from other objects or background cues in a scene.
With recent sample-efficient segmentation models, we can separate objects in the pixel space and encode them independently.
We address the OOD generalization challenge caused by spurious background cues through the lens of OCL.
arXiv Detail & Related papers (2025-04-09T17:59:05Z)
- Attention Normalization Impacts Cardinality Generalization in Slot Attention
We propose and investigate alternatives to the original normalization scheme that improve Slot Attention's generalization to varying slot and object counts.
The newly proposed normalizations are minimal, easy-to-implement modifications of the usual Slot Attention module.
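As a rough illustration of where the normalization choice sits in the slot update, the snippet below contrasts the original weighted-mean scheme with a simple stand-in alternative; the `mean_over_inputs` branch is hypothetical and not necessarily one of the paper's proposals.

```python
# Illustrative only: the normalization step inside one slot update.
import torch

def slot_update(attn_logits, v, eps=1e-8, scheme='weighted_mean'):
    # attn_logits: (batch, n_inputs, n_slots); v: (batch, n_inputs, dim)
    attn = attn_logits.softmax(dim=-1)     # slots compete for each input
    if scheme == 'weighted_mean':          # original Slot Attention: each
        w = attn / (attn.sum(dim=1, keepdim=True) + eps)  # slot averages its inputs
    else:                                  # hypothetical 'mean_over_inputs':
        w = attn / attn.shape[1]           # slot magnitude can then reflect
                                           # how much area the slot claims
    return torch.einsum('bns,bnd->bsd', w, v)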
arXiv Detail & Related papers (2024-07-04T22:09:01Z)
- Adaptive Slot Attention: Object Discovery with Dynamic Slot Number
A major drawback of most object-centric models, including slot attention, is their reliance on predefining the number of slots.
Within this framework, we introduce an adaptive slot attention (AdaSlot) mechanism that dynamically determines the optimal number of slots.
Our framework, tested extensively on object discovery tasks with various datasets, shows performance matching or exceeding top fixed-slot models.
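One plausible way to realize a dynamic slot count is to score each slot and sample a differentiable keep/drop decision, e.g. via Gumbel-Softmax; the module below is a hedged sketch of that idea, not AdaSlot's actual selection mechanism.

```python
# Sketch: differentiable slot selection for a dynamic slot count.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotSelector(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.score = nn.Linear(dim, 2)       # per-slot (drop, keep) logits

    def forward(self, slots, tau=1.0):
        # slots: (batch, max_slots, dim)
        logits = self.score(slots)
        # hard=True yields a discrete keep/drop mask with a straight-through
        # gradient, so the number of active slots can vary per sample
        keep = F.gumbel_softmax(logits, tau=tau, hard=True)[..., 1:]
        return slots * keep, keep.squeeze(-1)
```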
arXiv Detail & Related papers (2024-06-13T14:55:11Z)
- Object-Centric Multiple Object Tracking
This paper proposes a video object-centric model for multiple-object tracking pipelines.
It consists of an index-merge module that adapts the object-centric slots into detection outputs and an object memory module.
Benefiting from object-centric learning, our approach requires only sparse detection labels for object localization and feature binding.
arXiv Detail & Related papers (2023-09-01T03:34:12Z)
- Enhancing Interpretable Object Abstraction via Clustering-based Slot Initialization
We present a new method for object-centric representations using slots initialized via clustering.
We evaluate our method on object discovery and novel view synthesis tasks with various datasets, and it outperforms prior works consistently.
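A minimal sketch of the general idea, assuming a few k-means steps over encoder features supply the initial slots in place of random Gaussian samples (the paper's exact clustering procedure may differ):

```python
# Sketch: k-means over per-image features as slot initialization.
import torch

def kmeans_slot_init(feats, num_slots=7, iters=10):
    # feats: (batch, n_inputs, dim) -> centroids: (batch, num_slots, dim)
    b, n, d = feats.shape
    idx = torch.randint(0, n, (b, num_slots), device=feats.device)
    cent = feats.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))
    for _ in range(iters):
        assign = torch.cdist(feats, cent).argmin(dim=-1)        # (b, n)
        onehot = torch.nn.functional.one_hot(assign, num_slots).float()
        mass = onehot.sum(dim=1).clamp_min(1.0)                 # (b, k)
        cent = torch.einsum('bnk,bnd->bkd', onehot, feats) / mass.unsqueeze(-1)
    return cent
```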
arXiv Detail & Related papers (2023-08-22T11:48:43Z)
- Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos
Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence.
We propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps.
We adopt semantic- and instance-level temporal consistency as self-supervision to encourage temporally coherent object-centric representations.
arXiv Detail & Related papers (2023-08-19T09:12:13Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery
Unsupervised object discovery is promising due to its ability to find objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We apply Principal Component Analysis (PCA) to localize object regions.
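A minimal sketch of PCA-based localization, assuming flattened self-supervised features and a simple mean threshold (both illustrative choices rather than the paper's exact recipe):

```python
# Sketch: first principal component of per-pixel features as a coarse
# foreground/background split.
import torch

def pca_foreground_mask(feats):
    # feats: (n_pixels, dim), e.g. flattened ViT features of one image
    centered = feats - feats.mean(dim=0, keepdim=True)
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    proj = centered @ vh[0]          # projection onto the 1st PC
    mask = proj > proj.mean()        # split along the principal direction
    # heuristically flip so the smaller side is treated as foreground
    return mask if mask.float().mean() < 0.5 else ~mask
```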
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- Shepherding Slots to Objects: Towards Stable and Robust Object-Centric Learning
Single-view images carry less information about how to disentangle a given scene than videos or multi-view images do.
We introduce a novel OCL framework for single-view images, SLot Attention via SHepherding (SLASH), which consists of two simple-yet-effective modules on top of Slot Attention.
Our proposed method enables consistent learning of object-centric representation and achieves strong performance across four datasets.
arXiv Detail & Related papers (2023-03-31T07:07:29Z)
- Self-Supervised Video Object Segmentation via Cutout Prediction and Tagging
We propose a novel self-supervised Video Object Segmentation (VOS) approach that strives to achieve better object-background discriminability.
Our approach is based on a discriminative learning loss formulation that takes into account both object and background information.
Our proposed approach, CT-VOS, achieves state-of-the-art results on two challenging benchmarks: DAVIS-2017 and YouTube-VOS.
arXiv Detail & Related papers (2022-04-22T17:53:27Z)