Guided Latent Slot Diffusion for Object-Centric Learning
- URL: http://arxiv.org/abs/2407.17929v1
- Date: Thu, 25 Jul 2024 10:38:32 GMT
- Title: Guided Latent Slot Diffusion for Object-Centric Learning
- Authors: Krishnakant Singh, Simone Schaub-Meyer, Stefan Roth
- Abstract summary: We introduce Guided Latent Slot Diffusion - GLASS, an object-centric model that uses generated captions as a guiding signal to better align slots with objects.
For object discovery, GLASS achieves approx. a +35% and +10% relative improvement for mIoU over the previous state-of-the-art (SOTA) method on the VOC and COCO datasets, respectively.
For the segmentation task, GLASS surpasses SOTA weakly-supervised and language-based segmentation models, which were specifically designed for the task.
- Score: 13.721373817758307
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Slot attention aims to decompose an input image into a set of meaningful object files (slots). These latent object representations enable various downstream tasks. Yet, these slots often bind to object parts, not objects themselves, especially for real-world datasets. To address this, we introduce Guided Latent Slot Diffusion - GLASS, an object-centric model that uses generated captions as a guiding signal to better align slots with objects. Our key insight is to learn the slot-attention module in the space of generated images. This allows us to repurpose the pre-trained diffusion decoder model, which reconstructs the images from the slots, as a semantic mask generator based on the generated captions. GLASS learns an object-level representation suitable for multiple tasks simultaneously, e.g., segmentation, image generation, and property prediction, outperforming previous methods. For object discovery, GLASS achieves approx. a +35% and +10% relative improvement for mIoU over the previous state-of-the-art (SOTA) method on the VOC and COCO datasets, respectively, and establishes a new SOTA FID score for conditional image generation amongst slot-attention-based methods. For the segmentation task, GLASS surpasses SOTA weakly-supervised and language-based segmentation models, which were specifically designed for the task.
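The abstract reports gains as relative improvements in mIoU. As a minimal sketch of what such a figure means, the Python snippet below computes mean IoU over mask pairs and the relative gain between two models. The masks are toy, hypothetical data (not from the paper), the helper names are illustrative, and the sketch assumes predicted and ground-truth masks are already matched one-to-one; the paper's actual evaluation protocol is not described here.

```python
def iou(pred, target):
    """Intersection over union of two binary masks given as flat 0/1 sequences."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Two empty masks are treated as a perfect match (a convention, assumed here).
    return inter / union if union else 1.0


def mean_iou(pred_masks, target_masks):
    """Average IoU over already-matched (prediction, ground-truth) mask pairs."""
    scores = [iou(p, t) for p, t in zip(pred_masks, target_masks)]
    return sum(scores) / len(scores)


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Toy 1-D "masks" (hypothetical data, not results from the paper).
targets = [[1, 1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1, 1, 1]]
baseline_preds = [[1, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1, 0, 0]]
improved_preds = [[1, 1, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1, 1, 0]]

baseline_miou = mean_iou(baseline_preds, targets)
improved_miou = mean_iou(improved_preds, targets)
gain = relative_improvement(improved_miou, baseline_miou)

print(f"baseline mIoU={baseline_miou:.2f}, "
      f"improved mIoU={improved_miou:.2f}, relative gain={gain:+.0f}%")
# prints: baseline mIoU=0.50, improved mIoU=0.75, relative gain=+50%
```

Note that a relative improvement is measured against the baseline score, so the same percentage can correspond to very different absolute mIoU differences on different datasets.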
Related papers
- SODAWideNet++: Combining Attention and Convolutions for Salient Object Detection [3.2586315449885106]
We propose a novel encoder-decoder-style neural network called SODAWideNet++ designed explicitly for Salient Object Detection.
Inspired by the vision transformer's ability to attain a global receptive field from the initial stages, we introduce the Attention Guided Long Range Feature Extraction (AGLRFE) module.
In contrast to the current paradigm of ImageNet pre-training, we modify 118K annotated images from the COCO semantic segmentation dataset by binarizing the annotations to pre-train the proposed model end-to-end.
arXiv Detail & Related papers (2024-08-29T15:51:06Z)
- General Object Foundation Model for Images and Videos at Scale [99.2806103051613]
We present GLEE, an object-level foundation model for locating and identifying objects in images and videos.
GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in the open world scenario.
We employ an image encoder, text encoder, and visual prompter to handle multi-modal inputs, enabling the model to simultaneously solve various object-centric downstream tasks.
arXiv Detail & Related papers (2023-12-14T17:26:00Z)
- Grounding Everything: Emerging Localization Properties in Vision-Language Transformers [51.260510447308306]
We show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning.
We propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path.
We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation.
arXiv Detail & Related papers (2023-12-01T19:06:12Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models [47.986381326169166]
We introduce SlotDiffusion, an object-centric Latent Diffusion Model (LDM) designed for both image and video data.
Thanks to the powerful modeling capacity of LDMs, SlotDiffusion surpasses previous slot models in unsupervised object segmentation and visual generation.
Our learned object features can be utilized by existing object-centric dynamics models, improving video prediction quality and downstream temporal reasoning tasks.
arXiv Detail & Related papers (2023-05-18T19:56:20Z)
- De-coupling and De-positioning Dense Self-supervised Learning [65.56679416475943]
Dense Self-Supervised Learning (SSL) methods address the limitations of using image-level feature representations when handling images with multiple objects.
We show that they suffer from coupling and positional bias, which arise from the receptive field increasing with layer depth and zero-padding.
We demonstrate the benefits of our method on COCO and on a new challenging benchmark, OpenImage-MINI, for object classification, semantic segmentation, and object detection.
arXiv Detail & Related papers (2023-03-29T18:07:25Z)
- Guided Slot Attention for Unsupervised Video Object Segmentation [16.69412563413671]
We propose a guided slot attention network to reinforce spatial structural information and obtain better foreground-background separation.
The proposed model achieves state-of-the-art performance on two popular datasets.
arXiv Detail & Related papers (2023-03-15T02:08:20Z)
- OSIC: A New One-Stage Image Captioner Coined [38.46732302316068]
We propose a novel One-Stage Image Captioner (OSIC) with dynamic multi-sight learning.
To obtain rich features, we use the Swin Transformer to calculate multi-level features.
To enhance the encoder's global modeling for captioning, we propose a new dual-dimensional refining module.
arXiv Detail & Related papers (2022-11-04T08:50:09Z)
- Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.