Shepherding Slots to Objects: Towards Stable and Robust Object-Centric
Learning
- URL: http://arxiv.org/abs/2303.17842v1
- Date: Fri, 31 Mar 2023 07:07:29 GMT
- Title: Shepherding Slots to Objects: Towards Stable and Robust Object-Centric
Learning
- Authors: Jinwoo Kim, Janghyuk Choi, Ho-Jin Choi, Seon Joo Kim
- Abstract summary: Single-view images carry less information about how to disentangle a given scene than videos or multi-view images do.
We introduce a novel OCL framework for single-view images, SLot Attention via SHepherding (SLASH), which consists of two simple-yet-effective modules on top of Slot Attention.
Our proposed method enables consistent learning of object-centric representation and achieves strong performance across four datasets.
- Score: 28.368429312400885
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Object-centric learning (OCL) aspires to a general and
compositional understanding of scenes by representing a scene as a
collection of object-centric representations. OCL has also been extended
to multi-view image and video datasets, where geometric or temporal
information across images supplies various data-driven inductive biases.
Single-view images carry less information about how to disentangle a given
scene than videos or multi-view images do. Hence, owing to the difficulty
of applying inductive biases, OCL for single-view images remains
challenging, resulting in inconsistent learning of object-centric
representations. To address this, we introduce a novel OCL framework for
single-view images, SLot Attention via SHepherding (SLASH), which consists
of two simple-yet-effective modules on top of Slot Attention. The new
modules, the Attention Refining Kernel (ARK) and the Intermediate Point
Predictor and Encoder (IPPE), respectively prevent slots from being
distracted by background noise and indicate where slots should focus,
facilitating the learning of object-centric representations. We also
propose a weakly semi-supervised approach for OCL, while our framework
requires no auxiliary annotations at inference time. Experiments show that
our method enables consistent learning of object-centric representations
and achieves strong performance across four datasets. Code is available at
https://github.com/object-understanding/SLASH.
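For intuition, here is a minimal NumPy sketch of one Slot Attention
iteration with an ARK-style refinement applied to the attention maps. The
box-blur kernel, the function names, and the omission of IPPE and of the
GRU/MLP slot update are all simplifying assumptions made here for
illustration, not the paper's exact design; see the official repository
for the real implementation.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def refine_attention(attn, H, W, kernel):
    # Smooth each slot's spatial attention map with a small 2D kernel,
    # a stand-in for ARK (the actual kernel design is an assumption here).
    K, N = attn.shape                       # K slots, N = H * W locations
    maps = attn.reshape(K, H, W)
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(maps, ((0, 0), (ph, ph), (pw, pw)), mode="edge")
    out = np.zeros_like(maps)
    for i in range(H):
        for j in range(W):
            patch = padded[:, i:i + kh, j:j + kw]
            out[:, i, j] = (patch * kernel).sum(axis=(1, 2))
    return out.reshape(K, N)

def slot_attention_step(slots, inputs, H, W, kernel, eps=1e-8):
    # One iteration: dot-product attention normalized over slots (so slots
    # compete for locations), ARK-style smoothing, weighted-mean update.
    D = slots.shape[1]
    logits = slots @ inputs.T / np.sqrt(D)  # (K, N)
    attn = softmax(logits, axis=0)          # compete over the slot axis
    attn = refine_attention(attn, H, W, kernel)
    attn = attn / (attn.sum(axis=1, keepdims=True) + eps)
    return attn @ inputs                    # updated slots, shape (K, D)

# Toy usage: random features on an 8x8 grid, 4 slots, a 3x3 box blur.
rng = np.random.default_rng(0)
H, W, D, K = 8, 8, 16, 4
feats = rng.normal(size=(H * W, D))
slots = rng.normal(size=(K, D))
box_blur = np.full((3, 3), 1.0 / 9.0)
slots = slot_attention_step(slots, feats, H, W, box_blur)
```

The intended effect of the smoothing is that isolated high-attention
pixels (e.g., background noise) are damped unless their neighborhood also
attends to the same slot, which is one plausible reading of how ARK keeps
slots from being distracted by the background.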
Related papers
- PooDLe: Pooled and dense self-supervised learning from naturalistic videos [32.656425302538835]
We propose a novel approach that combines an invariance-based SSL objective on pooled representations with a dense SSL objective.
We validate our approach on the BDD100K driving video dataset and the Walking Tours first-person video dataset.
arXiv Detail & Related papers (2024-08-20T21:40:48Z)
- Object-level Scene Deocclusion [92.39886029550286]
We present a new self-supervised PArallel visible-to-COmplete diffusion framework, named PACO, for object-level scene deocclusion.
To train PACO, we create a large-scale dataset with 500k samples to enable self-supervised learning.
Experiments on COCOA and various real-world scenes demonstrate the superior capability of PACO for scene deocclusion, surpassing the state of the art by a large margin.
arXiv Detail & Related papers (2024-06-11T20:34:10Z)
- Learning Object-Centric Representation via Reverse Hierarchy Guidance [73.05170419085796]
Object-Centric Learning (OCL) seeks to enable neural networks to identify individual objects in visual scenes.
RHGNet introduces a top-down pathway that operates differently during training and inference.
Our model achieves state-of-the-art performance on several commonly used datasets.
arXiv Detail & Related papers (2024-05-17T07:48:27Z)
- CrIBo: Self-Supervised Learning via Cross-Image Object-Level Bootstrapping [40.94237853380154]
We introduce a novel Cross-Image Object-Level Bootstrapping method tailored to enhance dense visual representation learning.
CrIBo emerges as a notably strong candidate for in-context learning, leveraging nearest-neighbor retrieval at test time.
arXiv Detail & Related papers (2023-10-11T19:57:51Z)
- Spotlight Attention: Robust Object-Centric Learning With a Spatial Locality Prior [88.9319150230121]
Object-centric vision aims to construct an explicit representation of the objects in a scene.
We incorporate a spatial-locality prior into state-of-the-art object-centric vision models.
We obtain significant improvements in segmenting objects in both synthetic and real-world datasets.
arXiv Detail & Related papers (2023-05-31T04:35:50Z)
- De-coupling and De-positioning Dense Self-supervised Learning [65.56679416475943]
Dense Self-Supervised Learning (SSL) methods address the limitations of using image-level feature representations when handling images with multiple objects.
We show that they suffer from coupling and positional bias, which arise from the receptive field increasing with layer depth and zero-padding.
We demonstrate the benefits of our method on COCO and on a new challenging benchmark, OpenImage-MINI, for object classification, semantic segmentation, and object detection.
arXiv Detail & Related papers (2023-03-29T18:07:25Z)
- Hyperbolic Contrastive Learning for Visual Representations beyond Objects [30.618032825306187]
We focus on learning representations for objects and scenes that preserve the structure among them.
Motivated by the observation that visually similar objects are close in the representation space, we argue that the scenes and objects should instead follow a hierarchical structure.
arXiv Detail & Related papers (2022-12-01T16:58:57Z)
- Self-Supervised Visual Representation Learning with Semantic Grouping [50.14703605659837]
We tackle the problem of learning visual representations from unlabeled scene-centric data.
We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
arXiv Detail & Related papers (2022-05-30T17:50:59Z)
- AutoRF: Learning 3D Object Radiance Fields from Single View Observations [17.289819674602295]
AutoRF is a new approach for learning neural 3D object representations in which each object in the training set is observed from only a single view.
We show that our method generalizes well to unseen objects, even across different datasets of challenging real-world street scenes.
arXiv Detail & Related papers (2022-04-07T17:13:39Z)
- Object discovery and representation networks [78.16003886427885]
We propose a self-supervised learning paradigm that discovers the structure encoded in priors by itself.
Our method, Odin, couples object discovery and representation networks to discover meaningful image segmentations without any supervision.
arXiv Detail & Related papers (2022-03-16T17:42:55Z)