Object-centric Learning with Cyclic Walks between Parts and Whole
- URL: http://arxiv.org/abs/2302.08023v2
- Date: Wed, 1 Nov 2023 09:08:27 GMT
- Title: Object-centric Learning with Cyclic Walks between Parts and Whole
- Authors: Ziyu Wang, Mike Zheng Shou, Mengmi Zhang
- Abstract summary: Learning object-centric representations from complex natural environments enables both humans and machines to reason from low-level perceptual features.
We propose cyclic walks between perceptual features extracted from vision transformers and object entities.
In contrast to object-centric models that attach a decoder for pixel-level or feature-level reconstruction, our cyclic walks provide strong learning signals.
- Score: 23.561434374097864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning object-centric representations from complex natural environments enables both humans and machines to reason from low-level perceptual features. To capture the compositional entities of a scene, we propose cyclic walks between perceptual features extracted from vision transformers and object entities. First, a slot-attention module interfaces with these perceptual features and produces a finite set of slot representations. These slots can bind to any object entities in the scene via inter-slot competition for attention. Next, we establish entity-feature correspondence with cyclic walks along paths of high transition probability, based on the pairwise similarity between perceptual features (the "parts") and slot-bound object representations (the "whole"). The whole is greater than its parts, and the parts constitute the whole. These part-whole interactions form cycle consistencies that serve as supervisory signals to train the slot-attention module. Our rigorous experiments on seven image datasets across three unsupervised tasks demonstrate that networks trained with our cyclic walks can disentangle foregrounds and backgrounds, discover objects, and segment semantic objects in complex scenes. In contrast to object-centric models that attach a decoder for pixel-level or feature-level reconstruction, our cyclic walks provide strong learning signals, avoiding computation overhead and improving memory efficiency. Our source code and data are available at: https://github.com/ZhangLab-DeepNeuroCogLab/Parts-Whole-Object-Centric-Learning/
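The part-to-whole-to-part walk described above can be made concrete with a short sketch. The following is a minimal, illustrative implementation under our own assumptions (L2-normalized features and slots, a softmax temperature `tau`, and illustrative tensor names); it is not the authors' exact code, which is available at the repository linked above.

```python
import torch
import torch.nn.functional as F


def cyclic_walk_loss(feats, slots, tau=0.1):
    """Part -> whole -> part cyclic walk loss (illustrative sketch).

    feats: (B, N, D) perceptual features from a vision transformer ("parts").
    slots: (B, K, D) slot representations bound to objects ("whole").
    A walk steps feature -> slot -> feature; the round-trip transition
    matrix is pushed toward the identity so each feature returns to itself.
    """
    feats = F.normalize(feats, dim=-1)
    slots = F.normalize(slots, dim=-1)

    sim = torch.einsum('bnd,bkd->bnk', feats, slots) / tau  # pairwise similarity
    p_part2whole = sim.softmax(dim=-1)                      # (B, N, K) transitions
    p_whole2part = sim.transpose(1, 2).softmax(dim=-1)      # (B, K, N) transitions

    roundtrip = torch.bmm(p_part2whole, p_whole2part)       # (B, N, N) round trip
    target = torch.arange(feats.size(1), device=feats.device)
    target = target.unsqueeze(0).expand(feats.size(0), -1)

    # cross-entropy against the identity enforces cycle consistency
    log_rt = roundtrip.clamp_min(1e-8).log()
    return F.nll_loss(log_rt.flatten(0, 1), target.flatten())
```

A whole-to-part-to-whole walk can be formed symmetrically by multiplying the two transition matrices in the opposite order and pushing each slot back to itself; the two losses can then be summed.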
Related papers
- Object Discovery from Motion-Guided Tokens [50.988525184497334]
We augment the auto-encoder representation-learning framework with motion guidance and mid-level feature tokenization.
Our approach enables the emergence of interpretable object-specific mid-level features.
arXiv Detail & Related papers (2023-03-27T19:14:00Z)
- Framework-agnostic Semantically-aware Global Reasoning for Segmentation [29.69187816377079]
We propose a component that learns to project image features into latent representations and reason between them.
Our design encourages the latent regions to represent semantic concepts by ensuring that the activated regions are spatially disjoint.
Our latent tokens are semantically interpretable and diverse and provide a rich set of features that can be transferred to downstream tasks.
arXiv Detail & Related papers (2022-12-06T21:42:05Z)
- Robust and Controllable Object-Centric Learning through Energy-based Models [95.68748828339059]
Ours is a conceptually simple and general approach to learning object-centric representations through an energy-based model.
We show that our approach can be easily integrated into existing architectures and can effectively extract high-quality object-centric representations.
arXiv Detail & Related papers (2022-10-11T15:11:15Z)
- Object Scene Representation Transformer [56.40544849442227]
We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis.
OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods.
It is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder.
arXiv Detail & Related papers (2022-06-14T15:40:47Z)
- Self-Supervised Visual Representation Learning with Semantic Grouping [50.14703605659837]
We tackle the problem of learning visual representations from unlabeled scene-centric data.
We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
arXiv Detail & Related papers (2022-05-30T17:50:59Z)
- Complex-Valued Autoencoders for Object Discovery [62.26260974933819]
We propose a distributed approach to object-centric representations: the Complex AutoEncoder.
We show that this simple and efficient approach achieves better reconstruction performance than an equivalent real-valued autoencoder on simple multi-object datasets.
We also show that it achieves competitive unsupervised object discovery performance to a SlotAttention model on two datasets, and manages to disentangle objects in a third dataset where SlotAttention fails - all while being 7-70 times faster to train.
arXiv Detail & Related papers (2022-04-05T09:25:28Z)
- Sim2Real Object-Centric Keypoint Detection and Description [40.58367357980036]
Keypoint detection and description play a central role in computer vision.
We propose an object-centric formulation, which further requires identifying which object each interest point belongs to.
We develop a sim2real contrastive learning mechanism that can generalize the model trained in simulation to real-world applications.
arXiv Detail & Related papers (2022-02-01T15:00:20Z)
- Where2Act: From Pixels to Actions for Articulated 3D Objects [54.19638599501286]
We extract highly localized actionable information related to elementary actions such as pushing or pulling for articulated objects with movable parts.
We propose a learning-from-interaction framework with an online data sampling strategy that allows us to train the network in simulation.
Our learned models even transfer to real-world data.
arXiv Detail & Related papers (2021-01-07T18:56:38Z)
- Object-Centric Learning with Slot Attention [43.684193749891506]
We present the Slot Attention module, an architectural component that interfaces with perceptual representations.
Slot Attention produces task-dependent abstract representations which we call slots.
We empirically demonstrate that Slot Attention can extract object-centric representations that enable generalization to unseen compositions (a minimal sketch of the slot-attention update appears after this list).
arXiv Detail & Related papers (2020-06-26T15:31:57Z)
- A Deep Learning Approach to Object Affordance Segmentation [31.221897360610114]
We design an autoencoder that infers pixel-wise affordance labels in both videos and static images.
Our model sidesteps the need for object labels and bounding boxes by using a soft-attention mechanism.
We show that our model achieves competitive results compared to strongly supervised methods on SOR3D-AFF.
arXiv Detail & Related papers (2020-04-18T15:34:41Z)
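As a companion to the Slot Attention entry above, here is a minimal single-head sketch of the slot-attention update (after Locatello et al., 2020). Hyperparameters and names are illustrative assumptions, not the original implementation.

```python
import torch
import torch.nn as nn


class SlotAttention(nn.Module):
    """Minimal single-head Slot Attention sketch.

    Slots compete for input features via a softmax over the slot axis,
    then each slot is updated from its attention-weighted input mean.
    """

    def __init__(self, dim, num_slots=7, iters=3, eps=1e-8):
        super().__init__()
        self.num_slots, self.iters, self.eps = num_slots, iters, eps
        self.scale = dim ** -0.5
        self.mu = nn.Parameter(torch.zeros(1, 1, dim))         # slot init mean
        self.log_sigma = nn.Parameter(torch.zeros(1, 1, dim))  # slot init spread
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in = nn.LayerNorm(dim)
        self.norm_slot = nn.LayerNorm(dim)

    def forward(self, x):                                      # x: (B, N, dim)
        B, N, D = x.shape
        x = self.norm_in(x)
        k, v = self.to_k(x), self.to_v(x)
        slots = self.mu + self.log_sigma.exp() * torch.randn(
            B, self.num_slots, D, device=x.device)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slot(slots))
            attn = torch.einsum('bnd,bkd->bnk', k, q) * self.scale
            attn = attn.softmax(dim=-1) + self.eps              # compete over slots
            attn = attn / attn.sum(dim=1, keepdim=True)         # normalize over inputs
            updates = torch.einsum('bnk,bnd->bkd', attn, v)     # weighted means
            slots = self.gru(updates.reshape(-1, D),
                             slots.reshape(-1, D)).reshape(B, self.num_slots, D)
        return slots
```

In the cyclic-walk paper, the slots produced by such a module play the role of the "whole" entities that the walk loss sketched earlier supervises, in place of a reconstruction decoder.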