Learning Global Object-Centric Representations via Disentangled Slot Attention
- URL: http://arxiv.org/abs/2410.18809v1
- Date: Thu, 24 Oct 2024 14:57:00 GMT
- Title: Learning Global Object-Centric Representations via Disentangled Slot Attention
- Authors: Tonglin Chen, Yinxuan Huang, Zhimeng Shen, Jinghao Huang, Bin Li, Xiangyang Xue
- Abstract summary: This paper introduces a novel object-centric learning method to empower AI systems with human-like capabilities to identify objects across scenes and generate diverse scenes containing specific objects by learning a set of global object-centric representations.
Experimental results substantiate the efficacy of the proposed method, demonstrating remarkable proficiency in global object-centric representation learning, object identification, scene generation with specific objects and scene decomposition.
- Score: 38.78205074748021
- Abstract: Humans can discern scene-independent features of objects across various environments, allowing them to swiftly identify objects amidst changing factors such as lighting, perspective, size, and position, and to imagine complete images of the same object in diverse settings. Existing object-centric learning methods only extract scene-dependent object-centric representations, lacking the human ability to identify the same object across scenes. Moreover, some existing methods discard individual object generation capabilities in order to handle complex scenes. This paper introduces a novel object-centric learning method to empower AI systems with human-like capabilities to identify objects across scenes and to generate diverse scenes containing specific objects by learning a set of global object-centric representations. To learn global object-centric representations that encapsulate globally invariant attributes of objects (i.e., the complete appearance and shape), this paper designs a Disentangled Slot Attention module to convert the scene features into scene-dependent attributes (such as scale, position and orientation) and scene-independent representations (i.e., appearance and shape). Experimental results substantiate the efficacy of the proposed method, demonstrating remarkable proficiency in global object-centric representation learning, object identification, scene generation with specific objects and scene decomposition.
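The paper does not provide code for the Disentangled Slot Attention module, but the core idea (iterative slot attention over scene features, followed by a split of each slot into scene-dependent pose attributes and a scene-independent identity representation) can be sketched numerically. The sketch below is a hypothetical, simplified NumPy illustration: the function name, the learned-projection-free attention update, and the fixed `d_pose` split are assumptions for clarity, not the authors' actual architecture.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def disentangled_slot_attention(feats, n_slots=4, n_iters=3, d_pose=4, seed=0):
    """Illustrative sketch: slots compete for input features, then each slot
    is split into scene-dependent (pose) and scene-independent (identity) parts.

    feats : (n_features, d) array of flattened scene features.
    Returns pose (n_slots, d_pose), identity (n_slots, d - d_pose),
    and the final attention map (n_features, n_slots).
    """
    rng = np.random.default_rng(seed)
    n, d = feats.shape
    slots = rng.normal(size=(n_slots, d))  # random slot initialization
    for _ in range(n_iters):
        # Softmax over slots: each feature distributes its attention
        # across slots, so slots compete to explain the scene.
        attn = softmax(feats @ slots.T / np.sqrt(d), axis=1)
        # Normalize per slot to form a weighted mean over features.
        attn = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = attn.T @ feats
    # Disentangling step (schematic): the first d_pose dimensions stand in
    # for scene-dependent attributes (scale, position, orientation), the
    # rest for the scene-independent appearance/shape representation.
    pose = slots[:, :d_pose]
    identity = slots[:, d_pose:]
    return pose, identity, attn
```

In the paper this split is learned rather than a fixed slicing of dimensions, and the slot update uses learned query/key/value projections with a GRU; the sketch only conveys the attention-then-disentangle structure.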
Related papers
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- Variational Inference for Scalable 3D Object-centric Learning
We tackle the task of scalable unsupervised object-centric representation learning on 3D scenes.
Existing approaches to object-centric representation learning show limitations in generalizing to larger scenes.
We propose to learn view-invariant 3D object representations in localized object coordinate systems.
arXiv Detail & Related papers (2023-09-25T10:23:40Z)
- AnyDoor: Zero-shot Object-level Image Customization [63.44307304097742]
This work presents AnyDoor, a diffusion-based image generator with the power to teleport target objects to new scenes at user-specified locations.
Our model is trained only once and effortlessly generalizes to diverse object-scene combinations at the inference stage.
arXiv Detail & Related papers (2023-07-18T17:59:02Z)
- Compositional Scene Modeling with Global Object-Centric Representations [44.43366905943199]
Humans can easily identify the same object, even under occlusion, by completing the occluded parts based on its canonical image in memory.
This paper proposes a compositional scene modeling method to infer global representations of canonical images of objects without any supervision.
arXiv Detail & Related papers (2022-11-21T14:36:36Z)
- Discovering Objects that Can Move [55.743225595012966]
We study the problem of object discovery -- separating objects from the background without manual labels.
Existing approaches utilize appearance cues, such as color, texture, and location, to group pixels into object-like regions.
We choose to focus on dynamic objects -- entities that can move independently in the world.
arXiv Detail & Related papers (2022-03-18T21:13:56Z)
- Object Pursuit: Building a Space of Objects via Discriminative Weight Generation [23.85039747700698]
We propose a framework to continuously learn object-centric representations for visual learning and understanding.
We leverage interactions to sample diverse variations of an object and the corresponding training signals while learning the object-centric representations.
We perform an extensive study of the key features of the proposed framework and analyze the characteristics of the learned representations.
arXiv Detail & Related papers (2021-12-15T08:25:30Z)
- Object-Centric Representation Learning with Generative Spatial-Temporal Factorization [5.403549896734018]
We propose Dynamics-aware Multi-Object Network (DyMON), a method that broadens the scope of multi-view object-centric representation learning to dynamic scenes.
We show that DyMON learns to factorize the entangled effects of observer motions and scene object dynamics from a sequence of observations.
We also show that the factorized scene representations support querying about a single object by space and time independently.
arXiv Detail & Related papers (2021-11-09T20:04:16Z)
- Visiting the Invisible: Layer-by-Layer Completed Scene Decomposition [57.088328223220934]
Existing scene understanding systems mainly focus on recognizing the visible parts of a scene, ignoring the intact appearance of physical objects in the real world.
In this work, we propose a higher-level scene understanding system to tackle both visible and invisible parts of objects and backgrounds in a given scene.
arXiv Detail & Related papers (2021-04-12T11:37:23Z)
- ROOTS: Object-Centric Representation and Rendering of 3D Scenes [28.24758046060324]
A crucial ability of human intelligence is to build up models of individual 3D objects from partial scene observations.
Recent works achieve object-centric generation but without the ability to infer the representation.
We propose a probabilistic generative model for learning to build modular and compositional 3D object models.
arXiv Detail & Related papers (2020-06-11T00:42:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.