Object-Centric Representation Learning with Generative Spatial-Temporal
Factorization
- URL: http://arxiv.org/abs/2111.05393v1
- Date: Tue, 9 Nov 2021 20:04:16 GMT
- Title: Object-Centric Representation Learning with Generative Spatial-Temporal
Factorization
- Authors: Li Nanbo, Muhammad Ahmed Raza, Hu Wenbin, Zhaole Sun, Robert B. Fisher
- Abstract summary: We propose Dynamics-aware Multi-Object Network (DyMON), a method that broadens the scope of multi-view object-centric representation learning to dynamic scenes.
We show that DyMON learns to factorize the entangled effects of observer motions and scene object dynamics from a sequence of observations.
We also show that the factorized scene representations support querying about a single object by space and time independently.
- Score: 5.403549896734018
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Learning object-centric scene representations is essential for attaining
structural understanding and abstraction of complex scenes. Yet, as current
approaches for unsupervised object-centric representation learning are built
upon either a stationary observer assumption or a static scene assumption, they
often: i) suffer from single-view spatial ambiguities, or ii) infer incorrect or
inaccurate object representations for dynamic scenes. To address this, we
propose Dynamics-aware Multi-Object Network (DyMON), a method that broadens the
scope of multi-view object-centric representation learning to dynamic scenes.
We train DyMON on multi-view-dynamic-scene data and show that DyMON learns --
without supervision -- to factorize the entangled effects of observer motions
and scene object dynamics from a sequence of observations, and constructs scene
object spatial representations suitable for rendering at arbitrary times
(querying across time) and from arbitrary viewpoints (querying across space).
We also show that the factorized scene representations (w.r.t. objects) support
querying about a single object by space and time independently.
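The space-and-time querying the abstract describes can be made concrete with a toy sketch. Everything below (class name, shapes, the 7-D viewpoint vector) is an illustrative assumption, not DyMON's actual architecture; it only shows the interface of a factorized representation that renders K object latents for an arbitrary (viewpoint, time) query:

```python
import torch
import torch.nn as nn

class FactorizedSceneDecoder(nn.Module):
    """Hypothetical decoder over K object latents, queried by space and time."""
    def __init__(self, z_dim=64, view_dim=7, hidden=128):
        super().__init__()
        # view_dim=7 mimics GQN-style camera vectors; purely an assumption here.
        self.net = nn.Sequential(
            nn.Linear(z_dim + view_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # toy per-object RGB + mixing weight
        )

    def forward(self, z_objects, viewpoint, t):
        # z_objects: (K, z_dim) object latents; viewpoint: (view_dim,); t: scalar
        K = z_objects.shape[0]
        query = torch.cat([viewpoint, torch.tensor([t])]).expand(K, -1)
        out = self.net(torch.cat([z_objects, query], dim=-1))    # (K, 4)
        weights = torch.softmax(out[:, 3], dim=0)                # mix over objects
        return (weights.unsqueeze(-1) * out[:, :3]).sum(dim=0)   # rendered color

decoder = FactorizedSceneDecoder()
z = torch.randn(5, 64)                  # 5 object latents inferred from a sequence
view = torch.randn(7)                   # arbitrary viewpoint (query across space)
color_now = decoder(z, view, t=0.25)    # arbitrary time (query across time)
color_later = decoder(z, view, t=0.75)  # same objects, same view, later time
print(color_now.shape)
```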
Related papers
- Learning Global Object-Centric Representations via Disentangled Slot Attention [38.78205074748021]
This paper introduces a novel object-centric learning method that learns a set of global object-centric representations, empowering AI systems with human-like capabilities to identify objects across scenes and generate diverse scenes containing specific objects.
Experimental results substantiate the efficacy of the proposed method, demonstrating remarkable proficiency in global object-centric representation learning, object identification, scene generation with specific objects and scene decomposition.
arXiv Detail & Related papers (2024-10-24T14:57:00Z)
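For context on the entry above: the slot-attention mechanism it disentangles (Locatello et al., 2020) reduces to a few lines. The sketch below drops the learned projections, GRU, and layer norms of the real algorithm and keeps only the defining step, softmax-over-slots competition:

```python
import torch

def slot_attention(inputs, slots, iters=3, eps=1e-8):
    # inputs: (N, D) encoded image features; slots: (K, D) object slots.
    K, D = slots.shape
    for _ in range(iters):
        # Queries come from slots, keys/values from inputs (projections omitted).
        attn = torch.softmax(slots @ inputs.T / D ** 0.5, dim=0)  # softmax over SLOTS
        attn = attn / (attn.sum(dim=1, keepdim=True) + eps)       # weighted mean
        updates = attn @ inputs                                   # (K, D)
        slots = slots + updates   # the real algorithm uses a GRU + MLP update
    return slots

features = torch.randn(16 * 16, 32)   # e.g. a flattened CNN feature map
slots = torch.randn(4, 32)            # 4 slots compete to explain the features
print(slot_attention(features, slots).shape)   # torch.Size([4, 32])
```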
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- Variational Inference for Scalable 3D Object-centric Learning [19.445804699433353]
We tackle the task of scalable unsupervised object-centric representation learning on 3D scenes.
Existing approaches to object-centric representation learning show limitations in generalizing to larger scenes.
We propose to learn view-invariant 3D object representations in localized object coordinate systems.
arXiv Detail & Related papers (2023-09-25T10:23:40Z)
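The "localized object coordinate systems" in the entry above amount to describing geometry relative to each object's own pose, which is what makes the representation view-invariant. A minimal sketch, where the pose parameterization (position plus yaw) is an assumption:

```python
import numpy as np

def world_to_object(points, obj_position, obj_yaw):
    """points: (N, 3) world coords; returns (N, 3) in the object's local frame."""
    c, s = np.cos(obj_yaw), np.sin(obj_yaw)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])     # object's rotation about the up-axis
    return (points - obj_position) @ R  # translate, then rotate into the frame

pts = np.random.randn(8, 3)
local = world_to_object(pts, obj_position=np.array([2.0, -1.0, 0.0]), obj_yaw=0.5)
print(local.shape)  # (8, 3): same points, now invariant to the object's placement
```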
- Hyperbolic Contrastive Learning for Visual Representations beyond Objects [30.618032825306187]
We focus on learning representations for objects and scenes that preserve the structure among them.
Motivated by the observation that visually similar objects are close in the representation space, we argue that scenes and objects should instead follow a hierarchical structure.
arXiv Detail & Related papers (2022-12-01T16:58:57Z)
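The hierarchy argued for above is why hyperbolic geometry is used: distances in the Poincaré ball grow rapidly toward the boundary, which suits tree-like scene-object structure. A minimal sketch of the Poincaré distance that typically replaces Euclidean distance in such contrastive objectives (the exact loss form varies by paper):

```python
import torch

def poincare_dist(u, v, eps=1e-6):
    # u, v: points strictly inside the unit ball, shape (..., D)
    sq = torch.sum((u - v) ** 2, dim=-1)
    un = torch.clamp(1 - torch.sum(u * u, dim=-1), min=eps)
    vn = torch.clamp(1 - torch.sum(v * v, dim=-1), min=eps)
    return torch.acosh(1 + 2 * sq / (un * vn))

a = torch.rand(4, 8) * 0.1   # keep embeddings safely inside the ball
b = torch.rand(4, 8) * 0.1
print(poincare_dist(a, b))   # smaller = more similar / closer in the hierarchy
```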
- Compositional Scene Modeling with Global Object-Centric Representations [44.43366905943199]
Humans can easily identify the same object, even under occlusion, by completing the occluded parts from its canonical image in memory.
This paper proposes a compositional scene modeling method to infer global representations of canonical images of objects without any supervision.
arXiv Detail & Related papers (2022-11-21T14:36:36Z)
- Robust and Controllable Object-Centric Learning through Energy-based Models [95.68748828339059]
Ours is a conceptually simple and general approach to learning object-centric representations through an energy-based model.
We show that it can be easily integrated into existing architectures and can effectively extract high-quality object-centric representations.
arXiv Detail & Related papers (2022-10-11T15:11:15Z)
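The energy-based recipe above can be summarized generically: a network assigns a scalar energy to (features, slots) pairs, and slots are inferred by descending that energy. The toy architecture below is a stand-in, not the paper's model:

```python
import torch
import torch.nn as nn

class Energy(nn.Module):
    """Toy energy network: low energy = slots explain the features well."""
    def __init__(self, d=32, k=4):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d * (k + 1), 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feats, slots):
        # feats: (d,) pooled image features; slots: (k, d) object latents
        return self.f(torch.cat([feats, slots.flatten()])).squeeze()

E = Energy()
feats = torch.randn(32)
slots = torch.randn(4, 32, requires_grad=True)
for _ in range(20):                       # inference = gradient descent on energy
    e = E(feats, slots)
    (g,) = torch.autograd.grad(e, slots)
    slots = (slots - 0.1 * g).detach().requires_grad_(True)
print(float(E(feats, slots)))             # energy after refinement
```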
- Discovering Objects that Can Move [55.743225595012966]
We study the problem of object discovery -- separating objects from the background without manual labels.
Existing approaches utilize appearance cues, such as color, texture, and location, to group pixels into object-like regions.
We choose to focus on dynamic objects -- entities that can move independently in the world.
arXiv Detail & Related papers (2022-03-18T21:13:56Z)
- Learning Object-Centric Representations of Multi-Object Scenes from Multiple Views [9.556376932449187]
Multi-View and Multi-Object Network (MulMON) is a method for learning accurate, object-centric representations of multi-object scenes by leveraging multiple views.
We show that MulMON better resolves spatial ambiguities than single-view methods.
arXiv Detail & Related papers (2021-11-13T13:54:28Z)
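MulMON's accumulation structure (one set of object latents refined view by view) can be shown schematically. The update below is a deliberate placeholder, a running average, where the paper uses amortized variational inference, and `encode_view` is a hypothetical stand-in encoder:

```python
import torch

def encode_view(image, prev_slots):
    # Hypothetical per-view encoder; a real model conditions on the image
    # and the previous slots. Here it just perturbs them for illustration.
    return prev_slots + 0.1 * torch.randn_like(prev_slots)

def aggregate_views(images, k=4, d=32):
    slots = torch.zeros(k, d)                    # prior over object latents
    for i, img in enumerate(images, start=1):
        estimate = encode_view(img, slots)
        slots = slots + (estimate - slots) / i   # each view refines the same slots
    return slots

views = [torch.randn(3, 64, 64) for _ in range(5)]
print(aggregate_views(views).shape)              # torch.Size([4, 32])
```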
- SIMONe: View-Invariant, Temporally-Abstracted Object Representations via Unsupervised Video Decomposition [69.90530987240899]
We present an unsupervised variational approach to this problem.
Our model learns to infer two sets of latent representations from RGB video input alone.
It represents object attributes in an allocentric manner which does not depend on viewpoint.
arXiv Detail & Related papers (2021-06-07T17:59:23Z)
- Self-Supervised Representation Learning from Flow Equivariance [97.13056332559526]
We present a new self-supervised representation learning framework that can be deployed directly on a video stream of complex scenes.
Our representations, learned from high-resolution raw video, can be readily used for downstream tasks on static images.
arXiv Detail & Related papers (2021-01-16T23:44:09Z)
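The flow-equivariance idea above: encoding a warped frame should give the same result as warping the encoded frame. In the minimal sketch below, a global integer shift stands in for optical flow, since convolutions make the property easy to see:

```python
import torch
import torch.nn as nn

encoder = nn.Conv2d(3, 8, kernel_size=3, padding=1)   # toy feature extractor

def equivariance_loss(frame, dx=2, dy=3):
    shifted = torch.roll(frame, shifts=(dy, dx), dims=(2, 3))       # "flow" warp
    f_then_warp = torch.roll(encoder(frame), shifts=(dy, dx), dims=(2, 3))
    warp_then_f = encoder(shifted)
    return ((f_then_warp - warp_then_f) ** 2).mean()

x = torch.randn(1, 3, 32, 32)
# Small value: convolutions are shift-equivariant up to border effects.
print(float(equivariance_loss(x)))
```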
- Benchmarking Unsupervised Object Representations for Video Sequences [111.81492107649889]
We compare the perceptual abilities of four object-centric approaches: ViMON, OP3, TBA and SCALOR.
Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking.
Our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
arXiv Detail & Related papers (2020-06-12T09:37:24Z)
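Benchmarks like the one above typically score object segmentation with the Adjusted Rand Index over per-pixel cluster assignments, often on foreground pixels only (FG-ARI). A minimal sketch, assuming scikit-learn:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

gt   = np.array([0, 0, 1, 1, 2, 2, 2, 0])    # ground-truth object ids per pixel
pred = np.array([1, 1, 0, 0, 2, 2, 2, 1])    # predicted segment ids per pixel
fg = gt != 0                                 # drop background (id 0), as FG-ARI does
print(adjusted_rand_score(gt[fg], pred[fg])) # 1.0 -- perfect up to relabeling
```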