Learning Object-Centric Representations of Multi-Object Scenes from Multiple Views
- URL: http://arxiv.org/abs/2111.07117v1
- Date: Sat, 13 Nov 2021 13:54:28 GMT
- Title: Learning Object-Centric Representations of Multi-Object Scenes from Multiple Views
- Authors: Li Nanbo, Cian Eastwood, Robert B. Fisher
- Abstract summary: Multi-View and Multi-Object Network (MulMON) is a method for learning accurate, object-centric representations of multi-object scenes by leveraging multiple views.
We show that MulMON better resolves spatial ambiguities than single-view methods.
- Score: 9.556376932449187
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Learning object-centric representations of multi-object scenes is a promising
approach towards machine intelligence, facilitating high-level reasoning and
control from visual sensory data. However, current approaches for unsupervised
object-centric scene representation are incapable of aggregating information
from multiple observations of a scene. As a result, these "single-view" methods
form their representations of a 3D scene based only on a single 2D observation
(view). Naturally, this leads to several inaccuracies, with these methods
falling victim to single-view spatial ambiguities. To address this, we propose
the Multi-View and Multi-Object Network (MulMON) -- a method for learning
accurate, object-centric representations of multi-object scenes by leveraging
multiple views. In order to sidestep the main technical difficulty of the
multi-object-multi-view scenario -- maintaining object correspondences across
views -- MulMON iteratively updates the latent object representations for a
scene over multiple views. To ensure that these iterative updates do indeed
aggregate spatial information to form a complete 3D scene understanding, MulMON
is asked to predict the appearance of the scene from novel viewpoints during
training. Through experiments, we show that MulMON better-resolves spatial
ambiguities than single-view methods -- learning more accurate and disentangled
object representations -- and also achieves new functionality in predicting
object segmentations for novel viewpoints.
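Since the abstract describes MulMON's iterative, multi-view update of latent object representations only in prose, the following is a minimal sketch of that idea under stated assumptions. It is not the authors' implementation: every module, name, and dimension below is hypothetical, chosen only to show how per-object latents can be refined view by view so that object correspondences persist across views.

```python
import torch
import torch.nn as nn

class IterativeMultiViewEncoder(nn.Module):
    """Hypothetical sketch of a MulMON-style update loop: K per-object latent
    codes are refined once per view, rather than re-inferred from scratch, so
    object-slot correspondences are maintained across views."""

    def __init__(self, num_slots=5, latent_dim=16, feat_dim=64):
        super().__init__()
        self.num_slots = num_slots
        self.latent_dim = latent_dim
        # Toy encoder: flattened 64x64 RGB image plus a 6-DoF camera pose.
        self.view_encoder = nn.Linear(3 * 64 * 64 + 6, feat_dim)
        # Recurrent per-slot refinement of the object latents.
        self.update = nn.GRUCell(feat_dim, latent_dim)

    def forward(self, images, poses):
        # images: (V, 3, 64, 64); poses: (V, 6) -- V observations of one scene.
        z = torch.zeros(self.num_slots, self.latent_dim)  # initial object latents
        for img, pose in zip(images, poses):
            obs = torch.cat([img.flatten(), pose])  # condition on view + viewpoint
            feat = torch.relu(self.view_encoder(obs))
            # Update every slot with evidence from this view; slots are never
            # re-initialised, which is what keeps correspondences stable.
            z = self.update(feat.expand(self.num_slots, -1), z)
        return z  # aggregated object-centric scene representation

# Training signal in the same spirit as the abstract: decode z from a held-out
# camera pose and penalise the novel-view reconstruction error (decoder omitted).
```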
Related papers
- Unsupervised Object-Centric Learning from Multiple Unspecified Viewpoints [45.88397367354284]
We consider a novel problem of learning compositional scene representations from multiple unspecified viewpoints without using any supervision.
We propose a deep generative model which separates latent representations into a viewpoint-independent part and a viewpoint-dependent part to solve this problem.
Experiments on several specifically designed synthetic datasets have shown that the proposed method can effectively learn from multiple unspecified viewpoints (a toy sketch of this latent split follows this entry).
arXiv Detail & Related papers (2024-01-03T15:09:25Z)
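The entry above splits latent representations into a viewpoint-independent part and a viewpoint-dependent part. As a rough illustration only (not the paper's model; all modules and shapes are assumptions), a decoder could combine the two parts like this:

```python
import torch
import torch.nn as nn

class FactorisedSceneDecoder(nn.Module):
    """Illustrative sketch: viewpoint-independent object latents carry scene
    content, one viewpoint-dependent latent is drawn per observation, and the
    decoder pairs them to render each view."""

    def __init__(self, content_dim=16, view_dim=4, out_dim=3 * 64 * 64):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(content_dim + view_dim, 128),
            nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, z_content, z_views):
        # z_content: (K, content_dim), shared by all views of the scene.
        # z_views:   (V, view_dim), one latent per unspecified viewpoint.
        images = []
        for z_v in z_views:
            pairs = torch.cat([z_content, z_v.expand(len(z_content), -1)], dim=1)
            # Crude compositor: sum the K per-object renderings into one image.
            images.append(self.decoder(pairs).sum(dim=0))
        return torch.stack(images)  # (V, 3*64*64)
```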
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- Variational Inference for Scalable 3D Object-centric Learning [19.445804699433353]
We tackle the task of scalable unsupervised object-centric representation learning on 3D scenes.
Existing approaches to object-centric representation learning show limitations in generalizing to larger scenes.
We propose to learn view-invariant 3D object representations in localized object coordinate systems.
arXiv Detail & Related papers (2023-09-25T10:23:40Z)
- MMRDN: Consistent Representation for Multi-View Manipulation Relationship Detection in Object-Stacked Scenes [62.20046129613934]
We propose a novel multi-view fusion framework, namely the multi-view MRD network (MMRDN).
We project the 2D data from different views into a common hidden space and fit the embeddings with a set of von Mises-Fisher distributions.
We select a set of $K$ Maximum Vertical Neighbors (KMVN) points from the point cloud of each object pair, which encodes the relative position of these two objects (see the sketch after this entry).
arXiv Detail & Related papers (2023-04-25T05:55:29Z)
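As a rough illustration of the two MMRDN steps summarised above, the snippet below normalises multi-view embeddings onto the unit sphere, where a von Mises-Fisher distribution is defined, and selects the points of a cloud with the largest vertical coordinate as one possible reading of KMVN selection. Both functions are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def vmf_mean_direction(embeddings):
    """Toy von Mises-Fisher fit: project embeddings from all views onto the
    unit sphere and take the normalised average as the mean direction."""
    unit = F.normalize(embeddings, dim=-1)
    return F.normalize(unit.mean(dim=0), dim=-1)

def k_max_vertical_neighbors(points, k=32):
    """One possible reading of KMVN: keep the k points of an object pair's
    cloud with the largest vertical (z) coordinate, a cue for relative
    position such as which object rests on which."""
    idx = torch.topk(points[:, 2], k=min(k, len(points))).indices
    return points[idx]
```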
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
- A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving.
We present SimMOD, a Simple baseline for Multi-camera Object Detection.
We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z)
- AutoRF: Learning 3D Object Radiance Fields from Single View Observations [17.289819674602295]
AutoRF is a new approach for learning neural 3D object representations where each object in the training set is observed by only a single view.
We show that our method generalizes well to unseen objects, even across different datasets of challenging real-world street scenes.
arXiv Detail & Related papers (2022-04-07T17:13:39Z)
- Unsupervised Learning of Compositional Scene Representations from Multiple Unspecified Viewpoints [41.07379505694274]
We consider a novel problem of learning compositional scene representations from multiple unspecified viewpoints without using any supervision.
We propose a deep generative model which separates latent representations into a viewpoint-independent part and a viewpoint-dependent part to solve this problem.
Experiments on several specifically designed synthetic datasets have shown that the proposed method is able to effectively learn from multiple unspecified viewpoints.
arXiv Detail & Related papers (2021-12-07T08:45:21Z)
- Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)
- Object-Centric Representation Learning with Generative Spatial-Temporal Factorization [5.403549896734018]
We propose Dynamics-aware Multi-Object Network (DyMON), a method that broadens the scope of multi-view object-centric representation learning to dynamic scenes.
We show that DyMON learns to factorize the entangled effects of observer motions and scene object dynamics from a sequence of observations.
We also show that the factorized scene representations support querying about a single object by space and time independently.
arXiv Detail & Related papers (2021-11-09T20:04:16Z)
- Self-Supervised Multi-View Learning via Auto-Encoding 3D Transformations [61.870882736758624]
We propose a novel self-supervised paradigm to learn Multi-View Transformation Equivariant Representations (MV-TER).
Specifically, we perform a 3D transformation on a 3D object, and obtain multiple views before and after the transformation via projection.
Then, we self-train a representation to capture the intrinsic 3D object representation by decoding 3D transformation parameters from the fused feature representations of multiple views before and after the transformation (see the sketch below).
arXiv Detail & Related papers (2021-03-01T06:24:17Z)
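To make the MV-TER entry's training signal concrete, here is a minimal sketch under stated assumptions: encode and fuse the views rendered before and after a known 3D transformation, regress the transformation parameters from the fused features, and use the parameter error as the self-supervised loss. All names, sizes, and the 6-parameter transform are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformationDecoder(nn.Module):
    """Illustrative MV-TER-style regressor: recover the parameters of the 3D
    transformation applied to an object from multi-view features taken before
    and after the transformation."""

    def __init__(self, feat_dim=128, n_params=6):  # e.g. 3 rotation + 3 translation
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(3 * 32 * 32, feat_dim), nn.ReLU())
        self.regressor = nn.Linear(2 * feat_dim, n_params)

    def forward(self, views_before, views_after):
        # views_*: (V, 3, 32, 32) -- V projections of the same object.
        f_before = self.backbone(views_before.flatten(1)).mean(dim=0)  # fuse views
        f_after = self.backbone(views_after.flatten(1)).mean(dim=0)
        return self.regressor(torch.cat([f_before, f_after]))

# Self-supervision: the loss is the error in the recovered parameters, e.g.
#   loss = F.mse_loss(model(views_before, views_after), true_params)
```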