Unsupervised Learning of Compositional Scene Representations from
Multiple Unspecified Viewpoints
- URL: http://arxiv.org/abs/2112.03568v1
- Date: Tue, 7 Dec 2021 08:45:21 GMT
- Title: Unsupervised Learning of Compositional Scene Representations from
Multiple Unspecified Viewpoints
- Authors: Jinyang Yuan, Bin Li, Xiangyang Xue
- Abstract summary: We consider a novel problem of learning compositional scene representations from multiple unspecified viewpoints without using any supervision.
We propose a deep generative model which separates latent representations into a viewpoint-independent part and a viewpoint-dependent part to solve this problem.
Experiments on several specifically designed synthetic datasets have shown that the proposed method is able to effectively learn from multiple unspecified viewpoints.
- Score: 41.07379505694274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual scenes are extremely rich in diversity, not only because
there are infinite combinations of objects and background, but also because
observations of the same scene may vary greatly as the viewpoint changes.
When observing a visual scene that contains multiple objects from multiple
viewpoints, humans are able to perceive the scene in a compositional way from
each viewpoint, while achieving the so-called "object constancy" across
different viewpoints, even though the exact viewpoints are not specified. This
ability is essential for humans to identify the same object while moving and
to learn from vision efficiently. It is intriguing to design models with a
similar ability. In this paper, we consider a novel problem of learning
compositional scene representations from multiple unspecified viewpoints
without using any supervision, and propose a deep generative model which
separates latent representations into a viewpoint-independent part and a
viewpoint-dependent part to solve this problem. To infer latent
representations, the information contained in different viewpoints is
iteratively integrated by neural networks. Experiments on several specifically
designed synthetic datasets have shown that the proposed method is able to
effectively learn from multiple unspecified viewpoints.
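To make the idea of the latent split concrete, the sketch below is a minimal, illustrative toy (not the authors' implementation): all module names, latent sizes, the mean-pooling aggregation across views, and the additive composition in render() are assumptions introduced here. It only shows the general pattern described in the abstract: viewpoint-independent latents per object, viewpoint-dependent latents per view, and an iterative update that integrates information from all observed views.

    import torch
    import torch.nn as nn

    class ViewpointFactorisedModel(nn.Module):
        """Toy model with viewpoint-independent object latents and
        viewpoint-dependent view latents (illustrative assumptions only)."""

        def __init__(self, num_objects=4, obj_dim=32, view_dim=8,
                     feat_dim=64, img_dim=3 * 32 * 32):
            super().__init__()
            self.num_objects = num_objects
            # Encode each observed image (flattened for brevity) into a feature vector.
            self.encoder = nn.Sequential(nn.Linear(img_dim, 128), nn.ReLU(),
                                         nn.Linear(128, feat_dim))
            # Recurrent updates integrate view features into the two latent parts.
            self.obj_update = nn.GRUCell(feat_dim, obj_dim)    # viewpoint-independent
            self.view_update = nn.GRUCell(feat_dim, view_dim)  # viewpoint-dependent
            # Decode one object as seen from one viewpoint.
            self.decoder = nn.Sequential(nn.Linear(obj_dim + view_dim, 128), nn.ReLU(),
                                         nn.Linear(128, img_dim))

        def infer(self, images, num_iters=3):
            """images: [num_views, img_dim]; returns (object latents, view latents)."""
            feats = self.encoder(images)                       # [num_views, feat_dim]
            z_obj = torch.zeros(self.num_objects, self.obj_update.hidden_size)
            z_view = torch.zeros(images.shape[0], self.view_update.hidden_size)
            for _ in range(num_iters):
                # Object latents receive a view-pooled summary, so they cannot
                # encode which viewpoint the information came from.
                pooled = feats.mean(dim=0, keepdim=True).expand(self.num_objects, -1)
                z_obj = self.obj_update(pooled, z_obj)
                # View latents are updated from their own view's features only.
                z_view = self.view_update(feats, z_view)
            return z_obj, z_view

        def render(self, z_obj, z_view):
            """Decode every (object, view) pair and sum the objects per view."""
            frames = []
            for v in range(z_view.shape[0]):
                pairs = torch.cat([z_obj, z_view[v].expand(self.num_objects, -1)], dim=-1)
                frames.append(self.decoder(pairs).sum(dim=0))  # naive additive composition
            return torch.stack(frames)                         # [num_views, img_dim]

In an unsupervised setting of this kind, training would compare the rendered views against the observed images with a reconstruction objective (plus whatever regularization the generative model requires), so the model never needs viewpoint labels.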
Related papers
- Improving Viewpoint-Independent Object-Centric Representations through Active Viewpoint Selection [41.419853273742746]
We propose a novel active viewpoint selection strategy for object-centric learning.
For each scene, it predicts images from unknown viewpoints based on information from the observed images.
Our method can accurately predict images from unknown viewpoints.
arXiv Detail & Related papers (2024-11-01T07:01:44Z)
- Learning Global Object-Centric Representations via Disentangled Slot Attention
This paper introduces a novel object-centric learning method that learns a set of global object-centric representations, empowering AI systems with human-like capabilities to identify objects across scenes and to generate diverse scenes containing specific objects.
Experimental results substantiate the efficacy of the proposed method, demonstrating remarkable proficiency in global object-centric representation learning, object identification, scene generation with specific objects, and scene decomposition.
arXiv Detail & Related papers (2024-10-24T14:57:00Z)
- Unsupervised Object-Centric Learning from Multiple Unspecified Viewpoints [45.88397367354284]
We consider a novel problem of learning compositional scene representations from multiple unspecified viewpoints without using any supervision.
We propose a deep generative model which separates latent representations into a viewpoint-independent part and a viewpoint-dependent part to solve this problem.
Experiments on several specifically designed synthetic datasets have shown that the proposed method can effectively learn from multiple unspecified viewpoints.
arXiv Detail & Related papers (2024-01-03T15:09:25Z)
- A Computational Account Of Self-Supervised Visual Learning From Egocentric Object Play [3.486683381782259]
We study how learning signals that equate different viewpoints can support robust visual learning.
We find that representations learned by equating different physical viewpoints of an object benefit downstream image classification accuracy.
arXiv Detail & Related papers (2023-05-30T22:42:03Z)
- Neural Groundplans: Persistent Neural Scene Representations from a Single Image [90.04272671464238]
We present a method to map 2D image observations of a scene to a persistent 3D scene representation.
We propose conditional neural groundplans as persistent and memory-efficient scene representations.
arXiv Detail & Related papers (2022-07-22T17:41:24Z)
- Peripheral Vision Transformer [52.55309200601883]
We take a biologically inspired approach and explore modeling peripheral vision in deep neural networks for visual recognition.
We propose to incorporate peripheral position encoding into the multi-head self-attention layers to let the network learn to partition the visual field into diverse peripheral regions given training data.
We evaluate the proposed network, dubbed PerViT, on the large-scale ImageNet dataset and systematically investigate the inner workings of the model for machine perception.
arXiv Detail & Related papers (2022-06-14T12:47:47Z)
- Spot the Difference: A Cooperative Object-Referring Game in Non-Perfectly Co-Observable Scene [47.7861036048079]
This paper proposes an object-referring game in a non-perfectly co-observable visual scene.
The goal is to spot the difference between similar visual scenes by conversing in natural language.
We construct a large-scale multimodal dataset, named SpotDiff, which contains 87k Virtual Reality images and 97k dialogs generated by self-play.
arXiv Detail & Related papers (2022-03-16T02:55:33Z)
- Learning Object-Centric Representations of Multi-Object Scenes from Multiple Views [9.556376932449187]
Multi-View and Multi-Object Network (MulMON) is a method for learning accurate, object-centric representations of multi-object scenes by leveraging multiple views.
We show that MulMON resolves spatial ambiguities better than single-view methods.
arXiv Detail & Related papers (2021-11-13T13:54:28Z)
- Space-time Neural Irradiance Fields for Free-Viewpoint Video [54.436478702701244]
We present a method that learns a neural irradiance field for dynamic scenes from a single video.
Our learned representation enables free-view rendering of the input video.
arXiv Detail & Related papers (2020-11-25T18:59:28Z)