Unsupervised Object Learning via Common Fate
- URL: http://arxiv.org/abs/2110.06562v2
- Date: Mon, 15 May 2023 12:22:51 GMT
- Title: Unsupervised Object Learning via Common Fate
- Authors: Matthias Tangemann, Steffen Schneider, Julius von Kügelgen,
Francesco Locatello, Peter Gehler, Thomas Brox, Matthias Kümmerer, Matthias
Bethge, Bernhard Schölkopf
- Abstract summary: Learning generative object models from unlabelled videos is a long-standing problem that is required for causal scene modeling.
We decompose this problem into three easier subtasks and provide candidate solutions for each of them.
We show that our approach allows learning generative models that generalize beyond the occlusions present in the input videos.
- Score: 61.14802390241075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning generative object models from unlabelled videos is a
long-standing problem and is required for causal scene modeling. We decompose
this problem into three easier subtasks and provide candidate solutions for each of them.
Inspired by the Common Fate Principle of Gestalt Psychology, we first extract
(noisy) masks of moving objects via unsupervised motion segmentation. Second,
generative models are trained on the masks of the background and the moving
objects, respectively. Third, background and foreground models are combined in
a conditional "dead leaves" scene model to sample novel scene configurations
where occlusions and depth layering arise naturally. To evaluate the individual
stages, we introduce the Fishbowl dataset, positioned between complex real-world
scenes and common object-centric benchmarks of simplistic objects. We show that
our approach enables learning generative models that generalize beyond the
occlusions present in the input videos and that represent scenes in a modular
fashion, so that plausible scenes outside the training distribution can be
sampled, for instance with object numbers or densities not observed in the
training set.
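The pipeline above invites a small illustration. Below is a minimal, self-contained sketch of the conditional "dead leaves" compositing step described in the abstract: independently sampled foreground objects are pasted over a sampled background in back-to-front order, so occlusions and depth layering arise from the compositing itself. The `sample_background` and `sample_object` callables stand in for the learned generative models and are hypothetical; this is not the authors' implementation.

```python
import numpy as np

def sample_dead_leaves_scene(sample_background, sample_object, n_objects, rng):
    """Minimal dead-leaves compositing sketch (illustrative, not the paper's code).

    sample_background() -> (H, W, 3) float image
    sample_object()     -> ((h, w, 3) appearance, (h, w) binary mask)
    Objects are pasted back-to-front at random positions, so later
    (nearer) objects occlude earlier ones and the background.
    """
    canvas = sample_background()
    H, W, _ = canvas.shape
    for _ in range(n_objects):            # back-to-front layering
        appearance, mask = sample_object()
        h, w = mask.shape
        top = rng.integers(0, H - h + 1)  # random placement
        left = rng.integers(0, W - w + 1)
        region = canvas[top:top + h, left:left + w]
        m = mask[..., None].astype(canvas.dtype)
        canvas[top:top + h, left:left + w] = m * appearance + (1 - m) * region
    return canvas

# Toy usage with random "generators" standing in for the learned models.
rng = np.random.default_rng(0)
bg = lambda: rng.uniform(0, 1, size=(128, 128, 3))
def obj():
    h = w = 32
    yy, xx = np.mgrid[:h, :w]
    mask = ((yy - h / 2) ** 2 + (xx - w / 2) ** 2) < (h / 3) ** 2  # disc-shaped object
    return rng.uniform(0, 1, size=(h, w, 3)), mask
scene = sample_dead_leaves_scene(bg, obj, n_objects=5, rng=rng)
```

Because each object is sampled independently, the number or density of objects can be varied freely at sampling time, which is exactly the out-of-distribution flexibility the abstract describes.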
Related papers
- Object-level Scene Deocclusion [92.39886029550286]
We present a new self-supervised PArallel visible-to-COmplete diffusion framework, named PACO, for object-level scene deocclusion.
To train PACO, we create a large-scale dataset with 500k samples to enable self-supervised learning.
Experiments on COCOA and various real-world scenes demonstrate the superior capability of PACO for scene deocclusion, surpassing the state of the art by a large margin.
arXiv Detail & Related papers (2024-06-11T20:34:10Z)
- Neural Rendering of Humans in Novel View and Pose from Monocular Video [68.37767099240236]
We introduce a new method that generates photo-realistic humans under novel views and poses given a monocular video as input.
Our method significantly outperforms existing approaches under unseen poses and novel views given monocular videos as input.
arXiv Detail & Related papers (2022-04-04T03:09:20Z)
- Towards 3D Scene Understanding by Referring Synthetic Models [65.74211112607315]
Methods typically rely on extensive annotations of real scene scans.
We explore how synthetic models can alleviate this annotation burden, mapping synthetic and real features of the same categories to a unified feature space.
Experiments show that our method achieves average mAPs of 46.08% and 55.49% on the ScanNet and S3DIS datasets, respectively, by learning from synthetic models.
arXiv Detail & Related papers (2022-03-20T13:06:15Z)
- Learning Multi-Object Dynamics with Compositional Neural Radiance Fields [63.424469458529906]
We present a method to learn compositional predictive models from image observations based on implicit object encoders, Neural Radiance Fields (NeRFs), and graph neural networks.
NeRFs have become a popular choice for representing scenes due to their strong 3D prior.
For planning, we utilize RRTs in the learned latent space, where we can exploit our model and the implicit object encoder to make sampling the latent space informative and more efficient.
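As an aside on the planning step just mentioned: running an RRT in a learned latent space can be sketched in a few lines. The `dynamics` and `sample_action` callables below are hypothetical stand-ins for the learned model components; this is a generic kinodynamic RRT, not the paper's code.

```python
import numpy as np

def latent_rrt(z_start, z_goal, sample_action, dynamics, rng,
               n_iters=2000, goal_tol=0.1):
    """Sketch of an RRT in a learned latent space (illustrative only).

    dynamics(z, a) -> next latent state; sample_action() -> random action.
    Tree nodes are latent states; we extend the node nearest to a random
    target, as in a standard kinodynamic RRT.
    """
    nodes, parents, actions = [z_start], [-1], [None]
    for _ in range(n_iters):
        # Bias 10% of the targets toward the goal to speed up convergence.
        target = z_goal if rng.random() < 0.1 else rng.normal(size=z_start.shape)
        nearest = int(np.argmin([np.linalg.norm(z - target) for z in nodes]))
        a = sample_action()
        z_new = dynamics(nodes[nearest], a)   # roll the learned model forward
        nodes.append(z_new); parents.append(nearest); actions.append(a)
        if np.linalg.norm(z_new - z_goal) < goal_tol:
            plan, i = [], len(nodes) - 1
            while parents[i] != -1:           # backtrack the action sequence
                plan.append(actions[i]); i = parents[i]
            return plan[::-1]
    return None  # no plan found within the budget

# Toy usage with a linear stand-in for the learned dynamics.
rng = np.random.default_rng(0)
dyn = lambda z, a: z + 0.1 * a
act = lambda: rng.uniform(-1, 1, size=4)
plan = latent_rrt(np.zeros(4), np.ones(4), act, dyn, rng)
```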
arXiv Detail & Related papers (2022-02-24T01:31:29Z)
- Conditional Object-Centric Learning from Video [34.012087337046005]
We introduce a sequential extension to Slot Attention that is trained to predict optical flow for realistic-looking synthetic scenes.
We show that conditioning the initial state of this model on a small set of hints, such as the center of mass of objects in the first frame, is sufficient to significantly improve instance segmentation.
These benefits generalize beyond the training distribution to novel objects, novel backgrounds, and to longer video sequences.
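The conditioning idea lends itself to a brief sketch: rather than drawing initial slots from a random prior, they are produced from per-object hints such as first-frame centers of mass. The small MLP below and its layer sizes are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HintToSlots(nn.Module):
    """Map per-object hints (e.g. first-frame centers of mass) to initial
    slot vectors, instead of drawing slots from a random prior.
    Illustrative sketch; layer sizes are arbitrary assumptions."""
    def __init__(self, hint_dim=2, slot_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hint_dim, 128), nn.ReLU(),
            nn.Linear(128, slot_dim),
        )

    def forward(self, hints):                 # hints: (B, K, hint_dim)
        return self.net(hints)                # slots: (B, K, slot_dim)

# Usage: initialize K slots from K (x, y) centers of mass, then hand the
# slots to a (sequential) Slot Attention module for the rest of the video.
hints = torch.rand(1, 5, 2)                   # 5 objects, normalized coords
init_slots = HintToSlots()(hints)
```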
arXiv Detail & Related papers (2021-11-24T16:10:46Z)
- Visiting the Invisible: Layer-by-Layer Completed Scene Decomposition [57.088328223220934]
Existing scene understanding systems mainly focus on recognizing the visible parts of a scene, ignoring the intact appearance of physical objects in the real world.
In this work, we propose a higher-level scene understanding system to tackle both visible and invisible parts of objects and backgrounds in a given scene.
arXiv Detail & Related papers (2021-04-12T11:37:23Z)
- GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields [45.21191307444531]
Deep generative models allow for photorealistic image synthesis at high resolutions.
But for many applications, this is not enough: content creation also needs to be controllable.
Our key hypothesis is that incorporating a compositional 3D scene representation into the generative model leads to more controllable image synthesis.
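For intuition, GIRAFFE-style composition of per-object feature fields can be sketched as a density-weighted average of features at each 3D sample point, so the object with the highest density dominates locally and occlusion falls out of the compositing. The sketch below follows that spirit; shapes and names are assumptions, not the authors' code.

```python
import numpy as np

def compose_feature_fields(densities, features, eps=1e-8):
    """Density-weighted composition of per-object feature fields
    (illustrative sketch in the spirit of GIRAFFE's compositing operator).

    densities: (N_objects, N_points)     volume densities per object field
    features:  (N_objects, N_points, C)  feature vectors per object field
    Returns the summed density and the density-weighted mean feature at
    each sample point.
    """
    total = densities.sum(axis=0)                           # (N_points,)
    weights = densities / (total[None, :] + eps)            # (N_objects, N_points)
    combined = (weights[..., None] * features).sum(axis=0)  # (N_points, C)
    return total, combined

# Toy usage: 3 objects + background, 1024 ray sample points, 128-dim features.
dens = np.random.rand(4, 1024)
feats = np.random.rand(4, 1024, 128)
sigma, feat = compose_feature_fields(dens, feats)
```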
arXiv Detail & Related papers (2020-11-24T14:14:15Z)
- Towards causal generative scene models via competition of experts [26.181132737834826]
We present an alternative approach that uses an inductive bias encouraging modularity by training an ensemble of generative models (experts).
During training, experts compete for explaining parts of a scene, and thus specialise on different object classes, with objects being identified as parts that re-occur across multiple scenes.
Our model allows for controllable sampling of individual objects and recombination of experts in physically plausible ways.
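The competition mechanism can be illustrated with a soft winner-take-most loss: each expert reconstructs the scene, pixels are softly assigned to the expert that explains them best, and each expert is trained mainly on "its" pixels, which drives specialization. Everything below is an illustrative sketch, not the paper's implementation.

```python
import torch

def expert_competition_loss(scene, reconstructions, temperature=0.1):
    """Soft competition among generative experts (illustrative sketch).

    scene:           (B, C, H, W) input image
    reconstructions: (E, B, C, H, W), one reconstruction per expert
    Pixels are softly assigned to the expert with the lowest per-pixel
    error; each expert is then trained mostly on its own pixels, which
    encourages specialization on different object classes.
    """
    errors = ((reconstructions - scene.unsqueeze(0)) ** 2).mean(dim=2)  # (E, B, H, W)
    weights = torch.softmax(-errors / temperature, dim=0)               # soft winner
    # Detach the assignment so gradients flow only through the errors.
    return (weights.detach() * errors).sum(dim=0).mean()

# Toy usage with three random "experts".
scene = torch.rand(2, 3, 32, 32)
recs = torch.rand(3, 2, 3, 32, 32)
loss = expert_competition_loss(scene, recs)
```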
arXiv Detail & Related papers (2020-04-27T16:10:04Z)
- Object-Centric Image Generation with Factored Depths, Locations, and Appearances [30.541425619507184]
We present a generative model of images that explicitly reasons over the set of objects they show.
Our model learns a structured latent representation that separates objects from each other and from the background.
It can be trained from images alone in a purely unsupervised fashion without the need for object masks or depth information.
arXiv Detail & Related papers (2020-04-01T18:00:11Z)