Conditional Object-Centric Learning from Video
- URL: http://arxiv.org/abs/2111.12594v1
- Date: Wed, 24 Nov 2021 16:10:46 GMT
- Title: Conditional Object-Centric Learning from Video
- Authors: Thomas Kipf, Gamaleldin F. Elsayed, Aravindh Mahendran, Austin Stone,
Sara Sabour, Georg Heigold, Rico Jonschkowski, Alexey Dosovitskiy, Klaus
Greff
- Abstract summary: We introduce a sequential extension to Slot Attention, trained to predict optical flow for realistic-looking synthetic scenes.
We show that conditioning the initial state of this model on a small set of hints, such as the center of mass of objects in the first frame, is sufficient to significantly improve instance segmentation.
These benefits generalize beyond the training distribution to novel objects, novel backgrounds, and to longer video sequences.
- Score: 34.012087337046005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Object-centric representations are a promising path toward more systematic
generalization by providing flexible abstractions upon which compositional
world models can be built. Recent work on simple 2D and 3D datasets has shown
that models with object-centric inductive biases can learn to segment and
represent meaningful objects from the statistical structure of the data alone
without the need for any supervision. However, such fully-unsupervised methods
still fail to scale to diverse realistic data, despite the use of increasingly
complex inductive biases such as priors for the size of objects or the 3D
geometry of the scene. In this paper, we instead take a weakly-supervised
approach and focus on how 1) using the temporal dynamics of video data in the
form of optical flow and 2) conditioning the model on simple object location
cues can be used to enable segmenting and tracking objects in significantly
more realistic synthetic data. We introduce a sequential extension to Slot
Attention which we train to predict optical flow for realistic-looking
synthetic scenes and show that conditioning the initial state of this model on
a small set of hints, such as the center of mass of objects in the first frame, is
sufficient to significantly improve instance segmentation. These benefits
generalize beyond the training distribution to novel objects, novel
backgrounds, and to longer video sequences. We also find that such
initial-state-conditioning can be used during inference as a flexible interface
to query the model for specific objects or parts of objects, which could pave
the way for a range of weakly-supervised approaches and allow more effective
interaction with trained models.
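The conditioning mechanism described above amounts to initializing the slot states from simple per-object hints and then alternating a per-frame Slot Attention "corrector" with a "predictor" that carries the slots to the next frame. Below is a minimal sketch of that idea, not the authors' implementation (which also includes a convolutional encoder and a decoder trained to predict optical flow); the module names, tensor shapes, and the use of PyTorch are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of conditional, sequential Slot Attention:
# slots are initialized from per-object hints (e.g. first-frame centers of mass),
# updated once per frame by Slot Attention over that frame's features, and carried
# to the next frame by a small predictor. Names and sizes are illustrative.
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    def __init__(self, dim=64, iters=2):
        super().__init__()
        self.iters, self.scale = iters, dim ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, slots, inputs):               # slots: (B, K, D), inputs: (B, N, D)
        inputs = self.norm_in(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=1)  # slots compete per location
            attn = attn / attn.sum(dim=-1, keepdim=True)                # weighted mean over inputs
            updates = attn @ v
            slots = self.gru(updates.reshape(-1, updates.size(-1)),
                             slots.reshape(-1, slots.size(-1))).view_as(slots)
        return slots

class ConditionalVideoSlots(nn.Module):
    def __init__(self, dim=64, hint_dim=2):
        super().__init__()
        # map simple hints (e.g. 2D object centers in the first frame) to initial slot states
        self.init_from_hint = nn.Sequential(nn.Linear(hint_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.corrector = SlotAttention(dim)
        self.predictor = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

    def forward(self, frame_features, hints):        # frame_features: (B, T, N, D), hints: (B, K, 2)
        slots = self.init_from_hint(hints)            # condition the initial state on object hints
        per_frame = []
        for t in range(frame_features.size(1)):
            slots = self.corrector(slots, frame_features[:, t])  # correct with the current frame
            per_frame.append(slots)
            slots = self.predictor(slots)                          # predict slots for the next frame
        return torch.stack(per_frame, dim=1)          # (B, T, K, D) per-frame slot representations
```

A decoder (omitted here) would map each per-frame slot back to per-object flow or mask predictions for training and evaluation; the same hint interface could be used at inference time to query the model for specific objects, as described in the abstract.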
Related papers
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments consistently demonstrates our method's superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z)
- pix2gestalt: Amodal Segmentation by Synthesizing Wholes [34.45464291259217]
pix2gestalt is a framework for zero-shot amodal segmentation.
We learn a conditional diffusion model for reconstructing whole objects in challenging zero-shot cases.
arXiv Detail & Related papers (2024-01-25T18:57:36Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models [28.304047711166056]
Large-scale pre-trained models have shown promising advances in detecting and segmenting objects in 2D static images in the wild.
This begs the question: can we re-purpose these large-scale pre-trained static image models for open-vocabulary video tracking?
In this paper, we re-purpose an open-vocabulary detector, segmenter, and dense optical flow estimator, into a model that tracks and segments objects of any category in 2D videos.
arXiv Detail & Related papers (2023-10-10T20:25:30Z)
- UniQuadric: A SLAM Backend for Unknown Rigid Object 3D Tracking and Light-Weight Modeling [7.626461564400769]
We propose a novel SLAM backend that unifies ego-motion tracking, rigid object motion tracking, and modeling.
Our system showcases the potential application of object perception in complex dynamic scenes.
arXiv Detail & Related papers (2023-09-29T07:50:09Z)
- Bridging the Gap to Real-World Object-Centric Learning [66.55867830853803]
We show that reconstructing features from models trained in a self-supervised manner is a sufficient training signal for object-centric representations to arise in a fully unsupervised way.
Our approach, DINOSAUR, significantly outperforms existing object-centric learning models on simulated data.
arXiv Detail & Related papers (2022-09-29T15:24:47Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency [114.02182755620784]
We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular camera setup without supervision.
Our framework is shown to outperform the state-of-the-art depth and motion estimation methods.
arXiv Detail & Related papers (2021-02-04T14:26:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.