SAVi++: Towards End-to-End Object-Centric Learning from Real-World
Videos
- URL: http://arxiv.org/abs/2206.07764v1
- Date: Wed, 15 Jun 2022 18:57:07 GMT
- Title: SAVi++: Towards End-to-End Object-Centric Learning from Real-World
Videos
- Authors: Gamaleldin F. Elsayed, Aravindh Mahendran, Sjoerd van Steenkiste,
Klaus Greff, Michael C. Mozer, Thomas Kipf
- Abstract summary: We introduce SAVi++, an object-centric video model which is trained to predict depth signals from a slot-based video representation.
By using sparse depth signals obtained from LiDAR, SAVi++ is able to learn emergent object segmentation and tracking from videos in the real-world Waymo Open dataset.
- Score: 23.64091569954785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The visual world can be parsimoniously characterized in terms of distinct
entities with sparse interactions. Discovering this compositional structure in
dynamic visual scenes has proven challenging for end-to-end computer vision
approaches unless explicit instance-level supervision is provided. Slot-based
models leveraging motion cues have recently shown great promise in learning to
represent, segment, and track objects without direct supervision, but they
still fail to scale to complex real-world multi-object videos. In an effort to
bridge this gap, we take inspiration from human development and hypothesize
that information about scene geometry in the form of depth signals can
facilitate object-centric learning. We introduce SAVi++, an object-centric
video model which is trained to predict depth signals from a slot-based video
representation. By further leveraging best practices for model scaling, we are
able to train SAVi++ to segment complex dynamic scenes recorded with moving
cameras, containing both static and moving objects of diverse appearance on
naturalistic backgrounds, without the need for segmentation supervision.
Finally, we demonstrate that by using sparse depth signals obtained from LiDAR,
SAVi++ is able to learn emergent object segmentation and tracking from videos
in the real-world Waymo Open dataset.
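
Since the abstract centers on depth prediction as the training signal, a minimal sketch of that idea may help: per-slot decoders emit a depth map and a mixing logit, the slots are softmax-combined into one depth image, and the loss is evaluated only at pixels with a LiDAR return. This is an illustrative reconstruction, not the authors' code; all names and shapes are assumptions.

```python
import torch

def composite_depth(slot_depths, slot_alpha_logits):
    """Combine per-slot depth maps into a single depth prediction.

    slot_depths:       (B, K, H, W) depth map decoded from each of K slots
    slot_alpha_logits: (B, K, H, W) unnormalized per-slot mixing logits
    Returns the predicted depth (B, H, W) and the soft per-slot masks
    (B, K, H, W); the masks are where object segmentation can emerge.
    """
    masks = torch.softmax(slot_alpha_logits, dim=1)  # pixels split across slots
    depth = (masks * slot_depths).sum(dim=1)         # mask-weighted depth mixture
    return depth, masks

def sparse_depth_loss(pred_depth, lidar_depth, valid):
    """L2 loss evaluated only at pixels that have a LiDAR return.

    pred_depth:  (B, H, W) model prediction
    lidar_depth: (B, H, W) LiDAR depth, arbitrary where invalid
    valid:       (B, H, W) boolean mask of pixels with a return
    """
    return ((pred_depth - lidar_depth) ** 2)[valid].mean()
```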
Related papers
- Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models [28.304047711166056]
Large-scale pre-trained models have shown promising advances in detecting and segmenting objects in 2D static images in the wild.
This raises the question: can we re-purpose these large-scale pre-trained static-image models for open-vocabulary video tracking?
In this paper, we re-purpose an open-vocabulary detector, segmenter, and dense optical flow estimator, into a model that tracks and segments objects of any category in 2D videos.
arXiv Detail & Related papers (2023-10-10T20:25:30Z)
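
The pipeline above combines a detector and segmenter with a dense optical flow estimator; a standard way to use flow for tracking is to warp a frame's segmentation mask into the next frame. A minimal sketch of that propagation step (an assumed mechanism, not the paper's actual code):

```python
import torch
import torch.nn.functional as F

def warp_mask_with_flow(mask, flow):
    """Propagate a segmentation mask to the next frame using backward flow.

    mask: (B, 1, H, W) soft mask at frame t
    flow: (B, 2, H, W) backward optical flow; flow[:, 0] is x-displacement,
          flow[:, 1] is y-displacement, mapping frame t+1 pixels back to frame t.
    Returns the mask resampled at frame t+1, shape (B, 1, H, W).
    """
    B, _, H, W = mask.shape
    # Pixel grid for frame t+1.
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=mask.dtype),
        torch.arange(W, dtype=mask.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow  # source coords in frame t
    # Normalize to [-1, 1] for grid_sample (x coordinate first, then y).
    gx = 2.0 * grid[:, 0] / (W - 1) - 1.0
    gy = 2.0 * grid[:, 1] / (H - 1) - 1.0
    norm_grid = torch.stack((gx, gy), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(mask, norm_grid, align_corners=True)
```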
- Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation [47.39455910191075]
Video amodal segmentation is a challenging task in computer vision.
Recent studies have achieved promising performance by using motion flow to integrate information across frames under a self-supervised setting.
This paper presents a rethinking of previous work. In particular, we leverage supervised signals with an object-centric representation.
arXiv Detail & Related papers (2023-09-23T04:12:02Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
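
A depth-ordered layer representation resolves occlusion by compositing layers front to back. A minimal sketch of "over" compositing, which computes how much of each layer survives occlusion (illustrative; the names and frontmost-first ordering are assumptions):

```python
import torch

def composite_layers(alphas):
    """Front-to-back 'over' compositing of depth-ordered object layers.

    alphas: (K, H, W) per-layer opacities, index 0 being the frontmost layer.
    Returns visibility (K, H, W): the fraction of each pixel each layer
    actually covers after occlusion by the layers in front of it.
    """
    K, _, _ = alphas.shape
    visibility = torch.empty_like(alphas)
    transmittance = torch.ones_like(alphas[0])  # not yet blocked by nearer layers
    for k in range(K):
        visibility[k] = transmittance * alphas[k]
        transmittance = transmittance * (1.0 - alphas[k])
    return visibility
```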
- Object Scene Representation Transformer [56.40544849442227]
We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis.
OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods.
It is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder.
arXiv Detail & Related papers (2022-06-14T15:40:47Z)
- PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is an unsupervised SSL framework for selecting clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS.
arXiv Detail & Related papers (2021-12-01T19:49:57Z)
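
PreViTS builds on momentum contrast; a minimal sketch of the underlying MoCo-style InfoNCE loss, with the tracking-based region selection and queue updates omitted (illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def info_nce(query, key, queue, temperature=0.07):
    """MoCo-style InfoNCE loss.

    query: (B, D) features from the online encoder (object-constrained crop)
    key:   (B, D) features of the positive view from the momentum encoder
    queue: (N, D) memory bank of negative keys from past batches
    """
    query = F.normalize(query, dim=1)
    key = F.normalize(key, dim=1)
    queue = F.normalize(queue, dim=1)
    l_pos = (query * key).sum(dim=1, keepdim=True)  # (B, 1) positive logits
    l_neg = query @ queue.t()                       # (B, N) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    # The positive key sits at index 0 of each row.
    labels = torch.zeros(query.size(0), dtype=torch.long, device=query.device)
    return F.cross_entropy(logits, labels)
```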
- Conditional Object-Centric Learning from Video [34.012087337046005]
We introduce a sequential extension to Slot Attention that is trained to predict optical flow for realistic-looking synthetic scenes.
We show that conditioning the initial state of this model on a small set of hints, such as center of mass of objects in the first frame, is sufficient to significantly improve instance segmentation.
These benefits generalize beyond the training distribution to novel objects, novel backgrounds, and to longer video sequences.
arXiv Detail & Related papers (2021-11-24T16:10:46Z)
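
The conditioning described above boils down to mapping cheap first-frame hints, such as object centers of mass, to initial slot states. A minimal sketch under that assumption (the dimensions and module names are hypothetical):

```python
import torch
import torch.nn as nn

class SlotInit(nn.Module):
    """Map per-object (x, y) center-of-mass hints to initial slot vectors."""

    def __init__(self, slot_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, slot_dim), nn.ReLU(), nn.Linear(slot_dim, slot_dim)
        )

    def forward(self, centers):
        # centers: (B, K, 2) normalized object centers in the first frame.
        # Returns (B, K, slot_dim) initial slots; a recurrent slot-attention
        # module would then update these as the video unfolds.
        return self.mlp(centers)
```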
- Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency [114.02182755620784]
We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular camera setup without supervision.
Our framework is shown to outperform the state-of-the-art depth and motion estimation methods.
arXiv Detail & Related papers (2021-02-04T14:26:42Z)
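
Projection consistency of this kind is typically enforced by back-projecting target pixels with predicted depth, applying a rigid 6-DoF transform (ego-motion, or a per-object motion inside that object's mask), and re-projecting to sample the source image. A generic sketch of that reprojection (illustrative, not the paper's implementation):

```python
import torch

def reproject(depth, K, T):
    """Reproject target-frame pixels into a source frame.

    depth: (H, W) predicted depth for the target frame
    K:     (3, 3) camera intrinsics
    T:     (4, 4) relative 6-DoF pose (target -> source)
    Returns (H, W, 2) source-frame pixel coordinates used to sample the
    source image for a photometric consistency loss.
    """
    H, W = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype),
        torch.arange(W, dtype=depth.dtype),
        indexing="ij",
    )
    ones = torch.ones_like(xs)
    pix = torch.stack((xs, ys, ones), dim=-1).reshape(-1, 3)      # homogeneous pixels
    cam = (torch.linalg.inv(K) @ pix.t()) * depth.reshape(1, -1)  # back-project
    cam_h = torch.cat((cam, torch.ones(1, H * W, dtype=depth.dtype)), dim=0)
    src = (T @ cam_h)[:3]                                         # rigid transform
    uv = K @ src                                                  # project
    uv = (uv[:2] / uv[2].clamp(min=1e-6)).t().reshape(H, W, 2)
    return uv
```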
- DyStaB: Unsupervised Object Segmentation via Dynamic-Static Bootstrapping [72.84991726271024]
We describe an unsupervised method to detect and segment portions of images of live scenes that are seen moving as a coherent whole.
Our method first partitions the motion field by minimizing the mutual information between segments.
It uses the segments to learn object models that can be used for detection in a static image.
arXiv Detail & Related papers (2020-08-16T22:05:13Z)
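
DyStaB partitions the motion field by minimizing mutual information between segments. As a loose illustration, mutual information between two discrete segmentations can be computed from their joint label histogram (a generic estimator, not the paper's exact objective):

```python
import torch

def mutual_information(p_joint):
    """Mutual information of a discrete joint distribution.

    p_joint: (A, B) joint probability table over two labelings, e.g.
             estimated by histogramming co-occurring segment labels.
    """
    p_joint = p_joint / p_joint.sum()
    pa = p_joint.sum(dim=1, keepdim=True)  # marginal over rows
    pb = p_joint.sum(dim=0, keepdim=True)  # marginal over columns
    ratio = p_joint / (pa * pb)
    terms = p_joint * torch.log(ratio.clamp(min=1e-12))
    return terms[p_joint > 0].sum()        # 0 * log(0) terms contribute nothing
```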
This list is automatically generated from the titles and abstracts of the papers on this site.