SIMONe: View-Invariant, Temporally-Abstracted Object Representations via
Unsupervised Video Decomposition
- URL: http://arxiv.org/abs/2106.03849v1
- Date: Mon, 7 Jun 2021 17:59:23 GMT
- Title: SIMONe: View-Invariant, Temporally-Abstracted Object Representations via
Unsupervised Video Decomposition
- Authors: Rishabh Kabra, Daniel Zoran, Goker Erdogan, Loic Matthey, Antonia
Creswell, Matthew Botvinick, Alexander Lerchner, Christopher P. Burgess
- Abstract summary: We present an unsupervised variational approach to this problem.
Our model learns to infer two sets of latent representations from RGB video input alone.
It represents object attributes in an allocentric manner which does not depend on viewpoint.
- Score: 69.90530987240899
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To help agents reason about scenes in terms of their building blocks, we wish
to extract the compositional structure of any given scene (in particular, the
configuration and characteristics of objects comprising the scene). This
problem is especially difficult when scene structure needs to be inferred while
also estimating the agent's location/viewpoint, as the two variables jointly
give rise to the agent's observations. We present an unsupervised variational
approach to this problem. Leveraging the shared structure that exists across
different scenes, our model learns to infer two sets of latent representations
from RGB video input alone: a set of "object" latents, corresponding to the
time-invariant, object-level contents of the scene, as well as a set of "frame"
latents, corresponding to global time-varying elements such as viewpoint. This
factorization of latents allows our model, SIMONe, to represent object
attributes in an allocentric manner which does not depend on viewpoint.
Moreover, it allows us to disentangle object dynamics and summarize their
trajectories as time-abstracted, view-invariant, per-object properties. We
demonstrate these capabilities, as well as the model's performance in terms of
view synthesis and instance segmentation, across three procedurally generated
video datasets.
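As a loose illustration of the latent factorization described above (not the authors' implementation), the sketch below decodes a frame from a set of time-invariant object latents plus a single time-varying frame latent, mixing per-object RGB predictions with a softmax over slots. The latent sizes, the random linear decoder, and the coordinate encoding are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K, T = 4, 8             # number of object slots, number of frames (illustrative)
D_obj, D_frame = 16, 8  # latent sizes (illustrative)
H, W = 32, 32           # output resolution

# Time-invariant "object" latents and time-varying "frame" latents,
# standing in for samples from the model's approximate posterior.
object_latents = rng.normal(size=(K, D_obj))
frame_latents = rng.normal(size=(T, D_frame))

# Toy pixel-wise decoder: a random linear map from
# [object latent, frame latent, (x, y, t)] to (r, g, b, mixture logit).
D_in = D_obj + D_frame + 3
W_dec = rng.normal(size=(D_in, 4)) * 0.1

def decode_frame(t):
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    coords = np.stack([xs, ys, np.full_like(xs, t / T)], axis=-1)       # (H, W, 3)
    rgb_k = np.zeros((K, H, W, 3))
    logit_k = np.zeros((K, H, W))
    for k in range(K):
        feat = np.concatenate(
            [np.broadcast_to(object_latents[k], (H, W, D_obj)),
             np.broadcast_to(frame_latents[t], (H, W, D_frame)),
             coords], axis=-1)                                          # (H, W, D_in)
        out = feat @ W_dec                                              # (H, W, 4)
        rgb_k[k], logit_k[k] = out[..., :3], out[..., 3]
    # Softmax over object slots gives per-pixel mixture weights;
    # the composite frame is the weight-averaged RGB prediction.
    weights = np.exp(logit_k - logit_k.max(0))
    weights /= weights.sum(0)
    return (weights[..., None] * rgb_k).sum(0)                          # (H, W, 3)

frame0 = decode_frame(0)
print(frame0.shape)  # (32, 32, 3)
```

Because the object latents are shared across all frames while viewpoint-like information lives in the frame latents, properties read out from the object latents are, by construction, view-invariant and time-abstracted.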
Related papers
- 1st Place Solution for MOSE Track in CVPR 2024 PVUW Workshop: Complex Video Object Segmentation [72.54357831350762]
We propose a semantic embedding video object segmentation model and use the salient features of objects as query representations.
We trained our model on a large-scale video object segmentation dataset.
Our model achieves first place (84.45%) on the test set of the Complex Video Object Segmentation challenge.
arXiv Detail & Related papers (2024-06-07T03:13:46Z)
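Using salient object features as query representations is reminiscent of query-based segmentation, in which object queries cross-attend to frame features and are decoded into per-object masks. The sketch below only illustrates that generic pattern; the single attention head, the shapes, and the mask readout are assumptions rather than the authors' model.

```python
import numpy as np

rng = np.random.default_rng(4)

Q, D = 3, 32            # number of object queries and feature size (illustrative)
H, W = 16, 16           # spatial resolution of the frame feature map

queries = rng.normal(size=(Q, D))                  # object query representations
frame_feats = rng.normal(size=(H * W, D))          # flattened per-pixel features

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Single-head cross-attention: each object query attends over all pixels.
attn = softmax(queries @ frame_feats.T / np.sqrt(D), axis=-1)    # (Q, H*W)
updated_queries = attn @ frame_feats                             # (Q, D)

# Dot products between updated queries and pixel features give per-object
# mask logits; a softmax over queries assigns each pixel to one object.
mask_logits = updated_queries @ frame_feats.T                    # (Q, H*W)
masks = softmax(mask_logits, axis=0).reshape(Q, H, W)
print(masks.shape)  # (3, 16, 16)
```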
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Tracking through Containers and Occluders in the Wild [32.86030395660071]
We introduce TCOW, a new benchmark and model for visual tracking through heavy occlusion and containment.
We create a mixture of synthetic and annotated real datasets to support both supervised learning and structured evaluation of model performance.
We evaluate two recent transformer-based video models and find that while they can be surprisingly capable of tracking targets under certain settings of task variation, there remains a considerable performance gap before we can claim a tracking model to have acquired a true notion of object permanence.
arXiv Detail & Related papers (2023-05-04T17:59:58Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
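The depth-ordered layer representation mentioned in the entry above can be pictured as back-to-front "over" compositing of per-object layers. The snippet below is a generic sketch of that idea with made-up layer counts and random appearance/alpha maps; it is not the paper's architecture.

```python
import numpy as np

H, W, L = 64, 64, 3  # resolution and number of object layers (illustrative)
rng = np.random.default_rng(1)

# Each layer carries an RGB appearance map and an alpha (opacity/mask) map.
appearance = rng.uniform(size=(L, H, W, 3))
alpha = rng.uniform(size=(L, H, W, 1))

def composite(appearance, alpha):
    """Back-to-front 'over' compositing: layer 0 is farthest, layer L-1 nearest."""
    canvas = np.zeros((H, W, 3))
    for a_rgb, a_alpha in zip(appearance, alpha):
        canvas = a_alpha * a_rgb + (1.0 - a_alpha) * canvas
    return canvas

frame = composite(appearance, alpha)
print(frame.shape)  # (64, 64, 3)
```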
- Object-Centric Representation Learning with Generative Spatial-Temporal Factorization [5.403549896734018]
We propose Dynamics-aware Multi-Object Network (DyMON), a method that broadens the scope of multi-view object-centric representation learning to dynamic scenes.
We show that DyMON learns to factorize the entangled effects of observer motions and scene object dynamics from a sequence of observations.
We also show that the factorized scene representations support querying about a single object by space and time independently.
arXiv Detail & Related papers (2021-11-09T20:04:16Z)
- Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models relational-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z)
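To make the message-passing formulation above concrete, here is a minimal propagation step over scene-entity nodes. The adjacency, feature sizes, random weight matrices, and tanh update are stand-ins for illustration only; the paper's model additionally links entities across frames to capture temporal relations.

```python
import numpy as np

rng = np.random.default_rng(2)

N, D = 5, 16                       # number of scene entities, feature size (illustrative)
node_feats = rng.normal(size=(N, D))

# Adjacency over entities; in a spatio-temporal graph these edges would link
# entities within a frame and the same entity across neighbouring frames.
adj = (rng.uniform(size=(N, N)) > 0.5).astype(float)
np.fill_diagonal(adj, 0.0)

W_msg = rng.normal(size=(D, D)) * 0.1       # message transform (stand-in for learned weights)
W_upd = rng.normal(size=(2 * D, D)) * 0.1   # node update transform

def message_passing_step(h, adj):
    # Aggregate transformed neighbour features, then update each node
    # from its own state concatenated with the aggregated message.
    messages = adj @ (h @ W_msg)                      # (N, D)
    updated = np.concatenate([h, messages], -1) @ W_upd
    return np.tanh(updated)

for _ in range(3):                                    # a few rounds of propagation
    node_feats = message_passing_step(node_feats, adj)
print(node_feats.shape)  # (5, 16)
```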
- Unsupervised Object-Based Transition Models for 3D Partially Observable Environments [13.598250346370467]
The model is trained end-to-end without supervision using losses at the level of the object-structured representation rather than pixels.
We show that the combination of an object-level loss and correct object alignment over time enables the model to outperform a state-of-the-art baseline.
arXiv Detail & Related papers (2021-03-08T12:10:02Z)
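An object-level loss with correct object alignment over time is commonly realized by matching predicted object representations to their targets before comparing them, for example with Hungarian matching via scipy's linear_sum_assignment. The sketch below shows that generic recipe only; it is not the paper's exact loss.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(3)

K, D = 4, 8                                   # object slots and feature size (illustrative)
pred = rng.normal(size=(K, D))                # predicted per-object representations
target = pred[rng.permutation(K)] + 0.01 * rng.normal(size=(K, D))  # targets in unknown order

# Pairwise squared distances between predicted and target objects.
cost = ((pred[:, None, :] - target[None, :, :]) ** 2).sum(-1)       # (K, K)

# Hungarian matching aligns slots before the loss is computed, so the loss
# is invariant to the arbitrary ordering of object slots.
row, col = linear_sum_assignment(cost)
object_level_loss = cost[row, col].mean()
print(float(object_level_loss))
```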
- Unsupervised Video Decomposition using Spatio-temporal Iterative Inference [31.97227651679233]
Multi-object scene decomposition is a fast-emerging problem in machine learning.
We show that our model achieves high accuracy even without color information.
We demonstrate the decomposition and segmentation prediction capabilities of our model and show that it outperforms the state of the art on several benchmark datasets.
arXiv Detail & Related papers (2020-06-25T22:57:17Z)
- Benchmarking Unsupervised Object Representations for Video Sequences [111.81492107649889]
We compare the perceptual abilities of four object-centric approaches: ViMON, OP3, TBA and SCALOR.
Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking.
Our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
arXiv Detail & Related papers (2020-06-12T09:37:24Z)