Benchmarking Unsupervised Object Representations for Video Sequences
- URL: http://arxiv.org/abs/2006.07034v2
- Date: Tue, 29 Jun 2021 07:24:56 GMT
- Title: Benchmarking Unsupervised Object Representations for Video Sequences
- Authors: Marissa A. Weis, Kashyap Chitta, Yash Sharma, Wieland Brendel,
Matthias Bethge, Andreas Geiger and Alexander S. Ecker
- Abstract summary: We compare the perceptual abilities of four object-centric approaches: ViMON, OP3, TBA and SCALOR.
Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking.
Our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
- Score: 111.81492107649889
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Perceiving the world in terms of objects and tracking them through time is a
crucial prerequisite for reasoning and scene understanding. Recently, several
methods have been proposed for unsupervised learning of object-centric
representations. However, since these models were evaluated on different
downstream tasks, it remains unclear how they compare in terms of basic
perceptual abilities such as detection, figure-ground segmentation and tracking
of objects. To close this gap, we design a benchmark with four data sets of
varying complexity and seven additional test sets featuring challenging
tracking scenarios relevant for natural videos. Using this benchmark, we
compare the perceptual abilities of four object-centric approaches: ViMON, a
video-extension of MONet, based on recurrent spatial attention, OP3, which
exploits clustering via spatial mixture models, as well as TBA and SCALOR,
which use explicit factorization via spatial transformers. Our results suggest
that the architectures with unconstrained latent representations learn more
powerful representations in terms of object detection, segmentation and
tracking than the spatial transformer based architectures. We also observe that
none of the methods are able to gracefully handle the most challenging tracking
scenarios despite their synthetic nature, suggesting that our benchmark may
provide fruitful guidance towards learning more robust object-centric video
representations.
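For illustration, the mask-IoU matching that detection, segmentation and tracking metrics in benchmarks like this one typically build on can be sketched as follows. This is a generic sketch, not the authors' released evaluation code; the helper names and the 0.5 threshold are assumptions.
```python
# Minimal sketch of mask-IoU matching, the building block of
# detection/segmentation/tracking metrics. Illustrative only.
import numpy as np
from scipy.optimize import linear_sum_assignment

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean segmentation masks of equal shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def match_masks(preds, gts, iou_thresh=0.5):
    """Hungarian-match predicted masks to ground-truth masks by IoU.

    Returns (pred_idx, gt_idx, iou) triples for matches above threshold;
    unmatched predictions / ground truths would count as false
    positives / false negatives in a downstream metric.
    """
    if not preds or not gts:
        return []
    iou = np.array([[mask_iou(p, g) for g in gts] for p in preds])
    rows, cols = linear_sum_assignment(-iou)  # maximize total IoU
    return [(r, c, iou[r, c]) for r, c in zip(rows, cols)
            if iou[r, c] >= iou_thresh]
```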
Related papers
- 3D-Aware Instance Segmentation and Tracking in Egocentric Videos [107.10661490652822]
Egocentric videos present unique challenges for 3D scene understanding.
This paper introduces a novel approach to instance segmentation and tracking in first-person video.
By incorporating spatial and temporal cues, we achieve superior performance compared to state-of-the-art 2D approaches.
arXiv Detail & Related papers (2024-08-19T10:08:25Z)
- SeMoLi: What Moves Together Belongs Together [51.72754014130369]
We tackle semi-supervised object detection based on motion cues.
Recent results suggest that motion-based clustering methods can be used to pseudo-label instances of moving objects.
We re-think this approach and suggest that both object detection and motion-inspired pseudo-labeling can be tackled in a data-driven manner.
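As a rough illustration of motion-based clustering for pseudo-labels: points that move together are grouped into object proposals. The thresholds, the DBSCAN parameters, and the scene-flow input below are assumptions for the sketch, not SeMoLi's actual pipeline.
```python
# Illustrative sketch: cluster points jointly on position and motion so
# that "what moves together" becomes one pseudo-labeled object proposal.
import numpy as np
from sklearn.cluster import DBSCAN

def pseudo_label_moving_objects(points, flow, min_speed=0.5):
    """points: (N, 3) lidar points; flow: (N, 3) estimated scene flow
    (min_speed is in the same units per frame; an assumed threshold)."""
    moving = np.linalg.norm(flow, axis=1) > min_speed   # drop static points
    feats = np.hstack([points[moving], flow[moving]])   # position + motion
    labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(feats)
    ids = np.full(len(points), -1)
    ids[moving] = labels  # -1 marks static/noise; >= 0 are object proposals
    return ids
```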
arXiv Detail & Related papers (2024-02-29T18:54:53Z)
- DistFormer: Enhancing Local and Global Features for Monocular Per-Object Distance Estimation [35.6022448037063]
Per-object distance estimation is crucial in safety-critical applications such as autonomous driving, surveillance, and robotics.
Existing approaches rely on one of two scales: local information (i.e., the bounding box proportions) or global information.
Our work aims to strengthen both local and global cues.
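As a toy illustration of the local cue (bounding-box proportions): under a pinhole camera model, an object's distance is inversely proportional to its projected height. The constants below are assumptions; DistFormer itself learns local and global features jointly rather than using this closed form.
```python
# Pinhole-model baseline for distance from bounding-box height.
# All constants are illustrative assumptions, not the paper's method.
def distance_from_bbox(bbox_height_px: float,
                       real_height_m: float = 1.7,   # assumed pedestrian height
                       focal_px: float = 1000.0) -> float:
    """Z = f * H / h: focal length (px) times real height over pixel height."""
    return focal_px * real_height_m / bbox_height_px

print(distance_from_bbox(100.0))  # ~17 m for a 100-px-tall pedestrian
```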
arXiv Detail & Related papers (2024-01-06T10:56:36Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
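A minimal sketch of what such a sequence-level selection could look like: score each flow-predicted mask by its IoU agreement with temporally adjacent masks and keep the most consistent frames as exemplars. The scoring rule and naming are our simplification, not the paper's exact mechanism.
```python
# Pick exemplar frames whose masks agree most with their temporal neighbors.
import numpy as np

def select_exemplars(masks, k=3):
    """masks: list of (H, W) boolean numpy masks, one per frame."""
    def iou(a, b):
        union = np.logical_or(a, b).sum()
        return np.logical_and(a, b).sum() / union if union > 0 else 0.0

    scores = []
    for t in range(len(masks)):
        vals = [iou(masks[t], masks[n]) for n in (t - 1, t + 1)
                if 0 <= n < len(masks)]
        scores.append(sum(vals) / len(vals) if vals else 0.0)
    # keep the k temporally most self-consistent frames as exemplars
    return sorted(sorted(range(len(scores)), key=scores.__getitem__)[-k:])
```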
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Tracking through Containers and Occluders in the Wild [32.86030395660071]
We introduce TCOW, a new benchmark and model for visual tracking through heavy occlusion and containment.
We create a mixture of synthetic and annotated real datasets to support both supervised learning and structured evaluation of model performance.
We evaluate two recent transformer-based video models and find that while they can be surprisingly capable of tracking targets under certain settings of task variation, there remains a considerable performance gap before we can claim a tracking model to have acquired a true notion of object permanence.
arXiv Detail & Related papers (2023-05-04T17:59:58Z)
- End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time.
Our aim in this paper is to move beyond tracking-by-detection approaches, to class-agnostic tracking that performs well also for unknown object classes.
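A minimal sketch of the multi-query idea: one learned query per tracked object is decoded against per-frame image features, so identities can persist across frames. The shapes and the box head below are illustrative assumptions, not the paper's architecture.
```python
# One object query per track, decoded against flattened frame features.
import torch
import torch.nn as nn

num_queries, d_model = 16, 256
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=3,
)
queries = nn.Parameter(torch.randn(1, num_queries, d_model))  # object queries
box_head = nn.Linear(d_model, 4)  # per-query box (cx, cy, w, h)

frame_feats = torch.randn(1, 900, d_model)  # flattened 30x30 feature map
out = decoder(queries, frame_feats)         # (1, num_queries, d_model)
boxes = box_head(out).sigmoid()             # one box per object query
```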
arXiv Detail & Related papers (2022-10-26T10:19:37Z)
- Tackling Background Distraction in Video Object Segmentation [7.187425003801958]
Video object segmentation (VOS) aims to densely track certain objects in videos.
One of the main challenges in this task is the existence of background distractors that appear similar to the target objects.
We propose three novel strategies to suppress such distractors.
Our model achieves performance comparable to contemporary state-of-the-art approaches while running in real time.
arXiv Detail & Related papers (2022-07-14T14:25:19Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
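The depth-ordered layer representation can be illustrated with standard back-to-front alpha compositing: each object is a layer with a soft mask, and nearer layers occlude farther ones. This is a generic sketch of the representation, not the paper's model.
```python
# Composite per-object layers back-to-front with the "over" operator.
import numpy as np

def composite_layers(layers):
    """layers: list of (rgb, alpha) sorted back-to-front;
    rgb is (H, W, 3), alpha is (H, W) in [0, 1]."""
    h, w, _ = layers[0][0].shape
    canvas = np.zeros((h, w, 3))
    for rgb, alpha in layers:  # nearer layers overwrite where alpha is high
        canvas = alpha[..., None] * rgb + (1.0 - alpha[..., None]) * canvas
    return canvas
```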
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- Visual Object Recognition in Indoor Environments Using Topologically Persistent Features [2.2344764434954256]
Object recognition in unseen indoor environments remains a challenging problem for visual perception of mobile robots.
We propose the use of topologically persistent features, which rely on the objects' shape information, to address this challenge.
We implement the proposed method on a real-world robot to demonstrate its usefulness.
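A sketch of extracting topologically persistent shape features from an object's contour, assuming the ripser library for persistent homology; the exact descriptor and pipeline in the paper may differ.
```python
# Persistence lifetimes of 1-dim features (loops) as a shape descriptor.
import numpy as np
from ripser import ripser  # pip install ripser (assumed dependency)

def persistence_descriptor(boundary_points: np.ndarray) -> np.ndarray:
    """boundary_points: (N, 2) points sampled from an object contour.
    Returns lifetimes of H1 features (loops), sorted longest first."""
    dgms = ripser(boundary_points, maxdim=1)["dgms"]
    births_deaths = dgms[1]                  # H1 persistence diagram
    lifetimes = births_deaths[:, 1] - births_deaths[:, 0]
    return np.sort(lifetimes)[::-1]          # persistent loops = shape cue
```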
arXiv Detail & Related papers (2020-10-07T06:04:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.