Tracking through Containers and Occluders in the Wild
- URL: http://arxiv.org/abs/2305.03052v1
- Date: Thu, 4 May 2023 17:59:58 GMT
- Title: Tracking through Containers and Occluders in the Wild
- Authors: Basile Van Hoorick, Pavel Tokmakov, Simon Stent, Jie Li, Carl Vondrick
- Abstract summary: We introduce $\textbf{TCOW}$, a new benchmark and model for visual tracking through heavy occlusion and containment.
We create a mixture of synthetic and annotated real datasets to support both supervised learning and structured evaluation of model performance.
We evaluate two recent transformer-based video models and find that while they can be surprisingly capable of tracking targets under certain settings of task variation, there remains a considerable performance gap before we can claim a tracking model to have acquired a true notion of object permanence.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tracking objects with persistence in cluttered and dynamic environments
remains a difficult challenge for computer vision systems. In this paper, we
introduce $\textbf{TCOW}$, a new benchmark and model for visual tracking
through heavy occlusion and containment. We set up a task where the goal is,
given a video sequence, to segment both the projected extent of the target object
and the surrounding container or occluder whenever one exists. To study
this task, we create a mixture of synthetic and annotated real datasets to
support both supervised learning and structured evaluation of model performance
under various forms of task variation, such as moving or nested containment. We
evaluate two recent transformer-based video models and find that while they can
be surprisingly capable of tracking targets under certain settings of task
variation, there remains a considerable performance gap before we can claim a
tracking model to have acquired a true notion of object permanence.
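To make the task definition concrete, here is a minimal sketch of the input/output contract described in the abstract; the function name, tensor shapes, and the three-mask output split are our assumptions for illustration, not the paper's actual interface.

```python
import numpy as np

def tcow_track(frames: np.ndarray, query_mask: np.ndarray) -> dict:
    """Stub of the TCOW task interface (names and shapes are hypothetical).

    frames:     (T, H, W, 3) uint8 video clip.
    query_mask: (H, W) bool mask marking the target in the first frame.
    """
    T, H, W, _ = frames.shape
    assert query_mask.shape == (H, W)
    # A real model (e.g. a video transformer) would go here; empty masks
    # are returned only to make the expected output structure concrete.
    return {
        "target":    np.zeros((T, H, W), dtype=bool),  # projected extent of the target
        "container": np.zeros((T, H, W), dtype=bool),  # empty when nothing contains it
        "occluder":  np.zeros((T, H, W), dtype=bool),  # empty when nothing occludes it
    }

frames = np.zeros((8, 64, 64, 3), dtype=np.uint8)
query = np.zeros((64, 64), dtype=bool)
query[20:30, 20:30] = True
print({k: v.shape for k, v in tcow_track(frames, query).items()})
```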
Related papers
- DeTra: A Unified Model for Object Detection and Trajectory Forecasting [68.85128937305697]
Our approach formulates the union of the two tasks as a trajectory refinement problem.
To tackle this unified task, we design a refinement transformer that infers the presence, pose, and multi-modal future behaviors of objects.
In our experiments, we observe that our model outperforms the state of the art on the Argoverse 2 Sensor and Open datasets.
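As a rough illustration of casting detection and forecasting as a single trajectory refinement problem, here is a toy numerical sketch; the function, shapes, and the random linear update standing in for the learned refinement transformer are all our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def refine_trajectories(features, trajectories, num_iters=3, num_modes=6):
    """Toy stand-in for iterative trajectory refinement.

    features:     (N, D) per-object scene features (stand-in for sensor input).
    trajectories: (N, T, 2) initial x/y waypoints per object.
    """
    N, T, _ = trajectories.shape
    W = rng.normal(scale=0.01, size=(features.shape[1], T * 2))  # toy "learned" update
    for _ in range(num_iters):
        delta = (features @ W).reshape(N, T, 2)
        trajectories = trajectories + delta        # refine poses/waypoints each iteration
    presence = 1.0 / (1.0 + np.exp(-features.mean(axis=1)))  # toy presence score
    # Small offsets around the refined trajectory stand in for multi-modal futures.
    modes = trajectories[:, None] + rng.normal(scale=0.1, size=(N, num_modes, T, 2))
    return presence, trajectories, modes

presence, trajs, modes = refine_trajectories(
    features=rng.normal(size=(4, 16)), trajectories=np.zeros((4, 10, 2)))
print(presence.shape, trajs.shape, modes.shape)  # (4,) (4, 10, 2) (4, 6, 10, 2)
```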
arXiv Detail & Related papers (2024-06-06T18:12:04Z)
- Out of Sight, Still in Mind: Reasoning and Planning about Unobserved Objects with Video Tracking Enabled Memory Models [11.126673648719345]
We investigate the problem of encoding object-oriented memory into a multi-object manipulation reasoning framework.
We propose LOOM, which leverages transformer dynamics to encode the history of trajectories given partial-view point clouds.
Our approach can perform multiple tasks, including reasoning about the appearance of occluded novel objects and about object reappearance.
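Below is a minimal sketch of what object-oriented trajectory memory might look like; the class and method names are ours, not LOOM's, and a real system would encode the stored point clouds with a transformer rather than keep them raw.

```python
from collections import defaultdict
import numpy as np

class ObjectTrajectoryMemory:
    """Toy per-object memory of partial-view observations over time."""

    def __init__(self):
        self.history = defaultdict(list)   # object_id -> [(t, points), ...]
        self.last_seen = {}

    def observe(self, t, object_id, points):
        """points: (N, 3) partial-view point cloud for this object at time t."""
        self.history[object_id].append((t, points))
        self.last_seen[object_id] = t

    def unobserved_objects(self, t, horizon=5):
        """Objects not seen for `horizon` steps: candidates for occlusion reasoning."""
        return [oid for oid, seen in self.last_seen.items() if t - seen >= horizon]

mem = ObjectTrajectoryMemory()
mem.observe(0, "mug", np.random.rand(128, 3))
print(mem.unobserved_objects(t=10))  # ['mug']
```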
arXiv Detail & Related papers (2023-09-26T21:31:24Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
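To illustrate what unifying spatial and temporal localization with captioning could yield per object, here is a hypothetical output record; the field names and types are our assumptions, not the paper's schema.

```python
from dataclasses import dataclass

@dataclass
class CaptionedObjectTube:
    """Illustrative output record for dense video object captioning."""
    object_id: int
    boxes: dict          # spatial: frame_index -> (x1, y1, x2, y2)
    start_frame: int     # temporal: first frame of the tube
    end_frame: int       # temporal: last frame of the tube
    caption: str         # natural-language description of the object

tube = CaptionedObjectTube(
    object_id=0,
    boxes={0: (10.0, 12.0, 50.0, 60.0), 1: (12.0, 13.0, 52.0, 61.0)},
    start_frame=0,
    end_frame=1,
    caption="a brown dog running across the lawn",
)
print(tube.caption, len(tube.boxes))
```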
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time.
Our aim in this paper is to move beyond tracking-by-detection approaches, towards class-agnostic tracking that also performs well for unknown object classes.
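Below is a toy sketch of one step of query-based tracking in this spirit; the single attention step, the linear heads, and the deliberate absence of a class head (which is what makes it class-agnostic) are our simplifications, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def track_step(frame_features, track_queries, W_box, W_q):
    """One toy step of query-based, class-agnostic tracking.

    frame_features: (P, D) per-patch features of the current frame.
    track_queries:  (Q, D) one query per potential track, carried over time.
    """
    attn = track_queries @ frame_features.T                 # (Q, P) attention scores
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)                 # softmax over patches
    updated = attn @ frame_features                         # (Q, D) updated queries
    boxes = updated @ W_box                                 # (Q, 4) per-track boxes
    next_queries = updated @ W_q                            # propagate track identity
    return boxes, next_queries

D = 32
W_box, W_q = rng.normal(size=(D, 4)), rng.normal(size=(D, D))
queries = rng.normal(size=(8, D))        # 8 track slots, carried across frames
for _ in range(3):                       # three video frames
    frame = rng.normal(size=(100, D))    # 100 patch features per frame
    boxes, queries = track_step(frame, queries, W_box, W_q)
print(boxes.shape)  # (8, 4)
```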
arXiv Detail & Related papers (2022-10-26T10:19:37Z)
- SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models [30.313085784715575]
We introduce SlotFormer -- a Transformer-based autoregressive model on learned object-temporal representations.
In this paper, we successfully apply SlotFormer to perform prediction on datasets with complex object interactions.
We also show its ability to serve as a world model for model-based planning, which is competitive with methods designed specifically for such tasks.
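A minimal sketch of autoregressive rollout over object slots, in the spirit of SlotFormer, is shown below; the function signature and the trivial stub dynamics are our assumptions, with a Transformer taking the place of the stub in the real model.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(slot_history, dynamics, steps):
    """Toy autoregressive rollout over object slots.

    slot_history: (T, K, D) slots for T observed frames (K objects, D dims),
                  e.g. from a pretrained object-centric encoder.
    dynamics:     callable mapping the (T', K, D) history to the next
                  frame's (K, D) slots; a Transformer in SlotFormer.
    """
    history = list(slot_history)
    predictions = []
    for _ in range(steps):
        next_slots = dynamics(np.stack(history))   # predict next-frame slots
        predictions.append(next_slots)
        history.append(next_slots)                 # feed the prediction back in
    return np.stack(predictions)                   # (steps, K, D)

# Stub dynamics: persist the last slots (a trivial "objects stay put" model).
preds = rollout(rng.normal(size=(6, 5, 32)), lambda h: h[-1], steps=4)
print(preds.shape)  # (4, 5, 32)
```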
arXiv Detail & Related papers (2022-10-12T01:53:58Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
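To illustrate how a depth-ordered layer representation explains occlusion between objects, here is a toy back-to-front compositing sketch; the function and the 0.5 threshold are our assumptions, not the paper's model.

```python
import numpy as np

def composite_layers(masks, depth_order):
    """Composite per-object layer masks into a single label map.

    masks:       (L, H, W) soft masks, one per object layer.
    depth_order: layer indices sorted front to back.
    """
    L, H, W = masks.shape
    canvas = np.zeros((H, W), dtype=int)           # 0 = background
    for layer in reversed(depth_order):            # paint back to front
        canvas[masks[layer] > 0.5] = layer + 1     # nearer layers overwrite farther ones
    return canvas

masks = np.zeros((2, 4, 4))
masks[0, :, :2] = 1.0   # front layer occupies the left half
masks[1, :, 1:] = 1.0   # back layer occupies the right three columns
print(composite_layers(masks, depth_order=[0, 1]))  # front layer wins where they overlap
```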
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- Learning to Track with Object Permanence [61.36492084090744]
We introduce an end-to-end trainable approach for joint object detection and tracking.
Our model, trained jointly on synthetic and real data, outperforms the state of the art on the KITTI and MOT17 datasets.
arXiv Detail & Related papers (2021-03-26T04:43:04Z)
- Benchmarking Unsupervised Object Representations for Video Sequences [111.81492107649889]
We compare the perceptual abilities of four object-centric approaches: ViMON, OP3, TBA and SCALOR.
Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking.
Our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
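For reference, a per-object mask IoU of the kind such benchmarks aggregate over frames and objects can be computed as below; this is a generic metric sketch, not the benchmark's exact protocol.

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-union between two boolean segmentation masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0  # two empty masks match perfectly

pred = np.zeros((4, 4), dtype=bool); pred[:2, :2] = True
gt   = np.zeros((4, 4), dtype=bool); gt[:2, :]   = True
print(round(mask_iou(pred, gt), 3))  # 0.5
```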
arXiv Detail & Related papers (2020-06-12T09:37:24Z)