OVTrack: Open-Vocabulary Multiple Object Tracking
- URL: http://arxiv.org/abs/2304.08408v1
- Date: Mon, 17 Apr 2023 16:20:05 GMT
- Title: OVTrack: Open-Vocabulary Multiple Object Tracking
- Authors: Siyuan Li, Tobias Fischer, Lei Ke, Henghui Ding, Martin Danelljan,
Fisher Yu
- Abstract summary: OVTrack is an open-vocabulary tracker capable of tracking arbitrary object classes.
It sets a new state-of-the-art on the large-scale, large-vocabulary TAO benchmark.
- Score: 64.73379741435255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to recognize, localize and track dynamic objects in a scene is
fundamental to many real-world applications, such as self-driving and robotic
systems. Yet, traditional multiple object tracking (MOT) benchmarks rely only
on a few object categories that hardly represent the multitude of possible
objects that are encountered in the real world. This leaves contemporary MOT
methods limited to a small set of pre-defined object categories. In this paper,
we address this limitation by tackling a novel task, open-vocabulary MOT, that
aims to evaluate tracking beyond pre-defined training categories. We further
develop OVTrack, an open-vocabulary tracker that is capable of tracking
arbitrary object classes. Its design is based on two key ingredients: First,
leveraging vision-language models for both classification and association via
knowledge distillation; second, a data hallucination strategy for robust
appearance feature learning from denoising diffusion probabilistic models. The
result is an extremely data-efficient open-vocabulary tracker that sets a new
state-of-the-art on the large-scale, large-vocabulary TAO benchmark, while
being trained solely on static images. Project page:
https://www.vis.xyz/pub/ovtrack/
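The first ingredient above, leveraging a vision-language model for classification, amounts to comparing region embeddings against text embeddings of arbitrary class prompts rather than a fixed classifier head. The sketch below is a hypothetical, minimal illustration of that idea (not OVTrack's actual code); the function name and toy 2-D embeddings are invented for demonstration:

```python
def classify_open_vocab(region_embs, text_embs, class_names):
    """Label each region with the class whose text embedding is most
    similar (cosine similarity), so novel classes need only a new prompt."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)

    labels = []
    for r in region_embs:
        sims = [cos(r, t) for t in text_embs]
        labels.append(class_names[sims.index(max(sims))])
    return labels

# Toy 2-D "embedding space": two regions, three arbitrary class prompts
regions = [[1.0, 0.1], [0.0, 1.0]]
prompts = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(classify_open_vocab(regions, prompts, ["cat", "unicycle", "drone"]))
# -> ['cat', 'unicycle']
```

Because the class set lives only in the text prompts, extending the tracker to a new category requires no retraining, only a new prompt embedding.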
Related papers
- VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking [61.56592503861093]
The task combines the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT).
Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, predominantly focusing on the problem through an image-centric lens.
We propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video object tracking standpoint.
arXiv Detail & Related papers (2024-10-11T05:01:49Z)
- Open3DTrack: Towards Open-Vocabulary 3D Multi-Object Tracking [73.05477052645885]
We introduce open-vocabulary 3D tracking, which extends the scope of 3D tracking to include objects beyond predefined categories.
We propose a novel approach that integrates open-vocabulary capabilities into a 3D tracking framework, allowing for generalization to unseen object classes.
arXiv Detail & Related papers (2024-10-02T15:48:42Z)
- Multi-Granularity Language-Guided Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity.
At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions.
Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z)
- Siamese-DETR for Generic Multi-Object Tracking [16.853363984562602]
Traditional Multi-Object Tracking (MOT) is limited to tracking objects belonging to the pre-defined closed-set categories.
Siamese-DETR is proposed to track objects beyond pre-defined categories, given a text prompt and a template image.
Siamese-DETR surpasses existing MOT methods on GMOT-40 dataset by a large margin.
arXiv Detail & Related papers (2023-10-27T03:32:05Z)
- Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models [28.304047711166056]
Large-scale pre-trained models have shown promising advances in detecting and segmenting objects in 2D static images in the wild.
This raises the question: can we re-purpose these large-scale pre-trained static image models for open-vocabulary video tracking?
In this paper, we re-purpose an open-vocabulary detector, segmenter, and dense optical flow estimator, into a model that tracks and segments objects of any category in 2D videos.
arXiv Detail & Related papers (2023-10-10T20:25:30Z)
- Unifying Tracking and Image-Video Object Detection [54.91658924277527]
TrIVD (Tracking and Image-Video Detection) is the first framework that unifies image OD, video OD, and MOT within one end-to-end model.
To handle the discrepancies and semantic overlaps of category labels, TrIVD formulates detection/tracking as grounding and reasons about object categories.
arXiv Detail & Related papers (2022-11-20T20:30:28Z)
- TAO: A Large-Scale Benchmark for Tracking Any Object [95.87310116010185]
The Tracking Any Object (TAO) dataset consists of 2,907 high-resolution videos captured in diverse environments, averaging half a minute in length.
We ask annotators to label objects that move at any point in the video, and give names to them post factum.
Our vocabulary is both significantly larger and qualitatively different from existing tracking datasets.
arXiv Detail & Related papers (2020-05-20T21:07:28Z)
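Several of the trackers above, including OVTrack's association via distilled appearance features, ultimately match detections to existing tracks by embedding similarity. The following is a hypothetical, simplified sketch of such an association step (greedy cosine-similarity matching with a threshold); the function name and threshold value are assumptions for illustration, not any paper's actual algorithm:

```python
def associate(track_embs, det_embs, thresh=0.5):
    """Greedily match each track to its most similar unmatched detection,
    if that similarity exceeds thresh; unmatched detections spawn new tracks."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)

    matches, used = {}, set()
    for ti, t in enumerate(track_embs):
        best, best_sim = None, thresh
        for di, d in enumerate(det_embs):
            if di in used:
                continue
            s = cos(t, d)
            if s > best_sim:
                best, best_sim = di, s
        if best is not None:
            matches[ti] = best
            used.add(best)
    return matches

# Two tracks, two detections in a toy 2-D embedding space
print(associate([[1.0, 0.0], [0.0, 1.0]], [[0.1, 1.0], [0.9, 0.2]]))
# -> {0: 1, 1: 0}
```

Real systems typically replace the greedy loop with optimal bipartite matching (e.g. the Hungarian algorithm) and mix in motion cues, but the appearance-similarity core is the same.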
This list is automatically generated from the titles and abstracts of the papers in this site.