Unifying Tracking and Image-Video Object Detection
- URL: http://arxiv.org/abs/2211.11077v2
- Date: Sun, 19 Nov 2023 23:45:09 GMT
- Title: Unifying Tracking and Image-Video Object Detection
- Authors: Peirong Liu, Rui Wang, Pengchuan Zhang, Omid Poursaeed, Yipin Zhou,
Xuefei Cao, Sreya Dutta Roy, Ashish Shah, Ser-Nam Lim
- Abstract summary: TrIVD (Tracking and Image-Video Detection) is the first framework that unifies image OD, video OD, and MOT within one end-to-end model.
To handle the discrepancies and semantic overlaps of category labels, TrIVD formulates detection/tracking as grounding and reasons about object categories.
- Score: 54.91658924277527
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Object detection (OD) has been one of the most fundamental tasks in
computer vision. Recent developments in deep learning have pushed the
performance of image OD to new heights by learning-based, data-driven
approaches. On the other hand, video OD remains less explored, mostly due to
much more expensive data annotation needs. At the same time, multi-object
tracking (MOT), which requires reasoning about track identities and
spatio-temporal trajectories, shares a similar spirit with video OD. However,
most MOT datasets are class-specific (e.g., person-annotated only), which
constrains a model's flexibility to perform tracking on other objects. We
propose TrIVD (Tracking and Image-Video Detection), the first framework that
unifies image OD, video OD, and MOT within one end-to-end model. To handle the
discrepancies and semantic overlaps of category labels across datasets, TrIVD
formulates detection/tracking as grounding and reasons about object categories
via visual-text alignments. The unified formulation enables cross-dataset,
multi-task training, and thus equips TrIVD with the ability to leverage
frame-level features, video-level spatio-temporal relations, as well as track
identity associations. With such joint training, we can now extend the
knowledge from OD data, which comes with much richer object category
annotations, to MOT and achieve zero-shot tracking capability. Experiments
demonstrate that multi-task co-trained TrIVD outperforms single-task baselines
across all image/video OD and MOT tasks. We further set the first baseline on
the new task of zero-shot tracking.
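The idea of reasoning about categories via visual-text alignment can be illustrated with a minimal sketch. This is not the TrIVD implementation: the function names `cosine` and `classify_region` are hypothetical, and the three-dimensional embeddings are hand-written toy values standing in for the outputs of real visual and text encoders. The sketch only shows the general principle that a region is labeled by the category name whose text embedding it aligns with best, so label sets from different datasets can share one vocabulary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify_region(region_embedding, text_embeddings):
    """Score a region against every category name and return the best match.

    text_embeddings maps a category name to its (assumed) text embedding.
    """
    scores = {name: cosine(region_embedding, emb)
              for name, emb in text_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings (assumption): real systems would obtain these from encoders.
text_embeddings = {
    "person": [1.0, 0.0, 0.0],
    "car": [0.0, 1.0, 0.0],
    "dog": [0.0, 0.0, 1.0],
}
region = [0.1, 0.9, 0.2]  # toy visual feature for one detected region

label, scores = classify_region(region, text_embeddings)
print(label)  # car
```

Under this kind of formulation, supporting a category that has no tracking annotations only requires adding its name and text embedding to the vocabulary, which is one way to read the zero-shot tracking claim in the abstract.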
Related papers
- VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking [61.56592503861093]
This issue amalgamates the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT).
Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, predominantly focusing on the problem through an image-centric lens.
We propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video object tracking standpoint.
(2024-10-11T05:01:49Z)
- Single-Shot and Multi-Shot Feature Learning for Multi-Object Tracking [55.13878429987136]
We propose a simple yet effective two-stage feature learning paradigm to jointly learn single-shot and multi-shot features for different targets.
Our method achieves significant improvements on the MOT17 and MOT20 datasets while reaching state-of-the-art performance on the DanceTrack dataset.
(2023-11-17T08:17:49Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
(2023-06-20T17:57:23Z)
- OVTrack: Open-Vocabulary Multiple Object Tracking [64.73379741435255]
OVTrack is an open-vocabulary tracker capable of tracking arbitrary object classes.
It sets a new state-of-the-art on the large-scale, large-vocabulary TAO benchmark.
(2023-04-17T16:20:05Z)
- Unified Transformer Tracker for Object Tracking [58.65901124158068]
We present the Unified Transformer Tracker (UTT) to address tracking problems in different scenarios with one paradigm.
A track transformer is developed in our UTT to track the target in both Single Object Tracking (SOT) and Multiple Object Tracking (MOT).
(2022-03-29T01:38:49Z)
- Probabilistic 3D Multi-Modal, Multi-Object Tracking for Autonomous Driving [22.693895321632507]
We propose a probabilistic, multi-modal, multi-object tracking system consisting of different trainable modules.
We show that our method outperforms the current state of the art on the NuScenes Tracking dataset.
(2020-12-26T15:00:54Z)