TDT: Teaching Detectors to Track without Fully Annotated Videos
- URL: http://arxiv.org/abs/2205.05583v1
- Date: Wed, 11 May 2022 15:56:17 GMT
- Title: TDT: Teaching Detectors to Track without Fully Annotated Videos
- Authors: Shuzhi Yu, Guanhang Wu, Chunhui Gu, Mohammed E. Fathy
- Abstract summary: One-stage trackers that predict both detections and appearance embeddings in a single forward pass have received much attention.
Our proposed one-stage solution matches the two-stage counterpart in quality but is 3 times faster.
- Score: 2.8292841621378844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, one-stage trackers that use a joint model to predict both detections and appearance embeddings in a single forward pass have received much attention and achieved state-of-the-art results on Multi-Object Tracking (MOT) benchmarks. However, their success depends on the availability of videos fully annotated with tracking data, which are expensive and hard to obtain; this can limit model generalization. In comparison, the two-stage approach, which performs detection and embedding separately, is slower but easier to train, as its data are easier to annotate. We propose to combine the best of the two worlds through a data-distillation approach. Specifically, we use a teacher embedder, trained on Re-ID datasets, to generate pseudo appearance-embedding labels for detection datasets. We then use the augmented dataset to train a detector that is also capable of regressing these pseudo-embeddings in a fully convolutional fashion. Our proposed one-stage solution matches its two-stage counterpart in quality but is 3 times faster. Even though the teacher embedder has not seen any tracking data during training, our tracker achieves performance competitive with popular trackers (e.g. JDE) trained on fully labeled tracking data.
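Below is a minimal PyTorch-style sketch of the distillation recipe described in the abstract: a frozen Re-ID teacher embeds ground-truth box crops to produce pseudo-labels, and the one-stage student regresses those embeddings at box centers alongside its usual detection loss. The module interfaces (`teacher`, `student`, `detection_loss`) and the loss weighting are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align


@torch.no_grad()
def pseudo_embedding_labels(teacher, image, boxes, crop_size=(128, 64)):
    """Crop each ground-truth box and embed it with the frozen Re-ID teacher.

    image: (3, H, W) float tensor; boxes: (N, 4) float tensor of (x1, y1, x2, y2).
    Returns (N, D) unit-norm pseudo appearance-embedding labels.
    """
    crops = roi_align(image.unsqueeze(0), [boxes], output_size=crop_size)
    return F.normalize(teacher(crops), dim=1)


def embedding_distillation_loss(emb_map, boxes, pseudo_emb, stride=4):
    """Sample the student's dense embedding map at box centers and pull the
    sampled vectors toward the teacher's pseudo-labels (cosine distance).

    emb_map: (D, H/stride, W/stride) per-image embedding map from the student.
    """
    cx = ((boxes[:, 0] + boxes[:, 2]) / 2 / stride).long().clamp(0, emb_map.shape[2] - 1)
    cy = ((boxes[:, 1] + boxes[:, 3]) / 2 / stride).long().clamp(0, emb_map.shape[1] - 1)
    pred = F.normalize(emb_map[:, cy, cx].t(), dim=1)  # (N, D) sampled embeddings
    return (1.0 - (pred * pseudo_emb).sum(dim=1)).mean()


def train_step(student, teacher, image, boxes, labels, lam=1.0):
    """One training step: the usual detection loss plus the distillation term,
    weighted by a hypothetical lambda. `student` is assumed to return
    (detections, embedding map) and to expose a `detection_loss` helper."""
    det_out, emb_map = student(image.unsqueeze(0))
    pseudo = pseudo_embedding_labels(teacher, image, boxes)
    loss = student.detection_loss(det_out, boxes, labels)
    return loss + lam * embedding_distillation_loss(emb_map[0], boxes, pseudo)
```

Because the teacher is frozen and only sees still-image crops, this recipe needs detection boxes but no track annotations, which is the point of the paper.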
Related papers
- Multi-object Tracking by Detection and Query: an efficient end-to-end manner [23.926668750263488]
Multi-object tracking is advancing through two dominant paradigms: traditional tracking by detection and newly emerging tracking by query.
We propose a tracking-by-detection-and-query paradigm, realized by a Learnable Associator (LAID).
Compared to tracking-by-query models, LAID achieves competitive tracking accuracy with notably higher training efficiency.
arXiv Detail & Related papers (2024-11-09T14:38:08Z)
- CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos [63.90674869153876]
We introduce CoTracker3, comprising a new tracking model and a new semi-supervised training recipe.
This allows real videos without annotations to be used during training by generating pseudo-labels using off-the-shelf teachers.
The model is available in online and offline variants and reliably tracks visible and occluded points.
arXiv Detail & Related papers (2024-10-15T17:56:32Z)
- Temporal Correlation Meets Embedding: Towards a 2nd Generation of JDE-based Real-Time Multi-Object Tracking [52.04679257903805]
Joint Detection and Embedding (JDE) trackers have demonstrated excellent performance in Multi-Object Tracking (MOT) tasks.
Our tracker, named TCBTrack, achieves state-of-the-art performance on multiple public benchmarks.
arXiv Detail & Related papers (2024-07-19T07:48:45Z)
- Bridging the Gap Between End-to-end and Non-End-to-end Multi-Object Tracking [27.74953961900086]
Existing end-to-end Multi-Object Tracking (e2e-MOT) methods have not surpassed non-end-to-end tracking-by-detection methods.
We present Co-MOT, a simple and effective method to facilitate e2e-MOT by a novel coopetition label assignment with a shadow concept.
arXiv Detail & Related papers (2023-05-22T05:18:34Z)
- Unifying Tracking and Image-Video Object Detection [54.91658924277527]
TrIVD (Tracking and Image-Video Detection) is the first framework that unifies image OD, video OD, and MOT within one end-to-end model.
To handle the discrepancies and semantic overlaps of category labels, TrIVD formulates detection/tracking as grounding and reasons about object categories.
arXiv Detail & Related papers (2022-11-20T20:30:28Z)
- Tracking Every Thing in the Wild [61.917043381836656]
We introduce a new metric, Track Every Thing Accuracy (TETA), breaking tracking measurement into three sub-factors: localization, association, and classification.
Our experiments show that TETA evaluates trackers more comprehensively, and TETer achieves significant improvements on the challenging large-scale datasets BDD100K and TAO.
arXiv Detail & Related papers (2022-07-26T15:37:19Z)
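Purely as a schematic of the TETA decomposition named above, the sketch below combines three sub-scores (each assumed to lie in [0, 1]) by simple averaging; the official metric derives each sub-factor from matched-track statistics rather than taking it as an input, so treat this as an illustration, not the paper's formula.

```python
def teta_like_score(localization: float, association: float, classification: float) -> float:
    """Schematic combination of the three sub-factors named in the TETA
    abstract. Each argument is assumed to be an accuracy in [0, 1]; the
    official metric computes these from matching statistics."""
    assert all(0.0 <= s <= 1.0 for s in (localization, association, classification))
    return (localization + association + classification) / 3.0


print(teta_like_score(0.7, 0.6, 0.5))  # 0.6
```

Reporting the three sub-scores separately is what lets such a metric say *why* a tracker fails (poor boxes vs. broken identities vs. wrong classes), rather than collapsing everything into one number.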
- Cannot See the Forest for the Trees: Aggregating Multiple Viewpoints to Better Classify Objects in Videos [36.28269135795851]
We present a set classifier that improves tracklet classification accuracy by aggregating information from the multiple viewpoints contained in a tracklet.
By simply attaching our method to QDTrack on top of ResNet-101, we achieve the new state-of-the-art, 19.9% and 15.7% TrackAP_50 on TAO validation and test sets.
arXiv Detail & Related papers (2022-06-05T07:51:58Z)
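As a stand-in for the learned set classifier above, the sketch below mean-pools per-frame class probabilities across a tracklet before taking the argmax; the paper's aggregator is a trained model, so mean-pooling is only an illustrative baseline.

```python
import torch


def classify_tracklet(per_frame_logits: torch.Tensor) -> int:
    """Classify one tracklet from the (T, C) logits of an image classifier
    applied to its T detections. Mean-pooling the per-frame probabilities is a
    simple stand-in for the learned set classifier described in the paper."""
    probs = per_frame_logits.softmax(dim=1)   # (T, C) per-viewpoint class probs
    return int(probs.mean(dim=0).argmax())    # aggregate votes across viewpoints
```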
- Unified Transformer Tracker for Object Tracking [58.65901124158068]
We present the Unified Transformer Tracker (UTT) to address tracking problems in different scenarios with one paradigm.
A track transformer is developed in our UTT to track the target in both Single Object Tracking (SOT) and Multiple Object Tracking (MOT).
arXiv Detail & Related papers (2022-03-29T01:38:49Z)
- Benchmarking Deep Trackers on Aerial Videos [5.414308305392762]
In this paper, we compare ten deep-learning-based trackers on four aerial datasets.
We choose top-performing trackers utilizing different approaches: tracking by detection, discriminative correlation filters, Siamese networks, and reinforcement learning.
Our findings indicate that the trackers perform significantly worse on aerial datasets than on standard ground-level videos.
arXiv Detail & Related papers (2021-03-24T01:45:19Z)
- DEFT: Detection Embeddings for Tracking [3.326320568999945]
We propose an efficient joint detection and tracking model named DEFT.
Our approach relies on an appearance-based object matching network jointly learned with an underlying object detection network.
DEFT has comparable accuracy and speed to the top methods on 2D online tracking leaderboards.
arXiv Detail & Related papers (2021-02-03T20:00:44Z)
- Self-Supervised Person Detection in 2D Range Data using a Calibrated Camera [83.31666463259849]
We propose a method to automatically generate training labels (called pseudo-labels) for 2D LiDAR-based person detectors.
We show that self-supervised detectors, trained or fine-tuned with pseudo-labels, outperform detectors trained using manual annotations.
Our method is an effective way to improve person detectors during deployment without any additional labeling effort.
arXiv Detail & Related papers (2020-12-16T12:10:04Z)
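The pipeline in that last paper labels LiDAR scan points using detections from a calibrated camera. The sketch below is a simplified, hypothetical version of that idea (function and parameter names are assumptions): a 2D scan point is marked "person" when its image projection falls inside a detected person box. The paper's actual method additionally clusters and filters the resulting labels.

```python
import numpy as np


def lidar_person_pseudo_labels(scan_xy, T_cam_from_laser, K, person_boxes):
    """Label 2D LiDAR points as 'person' when their image projection lands
    inside a camera person detection.

    scan_xy: (N, 2) points in the laser frame (z assumed 0 at sensor height);
    T_cam_from_laser: 4x4 extrinsic transform; K: 3x3 camera intrinsics;
    person_boxes: (M, 4) boxes as (x1, y1, x2, y2) in pixels.
    Returns an (N,) boolean pseudo-label array.
    """
    n = len(scan_xy)
    pts = np.c_[scan_xy, np.zeros(n), np.ones(n)]   # homogeneous laser points
    cam = (T_cam_from_laser @ pts.T)[:3]            # points in the camera frame
    valid = cam[2] > 0.1                            # keep points in front of camera
    uvw = K @ cam
    u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]         # pixel coordinates
    labels = np.zeros(n, dtype=bool)
    for x1, y1, x2, y2 in person_boxes:
        labels |= valid & (u >= x1) & (u <= x2) & (v >= y1) & (v <= y2)
    return labels
```

Like the TDT recipe above, this trades manual annotation for labels produced by an already-trained model in another modality, which is why it can keep improving a deployed detector at no labeling cost.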