Multi-Object Tracking with Hallucinated and Unlabeled Videos
- URL: http://arxiv.org/abs/2108.08836v1
- Date: Thu, 19 Aug 2021 17:57:29 GMT
- Title: Multi-Object Tracking with Hallucinated and Unlabeled Videos
- Authors: Daniel McKee, Bing Shuai, Andrew Berneshawi, Manchen Wang, Davide
Modolo, Svetlana Lazebnik, Joseph Tighe
- Abstract summary: In place of tracking annotations, we first hallucinate videos from images with bounding box annotations using zoom-in/out motion transformations.
We then mine hard examples across an unlabeled pool of real videos with a tracker trained on our hallucinated video data.
Our weakly supervised tracker achieves state-of-the-art performance on the MOT17 and TAO-person datasets.
- Score: 34.38275236770619
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we explore learning end-to-end deep neural trackers without
tracking annotations. This matters because large-scale training data is
essential for training deep neural trackers, yet tracking annotations are
expensive to acquire. In place of tracking annotations, we first hallucinate
videos from images with bounding box annotations using zoom-in/out motion
transformations to obtain free tracking labels. We add video simulation
augmentations to create a diverse tracking dataset, albeit with simple motion.
Next, to tackle harder tracking cases, we mine hard examples across an
unlabeled pool of real videos with a tracker trained on our hallucinated video
data. For hard example mining, we propose an optimization-based connecting
process to first identify and then rectify hard examples from the pool of
unlabeled videos. Finally, we train our tracker jointly on hallucinated data
and mined hard video examples. Our weakly supervised tracker achieves
state-of-the-art performance on the MOT17 and TAO-person datasets. On MOT17, we
further demonstrate that the combination of our self-generated data and the
existing manually-annotated data leads to additional improvements.
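For intuition, the zoom-in/out hallucination can be sketched in a few lines. This is a minimal illustration under assumed conventions (center zoom, axis-aligned [x1, y1, x2, y2] boxes, helper names invented here), not the paper's actual pipeline, which adds further video simulation augmentations:

```python
import numpy as np

def hallucinate_zoom_track(image_hw, boxes, num_frames=8, max_zoom=1.5):
    """Hallucinate a zoom-in 'video' from one annotated still image.

    image_hw: (H, W) of the source image.
    boxes:    (N, 4) array of [x1, y1, x2, y2] box annotations.
    Returns, per frame, the crop window and the boxes remapped into
    the crop's coordinate frame -- free tracking labels, since row i
    keeps the same identity in every frame by construction.
    """
    H, W = image_hw
    cx, cy = W / 2.0, H / 2.0              # zoom about the image center
    zooms = np.linspace(1.0, max_zoom, num_frames)

    frames = []
    for z in zooms:
        cw, ch = W / z, H / z              # crop shrinks as zoom grows
        x0, y0 = cx - cw / 2, cy - ch / 2  # top-left of the crop window
        scale = z                          # crop is resized back to W x H

        # Shift boxes into crop coordinates, rescale, and clip to the frame.
        remapped = (boxes - np.array([x0, y0, x0, y0])) * scale
        remapped = np.clip(remapped, 0, [W, H, W, H])

        frames.append({"crop": (x0, y0, cw, ch),
                       "boxes": remapped,
                       "ids": np.arange(len(boxes))})
    return frames
```

Playing the zoom schedule in reverse yields the zoom-out case; either way, the box identities are known by construction, which is what makes the tracking labels free.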
Related papers
- CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos [63.90674869153876]
We introduce CoTracker3, comprising a new tracking model and a new semi-supervised training recipe.
This allows real videos without annotations to be used during training by generating pseudo-labels using off-the-shelf teachers.
The model is available in online and offline variants and reliably tracks visible and occluded points.
arXiv Detail & Related papers (2024-10-15T17:56:32Z)
- Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Appearance Graphs [117.67620297750685]
We introduce Walker, the first self-supervised tracker that learns from videos with sparse bounding box annotations and no tracking labels.
Walker is the first self-supervised tracker to achieve competitive performance on MOT17, DanceTrack, and BDD100K.
arXiv Detail & Related papers (2024-09-25T18:00:00Z)
- Accelerated Video Annotation driven by Deep Detector and Tracker [12.640283469603355]
Annotating object ground truth in videos is vital for several downstream tasks in robot perception and machine learning.
Accurate annotation of moving-object instances on every image frame of a video is crucially important.
We propose a new annotation method which leverages a combination of a learning-based detector and a learning-based tracker.
arXiv Detail & Related papers (2023-02-19T15:16:05Z)
- TDT: Teaching Detectors to Track without Fully Annotated Videos [2.8292841621378844]
One-stage trackers that predict both detections and appearance embeddings in a single forward pass have received much attention.
Our proposed one-stage solution matches the two-stage counterpart in quality but is 3 times faster.
arXiv Detail & Related papers (2022-05-11T15:56:17Z)
- MOTSynth: How Can Synthetic Data Help Pedestrian Detection and Tracking? [36.094861549144426]
Deep learning methods for video pedestrian detection and tracking require large volumes of training data to achieve good performance.
We generate MOTSynth, a large, highly diverse synthetic dataset for object detection and tracking, rendered with a game engine.
Our experiments show that MOTSynth can be used as a replacement for real data on tasks such as pedestrian detection, re-identification, segmentation, and tracking.
arXiv Detail & Related papers (2021-08-21T14:25:25Z)
- Learning to Track Instances without Video Annotations [85.9865889886669]
We introduce a novel semi-supervised framework by learning instance tracking networks with only a labeled image dataset and unlabeled video sequences.
We show that even when only trained with images, the learned feature representation is robust to instance appearance variations.
In addition, we integrate this module into single-stage instance segmentation and pose estimation frameworks.
arXiv Detail & Related papers (2021-04-01T06:47:41Z)
- Unsupervised Deep Representation Learning for Real-Time Tracking [137.69689503237893]
We propose an unsupervised learning method for visual tracking.
The motivating idea is that a robust tracker should be consistent under bidirectional tracking: tracking forward through a clip and then backward should return it to the starting point (see the sketch after this list).
We build our framework on a Siamese correlation filter network and propose a multi-frame validation scheme and a cost-sensitive loss to facilitate unsupervised learning.
arXiv Detail & Related papers (2020-07-22T08:23:12Z)
- Labelling unlabelled videos from scratch with multi-modal self-supervision [82.60652426371936]
Unsupervised labelling of a video dataset does not come for free from strong feature encoders.
We propose a novel clustering method that allows pseudo-labelling of a video dataset without any human annotations.
An extensive analysis shows that the resulting clusters have high semantic overlap with ground-truth human labels.
arXiv Detail & Related papers (2020-06-24T12:28:17Z)
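As an aside on the bidirectional-tracking idea in the Unsupervised Deep Representation Learning entry above, the consistency objective can be sketched as a forward-backward drift penalty. This is an illustrative sketch, not the paper's method: track_fn is a hypothetical black-box single-step tracker (the paper uses a Siamese correlation filter network with a cost-sensitive loss):

```python
import numpy as np

def forward_backward_loss(track_fn, frames, init_box):
    """Track a box forward through a clip, then backward, and penalize
    drift from the starting box. `track_fn(frame_a, frame_b, box)` is an
    assumed single-step tracker returning the box's location in frame_b.
    """
    # Forward pass through the clip.
    box = init_box
    for prev_f, next_f in zip(frames[:-1], frames[1:]):
        box = track_fn(prev_f, next_f, box)

    # Backward pass: retrace the clip in reverse order.
    rev = frames[::-1]
    for prev_f, next_f in zip(rev[:-1], rev[1:]):
        box = track_fn(prev_f, next_f, box)

    # A reliable tracker should return close to where it started.
    return float(np.sum((np.asarray(box) - np.asarray(init_box)) ** 2))
```

Minimizing this drift over unlabeled clips supplies a training signal without any tracking annotations.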