Multi-Object Tracking with Hallucinated and Unlabeled Videos
- URL: http://arxiv.org/abs/2108.08836v1
- Date: Thu, 19 Aug 2021 17:57:29 GMT
- Title: Multi-Object Tracking with Hallucinated and Unlabeled Videos
- Authors: Daniel McKee, Bing Shuai, Andrew Berneshawi, Manchen Wang, Davide
Modolo, Svetlana Lazebnik, Joseph Tighe
- Abstract summary: In place of tracking annotations, we first hallucinate videos with bounding box annotations using zoom-in/out motion transformations.
We then mine hard examples across an unlabeled pool of real videos with a tracker trained on our hallucinated video data.
Our weakly supervised tracker achieves state-of-the-art performance on the MOT17 and TAO-person datasets.
- Score: 34.38275236770619
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we explore learning end-to-end deep neural trackers without
tracking annotations. This is important as large-scale training data is
essential for training deep neural trackers while tracking annotations are
expensive to acquire. In place of tracking annotations, we first hallucinate
videos from images with bounding box annotations using zoom-in/out motion
transformations to obtain free tracking labels. We add video simulation
augmentations to create a diverse tracking dataset, albeit with simple motion.
Next, to tackle harder tracking cases, we mine hard examples across an
unlabeled pool of real videos with a tracker trained on our hallucinated video
data. For hard example mining, we propose an optimization-based connecting
process to first identify and then rectify hard examples from the pool of
unlabeled videos. Finally, we train our tracker jointly on hallucinated data
and mined hard video examples. Our weakly supervised tracker achieves
state-of-the-art performance on the MOT17 and TAO-person datasets. On MOT17, we
further demonstrate that the combination of our self-generated data and the
existing manually-annotated data leads to additional improvements.
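The abstract's first step, hallucinating videos from still images via zoom-in/out motion, can be illustrated with a minimal sketch. This is not the authors' code: the function name, the linear zoom schedule, and the center-of-image zoom point are all assumptions made for illustration. The idea is that a zoom about a fixed point is an affine transform, so a box annotated on one image can be mapped through the per-frame transform to produce free tracking labels (the same object identity across all hallucinated frames).

```python
# Illustrative sketch (not the paper's implementation): hallucinate a short
# "video" from one annotated image by simulating zoom-in camera motion.
# Each frame corresponds to a tighter center crop rescaled to full
# resolution, i.e. a scaling by factor z about the image center; the
# annotated box is mapped through the same transform to get a free track.

def zoom_in_track(img_w, img_h, box, num_frames=5, max_zoom=1.5):
    """Simulate zoom-in motion for one annotated object.

    box: (x1, y1, x2, y2) in pixel coordinates of the source image.
    Returns a list of per-frame boxes in the (img_w, img_h) output frame,
    all sharing one implicit track identity.
    """
    cx, cy = img_w / 2.0, img_h / 2.0
    frames = []
    for t in range(num_frames):
        # Zoom factor grows linearly from 1.0 to max_zoom (an assumed schedule).
        z = 1.0 + (max_zoom - 1.0) * t / max(num_frames - 1, 1)
        # Scaling by z about the center (cx, cy): p' = (p - c) * z + c.
        x1, y1, x2, y2 = box
        fx1 = (x1 - cx) * z + cx
        fy1 = (y1 - cy) * z + cy
        fx2 = (x2 - cx) * z + cx
        fy2 = (y2 - cy) * z + cy
        # Clip to the visible frame; skip frames where the box vanishes.
        fx1, fy1 = max(fx1, 0.0), max(fy1, 0.0)
        fx2, fy2 = min(fx2, float(img_w)), min(fy2, float(img_h))
        if fx2 > fx1 and fy2 > fy1:
            frames.append((fx1, fy1, fx2, fy2))
    return frames

# One object tracked through 5 hallucinated frames of zoom-in motion.
track = zoom_in_track(640, 480, (300, 220, 340, 260))
print(len(track), track[0], track[-1])
```

A zoom-out track is the same construction run with a shrinking factor, and the paper additionally layers simulation augmentations on top to diversify the hallucinated data.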
Related papers
- Accelerated Video Annotation driven by Deep Detector and Tracker [12.640283469603355]
Annotating object ground truth in videos is vital for several downstream tasks in robot perception and machine learning.
The accuracy of the annotated instances of the moving objects on every image frame in a video is crucially important.
We propose a new annotation method which leverages a combination of a learning-based detector and a learning-based tracker.
arXiv Detail & Related papers (2023-02-19T15:16:05Z)
- Weakly-Supervised Action Detection Guided by Audio Narration [50.4318060593995]
We propose a model to learn from the narration supervision and utilize multimodal features, including RGB, motion flow, and ambient sound.
Our experiments show that noisy audio narration suffices to learn a good action detection model, thus reducing annotation expenses.
arXiv Detail & Related papers (2022-05-12T06:33:24Z)
- TDT: Teaching Detectors to Track without Fully Annotated Videos [2.8292841621378844]
One-stage trackers that predict both detections and appearance embeddings in one forward pass have received much attention.
Our proposed one-stage solution matches the two-stage counterpart in quality but is 3 times faster.
arXiv Detail & Related papers (2022-05-11T15:56:17Z)
- Learning to Track Objects from Unlabeled Videos [63.149201681380305]
In this paper, we propose to learn an Unsupervised Single Object Tracker (USOT) from scratch.
To narrow the gap between unsupervised trackers and supervised counterparts, we propose an effective unsupervised learning approach composed of three stages.
Experiments show that the proposed USOT, learned from unlabeled videos, outperforms state-of-the-art unsupervised trackers by large margins.
arXiv Detail & Related papers (2021-08-28T22:10:06Z)
- MOTSynth: How Can Synthetic Data Help Pedestrian Detection and Tracking? [36.094861549144426]
Deep learning methods for video pedestrian detection and tracking require large volumes of training data to achieve good performance.
We generate MOTSynth, a large, highly diverse synthetic dataset for object detection and tracking using a rendering game engine.
Our experiments show that MOTSynth can be used as a replacement for real data on tasks such as pedestrian detection, re-identification, segmentation, and tracking.
arXiv Detail & Related papers (2021-04-01T06:47:41Z)
- Learning to Track Instances without Video Annotations [85.9865889886669]
We introduce a novel semi-supervised framework by learning instance tracking networks with only a labeled image dataset and unlabeled video sequences.
We show that even when only trained with images, the learned feature representation is robust to instance appearance variations.
In addition, we integrate this module into single-stage instance segmentation and pose estimation frameworks.
arXiv Detail & Related papers (2021-03-24T01:45:19Z)
- Benchmarking Deep Trackers on Aerial Videos [5.414308305392762]
In this paper, we compare ten trackers based on deep learning techniques on four aerial datasets.
We choose top performing trackers utilizing different approaches, specifically tracking by detection, discriminative correlation filters, Siamese networks and reinforcement learning.
Our findings indicate that the trackers perform significantly worse on aerial datasets than on standard ground-level videos.
arXiv Detail & Related papers (2020-07-22T08:23:12Z)
- Unsupervised Deep Representation Learning for Real-Time Tracking [137.69689503237893]
We propose an unsupervised learning method for visual tracking.
The motivation of our unsupervised learning is that a robust tracker should be effective in bidirectional tracking.
We build our framework on a Siamese correlation filter network, and propose a multi-frame validation scheme and a cost-sensitive loss to facilitate unsupervised learning.
arXiv Detail & Related papers (2020-06-24T12:28:17Z)
- Labelling unlabelled videos from scratch with multi-modal self-supervision [82.60652426371936]
Unsupervised labelling of a video dataset does not come for free from strong feature encoders.
We propose a novel clustering method that allows pseudo-labelling of a video dataset without any human annotations.
An extensive analysis shows that the resulting clusters have high semantic overlap to ground truth human labels.
arXiv Detail & Related papers (2020-06-24T12:28:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.