MOTSynth: How Can Synthetic Data Help Pedestrian Detection and Tracking?
- URL: http://arxiv.org/abs/2108.09518v1
- Date: Sat, 21 Aug 2021 14:25:25 GMT
- Title: MOTSynth: How Can Synthetic Data Help Pedestrian Detection and Tracking?
- Authors: Matteo Fabbri, Guillem Braso, Gianluca Maugeri, Orcun Cetintas,
Riccardo Gasparini, Aljosa Osep, Simone Calderara, Laura Leal-Taixe, Rita
Cucchiara
- Abstract summary: Deep learning methods for video pedestrian detection and tracking require large volumes of training data to achieve good performance.
We generate MOTSynth, a large, highly diverse synthetic dataset for object detection and tracking using a rendering game engine.
Our experiments show that MOTSynth can be used as a replacement for real data on tasks such as pedestrian detection, re-identification, segmentation, and tracking.
- Score: 36.094861549144426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning-based methods for video pedestrian detection and tracking
require large volumes of training data to achieve good performance. However,
data acquisition in crowded public environments raises data privacy concerns --
we are not allowed to simply record and store data without the explicit consent
of all participants. Furthermore, the annotation of such data for computer
vision applications usually requires a substantial amount of manual effort,
especially in the video domain. Labeling instances of pedestrians in highly
crowded scenarios can be challenging even for human annotators and may
introduce errors in the training data. In this paper, we study how we can
advance different aspects of multi-person tracking using solely synthetic data.
To this end, we generate MOTSynth, a large, highly diverse synthetic dataset
for object detection and tracking using a rendering game engine. Our
experiments show that MOTSynth can be used as a replacement for real data on
tasks such as pedestrian detection, re-identification, segmentation, and
tracking.
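As a rough, hedged illustration of the synthetic-to-real workflow the abstract describes, the sketch below fine-tunes an off-the-shelf torchvision detector on a synthetic pedestrian dataset with COCO-style annotations; the file paths, the annotation format, the single pedestrian class, and the hyperparameters are assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch (not the authors' code): train a detector on synthetic pedestrian
# data, then evaluate it on real footage. Paths and hyperparameters are assumed.
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import CocoDetection
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms import functional as TF

def collate(batch):
    # Detection batches have a variable number of boxes per image.
    return tuple(zip(*batch))

def to_target(anns, device):
    # Convert COCO-style [x, y, w, h] annotations into the {boxes, labels} dict
    # expected by torchvision detection models.
    boxes = torch.tensor([a["bbox"] for a in anns], dtype=torch.float32)
    boxes[:, 2:] += boxes[:, :2]                            # xywh -> xyxy
    labels = torch.ones((len(anns),), dtype=torch.int64)    # class 1 = pedestrian
    return {"boxes": boxes.to(device), "labels": labels.to(device)}

device = "cuda" if torch.cuda.is_available() else "cpu"
train_set = CocoDetection("synthetic/frames", "synthetic/annotations.json",  # hypothetical paths
                          transform=TF.to_tensor)
loader = DataLoader(train_set, batch_size=2, shuffle=True, collate_fn=collate)

model = fasterrcnn_resnet50_fpn(num_classes=2).to(device)   # background + pedestrian
optim = torch.optim.SGD(model.parameters(), lr=5e-3, momentum=0.9)

model.train()
for images, anns in loader:
    keep = [i for i, a in enumerate(anns) if len(a) > 0]    # skip frames without pedestrians
    if not keep:
        continue
    imgs = [images[i].to(device) for i in keep]
    targets = [to_target(anns[i], device) for i in keep]
    losses = model(imgs, targets)                            # dict of losses in train mode
    loss = sum(losses.values())
    optim.zero_grad()
    loss.backward()
    optim.step()
# Evaluation on a real benchmark (e.g. MOT17 frames) would reuse the same data path
# with model.eval() and standard detection metrics.
```

The re-identification, segmentation, and tracking experiments mentioned in the abstract would build on the same synthetic annotations, but those pipelines are not sketched here.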
Related papers
- Learning Human Action Recognition Representations Without Real Humans [66.61527869763819]
We present a benchmark that leverages real-world videos with humans removed and synthetic data containing virtual humans to pre-train a model.
We then evaluate the transferability of the representation learned on this data to a diverse set of downstream action recognition benchmarks.
Our approach outperforms previous baselines by up to 5%.
arXiv Detail & Related papers (2023-11-10T18:38:14Z)
- ParGANDA: Making Synthetic Pedestrians A Reality For Object Detection [2.7648976108201815]
We propose to use a Generative Adversarial Network (GAN) to close the gap between the real and synthetic data.
Our approach not only produces visually plausible samples but also requires no labels from the real domain.
arXiv Detail & Related papers (2023-07-21T05:26:32Z)
- Unifying Tracking and Image-Video Object Detection [54.91658924277527]
TrIVD (Tracking and Image-Video Detection) is the first framework that unifies image OD, video OD, and MOT within one end-to-end model.
To handle the discrepancies and semantic overlaps of category labels, TrIVD formulates detection/tracking as grounding and reasons about object categories.
arXiv Detail & Related papers (2022-11-20T20:30:28Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- PieTrack: An MOT solution based on synthetic data training and self-supervised domain adaptation [17.716808322509667]
PieTrack is developed based on synthetic data without using any pre-trained weights.
By leveraging the proposed multi-scale ensemble inference, we achieved a final HOTA score of 58.7 on the MOT17 test set, ranking third in the challenge.
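As a hedged illustration of what the multi-scale ensemble inference mentioned above could look like (this is not PieTrack's implementation), the sketch below runs a single torchvision-style detector at several input scales, maps the boxes back to the original resolution, and merges the ensemble with non-maximum suppression; the scales, thresholds, and detector interface are assumptions.

```python
# Hedged sketch of multi-scale test-time ensembling (not PieTrack's code).
import torch
from torchvision.ops import nms
from torchvision.transforms import functional as TF

@torch.no_grad()
def multi_scale_detect(model, image, scales=(0.75, 1.0, 1.25), iou_thr=0.6):
    """image: CxHxW tensor in [0, 1]; returns merged boxes/scores at the original scale."""
    _, h, w = image.shape
    all_boxes, all_scores = [], []
    model.eval()
    for s in scales:
        rh, rw = int(h * s), int(w * s)
        out = model([TF.resize(image, [rh, rw])])[0]   # torchvision detection interface assumed
        boxes = out["boxes"].clone()
        boxes[:, [0, 2]] *= w / rw                     # undo the resize on x coordinates
        boxes[:, [1, 3]] *= h / rh                     # undo the resize on y coordinates
        all_boxes.append(boxes)
        all_scores.append(out["scores"])
    boxes, scores = torch.cat(all_boxes), torch.cat(all_scores)
    keep = nms(boxes, scores, iou_thr)                 # merge the ensemble
    return boxes[keep], scores[keep]
```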
arXiv Detail & Related papers (2022-07-22T20:34:49Z)
- Virtual passengers for real car solutions: synthetic datasets [2.1028463367241033]
We build a 3D scenario and setup that resemble reality as closely as possible.
It is possible to configure and vary parameters to add randomness to the scene.
We present the process and concept of synthetic data generation in an automotive context.
arXiv Detail & Related papers (2022-05-13T10:54:39Z)
- TDT: Teaching Detectors to Track without Fully Annotated Videos [2.8292841621378844]
One-stage trackers that predict both detections and appearance embeddings in one forward pass have received much attention.
Our proposed one-stage solution matches the two-stage counterpart in quality but is 3 times faster.
arXiv Detail & Related papers (2022-05-11T15:56:17Z)
- Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained increasing attention due to its widespread application in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)
- Multi-Object Tracking with Hallucinated and Unlabeled Videos [34.38275236770619]
In place of tracking annotations, we first hallucinate videos with bounding box annotations using zoom-in/out motion transformations.
We then mine hard examples across an unlabeled pool of real videos with a tracker trained on our hallucinated video data.
Our weakly supervised tracker achieves state-of-the-art performance on the MOT17 and TAO-person datasets.
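A minimal sketch of the zoom-style hallucination idea described above (not the paper's implementation): one annotated image is turned into a short pseudo-video by progressively cropping toward the center and rescaling the boxes to match each crop, so box identities persist across the hallucinated frames for free. The function name and parameters are illustrative assumptions.

```python
# Hedged sketch: hallucinate a short clip from one annotated frame via a center zoom.
import torch
from torchvision.transforms import functional as TF

def hallucinate_zoom_clip(image, boxes, num_frames=8, max_zoom=1.5):
    """image: CxHxW tensor; boxes: Nx4 tensor in xyxy pixel coordinates."""
    _, h, w = image.shape
    frames, tracks = [], []
    for t in range(num_frames):
        zoom = 1.0 + (max_zoom - 1.0) * t / max(num_frames - 1, 1)
        ch, cw = int(h / zoom), int(w / zoom)
        top, left = (h - ch) // 2, (w - cw) // 2
        # Crop the center region and resize back to the original resolution.
        frames.append(TF.resized_crop(image, top, left, ch, cw, [h, w]))
        # Apply the same geometric transform to the boxes (identities carry over).
        b = boxes.clone().float()
        b[:, [0, 2]] = ((b[:, [0, 2]] - left) * (w / cw)).clamp(0, w)
        b[:, [1, 3]] = ((b[:, [1, 3]] - top) * (h / ch)).clamp(0, h)
        tracks.append(b)
    return torch.stack(frames), tracks  # boxes fully outside a crop would be filtered in practice
```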
arXiv Detail & Related papers (2021-08-19T17:57:29Z)
- TAO: A Large-Scale Benchmark for Tracking Any Object [95.87310116010185]
The Tracking Any Object (TAO) dataset consists of 2,907 high-resolution videos, captured in diverse environments, which are half a minute long on average.
We ask annotators to label objects that move at any point in the video, and give names to them post factum.
Our vocabulary is both significantly larger and qualitatively different from existing tracking datasets.
arXiv Detail & Related papers (2020-05-20T21:07:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.