Diffusion Dataset Generation: Towards Closing the Sim2Real Gap for
Pedestrian Detection
- URL: http://arxiv.org/abs/2305.09401v1
- Date: Tue, 16 May 2023 12:33:51 GMT
- Title: Diffusion Dataset Generation: Towards Closing the Sim2Real Gap for
Pedestrian Detection
- Authors: Andrew Farley, Mohsen Zand, Michael Greenspan
- Abstract summary: We propose a novel method of synthetic data creation meant to close the sim2real gap for the pedestrian detection task.
Our method uses a diffusion-based architecture to learn a real-world distribution which, once trained, is used to generate datasets.
We show that training on a combination of generated and simulated data increases average precision by as much as 27.3% for pedestrian detection models on real-world data.
- Score: 0.11470070927586014
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a method that augments a simulated dataset using diffusion models
to improve the performance of pedestrian detection on real-world data. The high
cost of collecting and annotating data in the real world has motivated the use
of simulation platforms to create training datasets. While simulated data is
inexpensive to collect and annotate, it unfortunately does not always closely
match the distribution of real-world data, which is known as the sim2real gap.
In this paper we propose a novel method of synthetic data creation meant to
close the sim2real gap for the challenging pedestrian detection task. Our
method uses a diffusion-based architecture to learn a real-world distribution
which, once trained, is used to generate datasets. We mix this generated data
with simulated data as a form of augmentation and show that training on a
combination of generated and simulated data increases average precision by as
much as 27.3% for pedestrian detection models on real-world data, compared
with training on purely simulated data.
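The mixing step described in the abstract (combining diffusion-generated samples with simulated ones before training) can be illustrated with a minimal sketch. The abstract does not state the mixing ratio or the sampling procedure, so the function name `mix_datasets`, the `generated_fraction` parameter, and the policy of keeping every simulated sample are assumptions made purely for illustration, not the authors' implementation.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_datasets(
    simulated: Sequence[T],
    generated: Sequence[T],
    generated_fraction: float,
    seed: int = 0,
) -> List[T]:
    """Return a shuffled training list in which roughly `generated_fraction`
    of the samples come from `generated` and the rest from `simulated`.

    Hypothetical helper: the ratio and sampling policy are assumptions.
    """
    if not 0.0 <= generated_fraction < 1.0:
        raise ValueError("generated_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Keep every simulated sample, then work out how many generated samples
    # are needed for the mix to reach the requested fraction.
    n_sim = len(simulated)
    n_gen = round(n_sim * generated_fraction / (1.0 - generated_fraction))
    # Never ask for more generated samples than are available.
    n_gen = min(n_gen, len(generated))
    mixed = list(simulated) + rng.sample(list(generated), n_gen)
    rng.shuffle(mixed)
    return mixed


# Usage with placeholder sample identifiers standing in for annotated images.
simulated_samples = [f"sim_{i}" for i in range(80)]
generated_samples = [f"gen_{i}" for i in range(50)]
training_set = mix_datasets(simulated_samples, generated_samples, generated_fraction=0.2)
```

A fixed seed keeps the mix reproducible, which makes it easier to compare a detector trained on the mixed set against one trained on simulated data alone.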
Related papers
- Improving Offline Reinforcement Learning with Inaccurate Simulators [34.54402525918925]
We propose a novel approach to combine the offline dataset and the inaccurate simulation data in a better manner.
Specifically, we pre-train a generative adversarial network (GAN) model to fit the state distribution of the offline dataset.
Our experimental results in the D4RL benchmark and a real-world manipulation task confirm that our method can benefit more from both inaccurate simulator and limited offline datasets to achieve better performance than the state-of-the-art methods.
arXiv Detail & Related papers (2024-05-07T13:29:41Z)
- Are NeRFs ready for autonomous driving? Towards closing the real-to-simulation gap [6.393953433174051]
We propose a novel perspective for addressing the real-to-simulated data gap.
We conduct the first large-scale investigation into the real-to-simulated data gap in an autonomous driving setting.
Our results show notable improvements in model robustness to simulated data, even improving real-world performance in some cases.
arXiv Detail & Related papers (2024-03-24T11:09:41Z)
- Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z)
- A novel algorithm can generate data to train machine learning models in conditions of extreme scarcity of real world data [0.0]
We propose an algorithm to generate large artificial datasets to train machine learning models.
The performance of the neural network on a batch of real world data is considered a surrogate for the fitness of the generated dataset.
In conditions of simulated extreme scarcity of real world data, mean accuracy of machine learning models trained on generated data was significantly higher than mean accuracy of comparable models trained on scarce real world data.
arXiv Detail & Related papers (2023-05-01T16:24:40Z)
- Quantifying the LiDAR Sim-to-Real Domain Shift: A Detailed Investigation Using Object Detectors and Analyzing Point Clouds at Target-Level [1.1999555634662635]
LiDAR object detection algorithms based on neural networks for autonomous driving require large amounts of data for training, validation, and testing.
We show that using simulated data for the training of neural networks leads to a domain shift of training and testing data due to differences in scenes, scenarios, and distributions.
arXiv Detail & Related papers (2023-03-03T12:52:01Z)
- One-Shot Domain Adaptive and Generalizable Semantic Segmentation with Class-Aware Cross-Domain Transformers [96.51828911883456]
Unsupervised sim-to-real domain adaptation (UDA) for semantic segmentation aims to improve the real-world test performance of a model trained on simulated data.
Traditional UDA often assumes that there are abundant unlabeled real-world data samples available during training for the adaptation.
We explore the one-shot unsupervised sim-to-real domain adaptation (OSUDA) and generalization problem, where only one real-world data sample is available.
arXiv Detail & Related papers (2022-12-14T15:54:15Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- Robot Learning from Randomized Simulations: A Review [59.992761565399185]
Deep learning has caused a paradigm shift in robotics research, favoring methods that require large amounts of data.
State-of-the-art approaches learn in simulation where data generation is fast as well as inexpensive.
We focus on a technique named 'domain randomization' which is a method for learning from randomized simulations.
arXiv Detail & Related papers (2021-11-01T13:55:41Z)
- MLReal: Bridging the gap between training on synthetic data and real data applications in machine learning [1.9852463786440129]
We describe a novel approach to enhance supervised training on synthetic data with real data features.
In the training stage, the input data are from the synthetic domain and the autocorrelated data are from the real domain.
In the inference/application stage, the input data are from the real subset domain and the mean of the autocorrelated sections is from the synthetic data subset domain.
arXiv Detail & Related papers (2021-09-11T14:43:34Z)
- AutoSimulate: (Quickly) Learning Synthetic Data Generation [70.82315853981838]
We propose an efficient alternative for optimal synthetic data generation based on a novel differentiable approximation of the objective.
We demonstrate that the proposed method finds the optimal data distribution faster (up to $50\times$), with significantly reduced training data generation (up to $30\times$) and better accuracy ($+8.7\%$) on real-world test datasets than previous methods.
arXiv Detail & Related papers (2020-08-16T11:36:11Z)