MLReal: Bridging the gap between training on synthetic data and real
data applications in machine learning
- URL: http://arxiv.org/abs/2109.05294v1
- Date: Sat, 11 Sep 2021 14:43:34 GMT
- Title: MLReal: Bridging the gap between training on synthetic data and real
data applications in machine learning
- Authors: Tariq Alkhalifah, Hanchen Wang, Oleg Ovcharenko
- Abstract summary: We describe a novel approach to enhance supervised training on synthetic data with real data features.
In the training stage, the input data are from the synthetic domain and the autocorrelated data are from the real domain.
In the inference/application stage, the input data are from the real data subset domain and the mean of the autocorrelated sections is from the synthetic data subset domain.
- Score: 1.9852463786440129
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Among the biggest challenges we face in utilizing neural networks trained on
waveform data (e.g., seismic, electromagnetic, or ultrasound) is their
application to real data. The requirement for accurate labels forces us to
develop solutions using synthetic data, where labels are readily available.
However, synthetic data often do not capture the reality of the field/real
experiment, and we end up with poor performance of the trained neural network
(NN) at the inference stage. We describe a novel approach to enhance supervised
training on synthetic data with real data features (domain adaptation).
Specifically, for tasks in which the absolute values of the vertical axis (time
or depth) of the input data are not crucial, like classification, or can be
corrected afterward, like velocity model building using a well-log, we suggest
a series of linear operations on the input so the training and application data
have similar distributions. This is accomplished by applying two operations on
the input data to the NN model: 1) The crosscorrelation of the input data
(e.g., a shot gather or seismic image) with a fixed reference trace from the
same dataset. 2) The convolution of the resulting data with the mean (or a
random sample) of the autocorrelated data from another domain. In the training
stage, the input data are from the synthetic domain and the autocorrelated
data are from the real domain, and random samples from real data are drawn at
every training epoch. In the inference/application stage, the input data are
from the real data subset domain and the mean of the autocorrelated sections is
from the synthetic data subset domain. Example applications on passive seismic
data for microseismic event source location determination and active seismic
data for predicting low frequencies are used to demonstrate the power of this
approach in improving the applicability of trained models to real data.
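Because the two operations above are linear, they can be applied as a simple preprocessing step before the data reach the network. The sketch below is a minimal Python (NumPy/SciPy) illustration of that preprocessing for a single 2D section of shape (n_traces, n_time); the function names, the single reference-trace index, and the trace-by-trace 1D convolution are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of the two linear operations described in the abstract.
# Assumptions (not from the paper's code): sections are 2D arrays of shape
# (n_traces, n_time), correlation/convolution are applied trace by trace
# along the time axis, and a single trace index serves as the fixed reference.
import numpy as np
from scipy.signal import fftconvolve


def autocorrelate_section(section: np.ndarray) -> np.ndarray:
    """Autocorrelate every trace of a section along the time axis."""
    return np.stack([np.correlate(tr, tr, mode="full") for tr in section])


def mlreal_transform(section: np.ndarray,
                     other_domain_autocorr: np.ndarray,
                     reference_trace: int = 0) -> np.ndarray:
    """Apply the two operations to one section.

    1) Crosscorrelate every trace with a fixed reference trace taken from
       the same section (i.e., from the same domain as the input).
    2) Convolve each resulting trace with the corresponding trace of an
       autocorrelated section from the other domain.
    """
    ref = section[reference_trace]
    crosscorr = np.stack([np.correlate(tr, ref, mode="full") for tr in section])
    return np.stack([fftconvolve(cc, ac, mode="same")
                     for cc, ac in zip(crosscorr, other_domain_autocorr)])


# Training stage: synthetic input, autocorrelation of a real section drawn
# at random every epoch.
# x_train = mlreal_transform(synthetic_section,
#                            autocorrelate_section(random_real_section))
# Inference stage: real input, mean of the autocorrelated synthetic sections.
# synth_mean_autocorr = np.mean(
#     [autocorrelate_section(s) for s in synthetic_sections], axis=0)
# x_infer = mlreal_transform(real_section, synth_mean_autocorr)
```

Per the abstract, the real-domain autocorrelation used during training is redrawn at every epoch, so the transformed synthetic inputs and the transformed real inputs end up with similar distributions.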
Related papers
- Improving Object Detector Training on Synthetic Data by Starting With a Strong Baseline Methodology [0.14980193397844666]
We propose a methodology for improving the performance of a pre-trained object detector when training on synthetic data.
Our approach focuses on extracting the salient information from synthetic data without forgetting useful features learned from pre-training on real images.
arXiv Detail & Related papers (2024-05-30T08:31:01Z)
- Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z)
- Diffusion Dataset Generation: Towards Closing the Sim2Real Gap for Pedestrian Detection [0.11470070927586014]
We propose a novel method of synthetic data creation meant to close the sim2real gap for the pedestrian detection task.
Our method uses a diffusion-based architecture to learn a real-world distribution which, once trained, is used to generate datasets.
We show that training on a combination of generated and simulated data increases average precision by as much as 27.3% for pedestrian detection models in real-world data.
arXiv Detail & Related papers (2023-05-16T12:33:51Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- A New Benchmark: On the Utility of Synthetic Data with Blender for Bare Supervised Learning and Downstream Domain Adaptation [42.2398858786125]
Deep learning in computer vision has achieved great success at the price of large-scale labeled training data.
The uncontrollable data collection process produces non-IID training and test data, where undesired duplication may exist.
To circumvent these issues, an alternative is to generate synthetic data via 3D rendering with domain randomization.
arXiv Detail & Related papers (2023-03-16T09:03:52Z)
- Quantifying the LiDAR Sim-to-Real Domain Shift: A Detailed Investigation Using Object Detectors and Analyzing Point Clouds at Target-Level [1.1999555634662635]
LiDAR object detection algorithms based on neural networks for autonomous driving require large amounts of data for training, validation, and testing.
We show that using simulated data for the training of neural networks leads to a domain shift of training and testing data due to differences in scenes, scenarios, and distributions.
arXiv Detail & Related papers (2023-03-03T12:52:01Z)
- One-Shot Domain Adaptive and Generalizable Semantic Segmentation with Class-Aware Cross-Domain Transformers [96.51828911883456]
Unsupervised sim-to-real domain adaptation (UDA) for semantic segmentation aims to improve the real-world test performance of a model trained on simulated data.
Traditional UDA often assumes that there are abundant unlabeled real-world data samples available during training for the adaptation.
We explore the one-shot unsupervised sim-to-real domain adaptation (OSUDA) and generalization problem, where only one real-world data sample is available.
arXiv Detail & Related papers (2022-12-14T15:54:15Z)
- Dataset Distillation by Matching Training Trajectories [75.9031209877651]
We propose a new formulation that optimizes our distilled data to guide networks to a similar state as those trained on real data.
Given a network, we train it for several iterations on our distilled data and optimize the distilled data with respect to the distance between the synthetically trained parameters and the parameters trained on real data.
Our method handily outperforms existing methods and also allows us to distill higher-resolution visual data.
arXiv Detail & Related papers (2022-03-22T17:58:59Z)
- Attentive Prototypes for Source-free Unsupervised Domain Adaptive 3D Object Detection [85.11649974840758]
3D object detection networks tend to be biased towards the data they are trained on.
We propose a single-frame approach for source-free, unsupervised domain adaptation of lidar-based 3D object detectors.
arXiv Detail & Related papers (2021-11-30T18:42:42Z)
- STAN: Synthetic Network Traffic Generation with Generative Neural Models [10.54843182184416]
This paper presents STAN (Synthetic network Traffic generation with Autoregressive Neural models), a tool to generate realistic synthetic network traffic datasets.
Our novel neural architecture captures both temporal dependencies and dependence between attributes at any given time.
We evaluate the performance of STAN in terms of the quality of data generated, by training it on both a simulated dataset and a real network traffic data set.
arXiv Detail & Related papers (2020-09-27T04:20:02Z)
- Understanding Self-Training for Gradual Domain Adaptation [107.37869221297687]
We consider gradual domain adaptation, where the goal is to adapt an initial classifier trained on a source domain given only unlabeled data that shifts gradually in distribution towards a target domain.
We prove the first non-vacuous upper bound on the error of self-training with gradual shifts, under settings where directly adapting to the target domain can result in unbounded error.
The theoretical analysis leads to algorithmic insights, highlighting that regularization and label sharpening are essential even when we have infinite data, and suggesting that self-training works particularly well for shifts with small Wasserstein-infinity distance.
arXiv Detail & Related papers (2020-02-26T08:59:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.