A novel algorithm can generate data to train machine learning models in conditions of extreme scarcity of real world data
- URL: http://arxiv.org/abs/2305.00987v1
- Date: Mon, 1 May 2023 16:24:40 GMT
- Title: A novel algorithm can generate data to train machine learning models in conditions of extreme scarcity of real world data
- Authors: Olivier Niel
- Abstract summary: We propose an algorithm that generates large artificial datasets for training machine learning models.
A neural network is trained on each generated dataset, and its performance on a batch of real-world data is used as a surrogate for that dataset's fitness.
Under simulated extreme scarcity of real-world data, the mean accuracy of machine learning models trained on generated data was significantly higher than that of comparable models trained on the scarce real-world data.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Training machine learning models requires large datasets. However,
collecting, curating, and managing large, complex sets of real-world data poses
problems of cost, ethical and legal constraints, and data availability. Here we
propose a novel algorithm to generate large artificial datasets for training
machine learning models under conditions of extreme scarcity of real-world
data. The algorithm is based on a genetic algorithm that mutates randomly
generated datasets, which are then used to train a neural network. After
training, the performance of the neural network on a batch of real-world data
serves as a surrogate for the fitness of the generated dataset used to train
it. As selection pressure is applied to the population of generated datasets,
unfit individuals are discarded and the fitness of the fittest individuals
increases across generations. The performance of the data generation algorithm
was measured on the Iris dataset and on the Breast Cancer Wisconsin diagnostic
dataset. When real-world data were abundant, the mean accuracy of machine
learning models trained on generated data was comparable to that of models
trained on real-world data (0.956 in both cases on the Iris dataset, p =
0.6996, and 0.9377 versus 0.9472 on the Breast Cancer dataset, p = 0.1189).
Under simulated extreme scarcity of real-world data, the mean accuracy of
models trained on generated data was significantly higher than that of
comparable models trained on the scarce real-world data (0.9533 versus 0.9067
on the Iris dataset, p < 0.0001, and 0.8692 versus 0.7701 on the Breast Cancer
dataset, p = 0.0091). In conclusion, this novel algorithm can generate large
artificial datasets to train machine learning models under extreme scarcity of
real-world data, or when cost or data sensitivity prevents the collection of
large real-world datasets.
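To make the method concrete, below is a minimal sketch of the idea in Python. It is not the authors' implementation: the population size, mutation scheme, selection rule, network architecture, and all hyperparameters are illustrative assumptions. Each individual in the population is a candidate synthetic training set; its fitness is the accuracy, on a small batch of real Iris data, of an MLP trained on it.

```python
# Illustrative sketch of GA-driven synthetic data generation (not the paper's code).
# Assumed details: elitist selection, Gaussian feature mutation, random label
# flips, and a small MLP as the trainee model.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
n_classes = len(np.unique(y))

# Simulate extreme scarcity: keep only a small batch of real data for fitness
# evaluation; hold out the rest for a final test.
X_real, X_test, y_real, y_test = train_test_split(
    X, y, train_size=30, stratify=y, random_state=0)

N_POP, N_GEN, N_SYN = 16, 10, 120  # population, generations, synthetic rows (assumed)
lo, hi = X_real.min(axis=0), X_real.max(axis=0)  # bound features to the real range

def random_dataset():
    """A random individual: synthetic features plus random labels."""
    Xs = rng.uniform(lo, hi, size=(N_SYN, X.shape[1]))
    ys = rng.integers(0, n_classes, size=N_SYN)
    return Xs, ys

def fitness(ds):
    """Train a small MLP on the synthetic set; real-batch accuracy is the fitness."""
    Xs, ys = ds
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
    clf.fit(Xs, ys)
    return clf.score(X_real, y_real)

def mutate(ds, rate=0.1):
    """Perturb a fraction of feature values and flip a fraction of labels."""
    Xs, ys = ds[0].copy(), ds[1].copy()
    mask = rng.random(Xs.shape) < rate
    Xs[mask] += rng.normal(0.0, 0.1, mask.sum()) * (hi - lo).mean()
    flips = rng.random(N_SYN) < rate
    ys[flips] = rng.integers(0, n_classes, size=flips.sum())
    return Xs, ys

population = [random_dataset() for _ in range(N_POP)]
for gen in range(N_GEN):
    ranked = sorted(population, key=fitness, reverse=True)
    elite = ranked[: N_POP // 4]  # selection pressure: discard unfit individuals
    population = elite + [mutate(elite[rng.integers(len(elite))])
                          for _ in range(N_POP - len(elite))]

best = max(population, key=fitness)
final = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0).fit(*best)
print(f"held-out accuracy of model trained on generated data: {final.score(X_test, y_test):.3f}")
```

Under this framing, the abstract's comparison amounts to training one model on the fittest generated dataset and another on the scarce real batch, then comparing their mean accuracies on held-out data over repeated runs.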
Related papers
- Self-Correcting Self-Consuming Loops for Generative Model Training
Machine learning models are increasingly trained on a mix of human- and machine-generated data.
Despite success stories of using synthetic data for representation learning, using synthetic data to train generative models creates "self-consuming loops".
Our paper aims to stabilize self-consuming generative model training by introducing an idealized correction function.
arXiv Detail & Related papers (2024-02-11T02:34:42Z)
- Learning Defect Prediction from Unrealistic Data
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark
Synthetic data serves as an alternative for training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Exploring the Effectiveness of Dataset Synthesis: An application of Apple Detection in Orchards
We explore the usability of Stable Diffusion 2.1-base for generating synthetic datasets of apple trees for object detection.
We train a YOLOv5m object detection model to predict apples in a real-world apple detection dataset.
Results demonstrate that the model trained on generated data slightly underperforms a baseline model trained on real-world images.
arXiv Detail & Related papers (2023-06-20T09:46:01Z)
- Scaling Data-Constrained Language Models
We investigate scaling language models in data-constrained regimes.
We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data.
We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters.
arXiv Detail & Related papers (2023-05-25T17:18:55Z)
- Diffusion Dataset Generation: Towards Closing the Sim2Real Gap for Pedestrian Detection
We propose a novel method of synthetic data creation meant to close the sim2real gap for the pedestrian detection task.
Our method uses a diffusion-based architecture to learn a real-world distribution which, once trained, is used to generate datasets.
We show that training on a combination of generated and simulated data increases average precision by as much as 27.3% for pedestrian detection models on real-world data.
arXiv Detail & Related papers (2023-05-16T12:33:51Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Reducing the Amount of Real World Data for Object Detector Training with Synthetic Data
We quantify how much real world data can be saved when using a mixed dataset of synthetic and real world data.
We find that the need for real world data can be reduced by up to 70% without sacrificing detection performance.
arXiv Detail & Related papers (2022-01-31T08:13:12Z)
- MLReal: Bridging the gap between training on synthetic data and real data applications in machine learning
We describe a novel approach to enhance supervised training on synthetic data with real data features.
In the training stage, the input data are from the synthetic domain and the auto-correlated data are from the real domain.
In the inference/application stage, the input data are from the real subset domain and the mean of the autocorrelated sections is from the synthetic data subset domain.
arXiv Detail & Related papers (2021-09-11T14:43:34Z)
- STAN: Synthetic Network Traffic Generation with Generative Neural Models
This paper presents STAN (Synthetic network Traffic generation with Autoregressive Neural models), a tool to generate realistic synthetic network traffic datasets.
Our novel neural architecture captures both temporal dependencies and dependence between attributes at any given time.
We evaluate the performance of STAN in terms of the quality of the data generated, by training it on both a simulated dataset and a real network traffic dataset.
arXiv Detail & Related papers (2020-09-27T04:20:02Z)
- AutoSimulate: (Quickly) Learning Synthetic Data Generation
We propose an efficient alternative for optimal synthetic data generation based on a novel differentiable approximation of the objective.
We demonstrate that the proposed method finds the optimal data distribution faster (up to $50\times$), with significantly reduced training data generation (up to $30\times$) and better accuracy ($+8.7\%$) on real-world test datasets than previous methods.
arXiv Detail & Related papers (2020-08-16T11:36:11Z)