Comparing Synthetic Tabular Data Generation Between a Probabilistic
Model and a Deep Learning Model for Education Use Cases
- URL: http://arxiv.org/abs/2210.08528v1
- Date: Sun, 16 Oct 2022 13:21:23 GMT
- Title: Comparing Synthetic Tabular Data Generation Between a Probabilistic
Model and a Deep Learning Model for Education Use Cases
- Authors: Herkulaas MvE Combrink, Vukosi Marivate, Benjamin Rosman
- Abstract summary: The ability to generate synthetic data has a variety of use cases across different domains.
In education research, there is a growing need to have access to synthetic data to test certain concepts and ideas.
- Score: 12.358921226358133
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to generate synthetic data has a variety of use cases across
different domains. In education research, there is a growing need to have
access to synthetic data to test certain concepts and ideas. In recent years,
several deep learning architectures were used to aid in the generation of
synthetic data but with varying results. In the education context, the
sophistication of implementing different models requiring large datasets is
becoming very important. This study aims to compare the application of
synthetic tabular data generation between a probabilistic model specifically a
Bayesian Network, and a deep learning model, specifically a Generative
Adversarial Network using a classification task. The results of this study
indicate that synthetic tabular data generation is better suited for the
education context using probabilistic models (overall accuracy of 75%) than
deep learning architecture (overall accuracy of 38%) because of probabilistic
interdependence. Lastly, we recommend that other data types, should be explored
and evaluated for their application in generating synthetic data for education
use cases.
Related papers
- Exploring the Landscape for Generative Sequence Models for Specialized Data Synthesis [0.0]
This paper introduces a novel approach that leverages three generative models of varying complexity to synthesize Malicious Network Traffic.
Our approach transforms numerical data into text, re-framing data generation as a language modeling task.
Our method surpasses state-of-the-art generative models in producing high-fidelity synthetic data.
arXiv Detail & Related papers (2024-11-04T09:51:10Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large
Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z) - Generation of Probabilistic Synthetic Data for Serious Games: A Case
Study on Cyberbullying [0.45880283710344055]
We propose a simulator for generating probabilistic synthetic data for serious games based on interactive narratives.
This architecture is designed to be generic and modular so that it can be used by other researchers on similar problems.
We apply the proposed architecture and methods in a use case of a serious game focused on cyberbullying.
arXiv Detail & Related papers (2023-06-02T08:43:15Z) - Utility Theory of Synthetic Data Generation [12.511220449652384]
This paper bridges the practice-theory gap by establishing relevant utility theory in a statistical learning framework.
It considers two utility metrics: generalization and ranking of models trained on synthetic data.
arXiv Detail & Related papers (2023-05-17T07:49:16Z) - Generating Realistic Synthetic Relational Data through Graph Variational
Autoencoders [47.89542334125886]
We combine the variational autoencoder framework with graph neural networks to generate realistic synthetic relational databases.
The results indicate that real databases' structures are accurately preserved in the resulting synthetic datasets.
arXiv Detail & Related papers (2022-11-30T10:40:44Z) - Evaluation of Categorical Generative Models -- Bridging the Gap Between
Real and Synthetic Data [18.142397311464343]
We introduce an appropriately scalable evaluation method for generative models.
We consider increasingly large probability spaces, which correspond to increasingly difficult modeling tasks.
We validate our evaluation procedure with synthetic experiments on both synthetic generative models and current state-of-the-art categorical generative models.
arXiv Detail & Related papers (2022-10-28T21:05:25Z) - HyperImpute: Generalized Iterative Imputation with Automatic Model
Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z) - Model-agnostic multi-objective approach for the evolutionary discovery
of mathematical models [55.41644538483948]
In modern data science, it is more interesting to understand the properties of the model, which parts could be replaced to obtain better results.
We use multi-objective evolutionary optimization for composite data-driven model learning to obtain the algorithm's desired properties.
arXiv Detail & Related papers (2021-07-07T11:17:09Z) - Foundations of Bayesian Learning from Synthetic Data [1.6249267147413522]
We use a Bayesian paradigm to characterise the updating of model parameters when learning on synthetic data.
Recent results from general Bayesian updating support a novel and robust approach to synthetic-learning founded on decision theory.
arXiv Detail & Related papers (2020-11-16T21:49:17Z) - DeGAN : Data-Enriching GAN for Retrieving Representative Samples from a
Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and lack of relevant data, for the future learning tasks of a trained network.
We use the available data, that may be an imbalanced subset of the original training dataset, or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.