Towards a methodology for addressing missingness in datasets, with an
application to demographic health datasets
- URL: http://arxiv.org/abs/2211.02856v1
- Date: Sat, 5 Nov 2022 09:02:30 GMT
- Title: Towards a methodology for addressing missingness in datasets, with an
application to demographic health datasets
- Authors: Gift Khangamwa, Terence L. van Zyl and Clint J. van Alten
- Abstract summary: We present a methodology for tackling missing data problems using a combination of synthetic dataset generation, missing data imputation and deep learning methods.
Our results show that models trained on synthetic and imputed datasets could make predictions with an accuracy of $83\%$ and $80\%$ on $a)$ an unseen real dataset and $b)$ an unseen reserved synthetic test dataset, respectively.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Missing data is a common concern in health datasets, and its impact on good
decision-making processes is well documented. Our study's contribution is a
methodology for tackling missing data problems using a combination of synthetic
dataset generation, missing data imputation and deep learning methods to
resolve missing data challenges. Specifically, we conducted a series of
experiments with the following objectives: $a)$ generating a realistic synthetic
dataset, $b)$ simulating data missingness, $c)$ recovering the missing data,
and $d)$ analyzing imputation performance. Our methodology used a Gaussian
mixture model, whose parameters were learned from a cleaned subset of a real
demographic and health dataset, to generate the synthetic data. We simulated
missingness levels of $10\%$, $20\%$, $30\%$, and $40\%$
under the missing completely at random (MCAR) scheme. We used an integrated
performance analysis framework involving clustering, classification and direct
imputation analysis. Our results show that models trained on synthetic and
imputed datasets could make predictions with an accuracy of $83\%$ and $80\%$
on $a)$ an unseen real dataset and $b)$ an unseen reserved synthetic test
dataset, respectively. Moreover, the models that used the denoising autoencoder
(DAE) method for imputation yielded the lowest log loss, an indication of good
performance, even though their accuracy measures were slightly lower. In
conclusion, our work demonstrates that, using our methodology, one can
reverse-engineer a solution to resolve missingness in an unseen dataset that
contains missing values. Moreover, though we used a health dataset, our
methodology can be applied in other contexts.
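To make the pipeline described in the abstract concrete, the following is a minimal sketch in Python. It assumes scikit-learn's GaussianMixture for step $a)$, independent Bernoulli masking for the MCAR simulation in step $b)$, and a mean imputer standing in for the paper's denoising autoencoder in step $c)$; the feature matrix, labels, component count, and classifier are illustrative stand-ins, not the authors' configuration.

```python
# Sketch: GMM-based synthetic data generation, MCAR masking at several rates,
# imputation, and downstream classification scoring (accuracy and log loss).
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for the cleaned real subset (continuous features + binary label).
X_real = rng.normal(size=(2000, 8))
y_real = (X_real[:, 0] + X_real[:, 1] > 0).astype(int)

# a) Generate a synthetic dataset from a GMM fitted on the cleaned subset.
gmm = GaussianMixture(n_components=5, random_state=0).fit(X_real)
X_syn, _ = gmm.sample(n_samples=2000)
y_syn = (X_syn[:, 0] + X_syn[:, 1] > 0).astype(int)  # proxy label for the sketch

# b) Simulate MCAR missingness at 10%-40%, c) impute, d) evaluate.
for rate in (0.10, 0.20, 0.30, 0.40):
    X_missing = X_syn.copy()
    mask = rng.random(X_missing.shape) < rate   # MCAR: mask independent of values
    X_missing[mask] = np.nan

    # The paper uses deep methods such as a denoising autoencoder (DAE);
    # a mean imputer stands in here to keep the sketch self-contained.
    X_imputed = SimpleImputer(strategy="mean").fit_transform(X_missing)

    X_tr, X_te, y_tr, y_te = train_test_split(
        X_imputed, y_syn, test_size=0.3, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    proba = clf.predict_proba(X_te)
    print(f"missingness {rate:.0%}: "
          f"accuracy={accuracy_score(y_te, proba.argmax(1)):.3f}, "
          f"log_loss={log_loss(y_te, proba):.3f}")
```

Swapping the mean imputer for a trained DAE (or any other imputer) leaves the rest of the evaluation loop unchanged, which is what allows the direct comparison of imputation methods reported in the abstract.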
Related papers
- M$^3$-Impute: Mask-guided Representation Learning for Missing Value Imputation [12.174699459648842]
M$^3$-Impute aims to explicitly leverage the missingness information and such correlations with novel masking schemes.
Experimental results show the effectiveness of M$^3$-Impute, achieving 20 best and 4 second-best MAE scores on average.
arXiv Detail & Related papers (2024-10-11T13:25:32Z) - Towards Reducing Data Acquisition and Labeling for Defect Detection using Simulated Data [0.04194295877935867]
In many manufacturing settings, annotating data for machine learning and computer vision is costly, but synthetic data can be generated at significantly lower cost.
Substituting the real-world data with synthetic data is therefore appealing for many machine learning applications that require large amounts of training data.
We discuss approaches for dealing with such a domain shift when detecting defects in X-ray scans of aluminium wheels.
arXiv Detail & Related papers (2024-06-27T13:51:53Z) - Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large
Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z) - How Good Are Synthetic Medical Images? An Empirical Study with Lung
Ultrasound [0.3312417881789094]
Adding synthetic training data using generative models offers a low-cost method to deal with the data scarcity challenge.
We show that training with both synthetic and real data outperforms training with real data alone.
arXiv Detail & Related papers (2023-10-05T15:42:53Z) - Exploring the Effectiveness of Dataset Synthesis: An application of
Apple Detection in Orchards [68.95806641664713]
We explore the usability of Stable Diffusion 2.1-base for generating synthetic datasets of apple trees for object detection.
We train a YOLOv5m object detection model to predict apples in a real-world apple detection dataset.
Results demonstrate that the model trained on generated data is slightly underperforming compared to a baseline model trained on real-world images.
arXiv Detail & Related papers (2023-06-20T09:46:01Z) - Machine Learning Based Missing Values Imputation in Categorical Datasets [2.5611256859404983]
This research looked into the use of machine learning algorithms to fill in the gaps in categorical datasets.
The emphasis was on ensemble models constructed using the Error Correction Output Codes framework.
Despite these encouraging results, deep learning for missing data imputation still faces obstacles, including the requirement for large amounts of labeled data.
arXiv Detail & Related papers (2023-06-10T03:29:48Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - AutoSimulate: (Quickly) Learning Synthetic Data Generation [70.82315853981838]
We propose an efficient alternative for optimal synthetic data generation based on a novel differentiable approximation of the objective.
We demonstrate that the proposed method finds the optimal data distribution faster (up to $50\times$), with significantly reduced training data generation (up to $30\times$) and better accuracy ($+8.7\%$) on real-world test datasets than previous methods.
arXiv Detail & Related papers (2020-08-16T11:36:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.