Evaluating the Utility of GAN Generated Synthetic Tabular Data for Class
Balancing and Low Resource Settings
- URL: http://arxiv.org/abs/2306.13929v1
- Date: Sat, 24 Jun 2023 10:27:08 GMT
- Authors: Nagarjuna Chereddy and Bharath Kumar Bolla
- Abstract summary: The study employed the Generalised Linear Model (GLM) algorithm for class balancing experiments.
In low-resource experiments, models trained on data enhanced with GAN-synthesized data exhibited better recall values than models trained on the original data alone.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The present study aimed to address the issue of imbalanced data in
classification tasks and evaluated the suitability of SMOTE, ADASYN, and GAN
techniques in generating synthetic data to address the class imbalance and
improve the performance of classification models in low-resource settings. The
study employed the Generalised Linear Model (GLM) algorithm for class balancing
experiments and the Random Forest (RF) algorithm for low-resource setting
experiments to assess model performance under varying training data. The recall
metric was the primary evaluation metric for all classification models. The
results of the class balancing experiments showed that the GLM model trained on
GAN-balanced data achieved the highest recall value. Similarly, in the low-resource
experiments, models trained on data enhanced with GAN-synthesized data
exhibited better recall values than models trained on the original data alone. These findings demonstrate
the potential of GAN-generated synthetic data for addressing the challenge of
imbalanced data in classification tasks and improving model performance in
low-resource settings.
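The SMOTE baseline that the study compares against can be illustrated with a short sketch. This is not the paper's actual pipeline; it is a minimal NumPy implementation of SMOTE-style interpolation (new minority samples are placed at random points on the segment between a minority point and one of its k nearest minority neighbours), with the function name and toy data invented for illustration.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE-style interpolation: for each new sample, pick a
    random minority point and one of its k nearest minority neighbours,
    then interpolate at a random position between them."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise distances within the minority class (diagonal excluded).
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]            # k nearest neighbours per point
    base = rng.integers(0, n, size=n_new)        # random base points
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                 # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

# Balance a toy 90/10 imbalanced dataset to 90/90.
data_rng = np.random.default_rng(0)
X_maj = data_rng.normal(0.0, 1.0, size=(90, 2))
X_min = data_rng.normal(3.0, 1.0, size=(10, 2))
X_new = smote_oversample(X_min, n_new=80, rng=1)
X_bal = np.vstack([X_maj, X_min, X_new])
y_bal = np.array([0] * 90 + [1] * 90)
print(X_bal.shape)  # (180, 2)
```

In practice one would use `imblearn.over_sampling.SMOTE` (or `ADASYN`) from the imbalanced-learn library rather than hand-rolling the interpolation; the sketch above only makes the mechanism concrete.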
Related papers
- Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance.
We introduce novel algorithms for dynamic, instance-level data reweighting.
Our framework allows us to devise reweighting strategies deprioritizing redundant or uninformative data.
arXiv Detail & Related papers (2025-02-10T17:57:15Z) - Synthetic Feature Augmentation Improves Generalization Performance of Language Models [8.463273762997398]
Training and fine-tuning deep learning models on limited and imbalanced datasets poses substantial challenges.
We propose augmenting features in the embedding space by generating synthetic samples using a range of techniques.
We validate the effectiveness of this approach across multiple open-source text classification benchmarks.
arXiv Detail & Related papers (2025-01-11T04:31:18Z) - Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z) - Improving SMOTE via Fusing Conditional VAE for Data-adaptive Noise Filtering [0.5735035463793009]
We introduce a framework to enhance the SMOTE algorithm using Variational Autoencoders (VAE).
Our approach systematically quantifies the density of data points in a low-dimensional latent space using the VAE, simultaneously incorporating information on class labels and classification difficulty.
Empirical studies on several imbalanced datasets show that this simple process improves on the conventional SMOTE algorithm as well as on deep learning models.
arXiv Detail & Related papers (2024-05-30T07:06:02Z) - Synthetic Information towards Maximum Posterior Ratio for deep learning
on Imbalanced Data [1.7495515703051119]
We propose a technique for data balancing by generating synthetic data for the minority class.
Our method prioritizes balancing the informative regions by identifying high entropy samples.
Our experimental results on forty-one datasets demonstrate the superior performance of our technique.
arXiv Detail & Related papers (2024-01-05T01:08:26Z) - TRIAGE: Characterizing and auditing training data for improved
regression [80.11415390605215]
We introduce TRIAGE, a novel data characterization framework tailored to regression tasks and compatible with a broad class of regressors.
TRIAGE utilizes conformal predictive distributions to provide a model-agnostic scoring method, the TRIAGE score.
We show that TRIAGE's characterization is consistent and highlight its utility to improve performance via data sculpting/filtering, in multiple regression settings.
arXiv Detail & Related papers (2023-10-29T10:31:59Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z) - Semi-Supervised Learning Based on Reference Model for Low-resource TTS [32.731900584216724]
We propose a semi-supervised learning method for neural TTS in which labeled target data is limited.
Experimental results show that our proposed semi-supervised learning scheme with limited target data significantly improves voice quality on test data, achieving naturalness and robustness in speech synthesis.
arXiv Detail & Related papers (2022-10-25T07:48:07Z) - FairIF: Boosting Fairness in Deep Learning via Influence Functions with
Validation Set Sensitive Attributes [51.02407217197623]
We propose a two-stage training algorithm named FAIRIF.
It minimizes the loss over a reweighted data set, where the sample weights are computed using influence functions on a validation set.
We show that FAIRIF yields models with better fairness-utility trade-offs against various types of bias.
arXiv Detail & Related papers (2022-01-15T05:14:48Z) - Sampling To Improve Predictions For Underrepresented Observations In
Imbalanced Data [0.0]
Data imbalance negatively impacts the predictive performance of models on underrepresented observations.
We propose sampling to adjust for this imbalance with the goal of improving the performance of models trained on historical production data.
We apply our methods on a large biopharmaceutical manufacturing data set from an advanced simulation of penicillin production.
arXiv Detail & Related papers (2021-11-17T12:16:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.