Convex space learning improves deep-generative oversampling for tabular
imbalanced classification on smaller datasets
- URL: http://arxiv.org/abs/2206.09812v1
- Date: Mon, 20 Jun 2022 14:42:06 GMT
- Title: Convex space learning improves deep-generative oversampling for tabular
imbalanced classification on smaller datasets
- Authors: Kristian Schultz, Saptarshi Bej, Waldemar Hahn, Markus Wolfien,
Prashant Srivastava, Olaf Wolkenhauer
- Abstract summary: We show that existing deep generative models perform poorly compared to linear approaches generating synthetic samples from the convex space of the minority class.
We propose ConvGeN, a deep generative model combining the ideas of convex space learning and deep generative modelling.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data is commonly stored in tabular format. Several fields of research (e.g.,
biomedicine, fault/fraud detection) are prone to small, imbalanced tabular data.
Supervised machine learning on such data is often difficult due to class
imbalance, which adds further to the challenge posed by the small sample size. Synthetic data generation, i.e.,
oversampling, is a common remedy used to improve classifier performance.
State-of-the-art linear interpolation approaches, such as LoRAS and ProWRAS can
be used to generate synthetic samples from the convex space of the minority
class to improve classifier performance in such cases. Generative Adversarial
Networks (GANs) are common deep learning approaches for synthetic sample
generation. Although GANs are widely used for synthetic image generation, their
use on tabular data in the context of imbalanced classification has not been
adequately explored. In this article, we show that existing deep generative
models perform poorly compared to linear interpolation approaches generating
synthetic samples from the convex space of the minority class, for imbalanced
classification problems on small tabular datasets. We propose ConvGeN, a deep
generative model combining the idea of convex space learning with deep
generative models. ConvGeN learns the coefficients for the convex combinations
of the minority class samples, such that the synthetic data is distinct enough
from the majority class. We demonstrate that our proposed model, ConvGeN,
improves imbalanced classification on such small datasets compared to
existing deep generative models, while being on par with existing linear
interpolation approaches. Moreover, we discuss how our model can be used for
synthetic tabular data generation in general, even outside the scope of data
imbalance, thus improving the overall applicability of convex space learning.
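For intuition, a synthetic sample from the convex space of the minority class has the form $\tilde{x} = \sum_i \alpha_i x_i$ with $\alpha_i \ge 0$ and $\sum_i \alpha_i = 1$. The sketch below illustrates this kind of convex-space oversampling in NumPy; it is a minimal illustration of the general idea, not the authors' LoRAS, ProWRAS, or ConvGeN implementation, and the function name, neighbourhood size k, and Dirichlet weighting are assumptions made here for clarity.

```python
# Minimal sketch of convex-space oversampling (illustrative only).
import numpy as np

def convex_space_oversample(X_min, n_samples, k=5, seed=0):
    """Draw synthetic points as convex combinations of each seed point's
    k nearest minority-class neighbours (the seed itself included)."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n)
    # Pairwise Euclidean distances within the minority class (brute force).
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    nbrs = np.argsort(dists, axis=1)[:, :k]       # k nearest neighbours per point
    seeds = rng.integers(0, n, size=n_samples)    # one random seed per sample
    # Dirichlet weights are non-negative and sum to 1, so each output
    # is a convex combination and stays inside the minority convex space.
    w = rng.dirichlet(np.ones(k), size=n_samples)
    return np.einsum('sk,skd->sd', w, X_min[nbrs[seeds]])

# Toy usage: oversample a four-point 2-D minority class.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(convex_space_oversample(X_min, n_samples=10, k=3).shape)  # (10, 2)
```

ConvGeN, as the abstract describes, replaces the fixed random weights with coefficients learned by a neural generator trained jointly with a discriminator, so that the synthetic batch remains distinct enough from the majority class.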
Related papers
- Synthetic Tabular Data Generation for Class Imbalance and Fairness: A Comparative Study [4.420073761023326]
Due to their data-driven nature, Machine Learning (ML) models are susceptible to bias inherited from data.
Class imbalance (in the classification target) and group imbalance (in protected attributes like sex or race) can undermine both ML utility and fairness.
This paper conducts a comparative analysis to address class and group imbalances using state-of-the-art models.
arXiv Detail & Related papers (2024-09-08T20:08:09Z)
- Convex space learning for tabular synthetic data generation [0.0]
We introduce a deep learning architecture with generator and discriminator components that can generate synthetic samples.
Synthetic samples generated by NextConvGeN can better preserve classification and clustering performance across real and synthetic data.
arXiv Detail & Related papers (2024-07-13T07:07:35Z)
- Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance [16.047084318753377]
Imbalanced data and spurious correlations are common challenges in machine learning and data science.
Oversampling, which artificially increases the number of instances in the underrepresented classes, has been widely adopted to tackle these challenges.
We introduce OPAL, a systematic oversampling approach that leverages the capabilities of large language models to generate high-quality synthetic data for minority groups.
arXiv Detail & Related papers (2024-06-05T21:24:26Z)
- Improving SMOTE via Fusing Conditional VAE for Data-adaptive Noise Filtering [0.5735035463793009]
We introduce a framework to enhance the SMOTE algorithm using Variational Autoencoders (VAE).
Our approach systematically quantifies the density of data points in a low-dimensional latent space using the VAE, simultaneously incorporating information on class labels and classification difficulty.
Empirical studies on several imbalanced datasets show that this simple process improves on the conventional SMOTE algorithm and outperforms deep learning models.
arXiv Detail & Related papers (2024-05-30T07:06:02Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- VTAE: Variational Transformer Autoencoder with Manifolds Learning [144.0546653941249]
Deep generative models have demonstrated successful applications in learning non-linear data distributions through a number of latent variables.
The nonlinearity of the generator implies that the latent space provides an unsatisfactory projection of the data space, which results in poor representation learning.
We show that geodesics and accurate computation can substantially improve the performance of deep generative models.
arXiv Detail & Related papers (2023-04-03T13:13:19Z)
- Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
- Imbalanced Classification via a Tabular Translation GAN [4.864819846886142]
We present a model based on Generative Adversarial Networks which uses additional regularization losses to map majority samples to corresponding synthetic minority samples.
We show that the proposed method improves average precision when compared to alternative re-weighting and oversampling techniques.
arXiv Detail & Related papers (2022-04-19T06:02:53Z)
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)