Convex space learning improves deep-generative oversampling for tabular
imbalanced classification on smaller datasets
- URL: http://arxiv.org/abs/2206.09812v1
- Date: Mon, 20 Jun 2022 14:42:06 GMT
- Title: Convex space learning improves deep-generative oversampling for tabular
imbalanced classification on smaller datasets
- Authors: Kristian Schultz, Saptarshi Bej, Waldemar Hahn, Markus Wolfien,
Prashant Srivastava, Olaf Wolkenhauer
- Abstract summary: We show that existing deep generative models perform poorly compared to linear approaches generating synthetic samples from the convex space of the minority class.
We propose ConvGeN, a deep generative model combining the ideas of convex space learning and deep generative modelling.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data is commonly stored in tabular format. Several fields of research (e.g.,
biomedicine, fault/fraud detection) are prone to small, imbalanced tabular data.
Supervised machine learning on such data is often difficult due to class
imbalance, which adds further to the challenge posed by the small sample size. Synthetic data generation, i.e.,
oversampling, is a common remedy used to improve classifier performance.
State-of-the-art linear interpolation approaches, such as LoRAS and ProWRAS can
be used to generate synthetic samples from the convex space of the minority
class to improve classifier performance in such cases. Generative Adversarial
Networks (GANs) are common deep learning approaches for synthetic sample
generation. Although GANs are widely used for synthetic image generation, their
use on tabular data in the context of imbalanced classification has not been
adequately explored. In this article, we show that existing deep generative
models perform poorly compared to linear interpolation approaches generating
synthetic samples from the convex space of the minority class, for imbalanced
classification problems on small tabular datasets. We propose ConvGeN, a deep
generative model combining the idea of convex space learning with deep
generative models. ConvGeN learns the coefficients for the convex combinations
of the minority class samples, such that the synthetic data is distinct enough
from the majority class. We demonstrate that our proposed model, ConvGeN,
improves imbalanced classification on such small datasets compared to
existing deep generative models, while being on par with existing linear
interpolation approaches. Moreover, we discuss how our model can be used for
synthetic tabular data generation in general, even outside the scope of data
imbalance, thus improving the overall applicability of convex space learning.
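For intuition, a synthetic sample from the convex space of the minority class has the form $\tilde{x} = \sum_i \alpha_i x_i$ with $\alpha_i \ge 0$ and $\sum_i \alpha_i = 1$. The sketch below illustrates this kind of convex-space oversampling in NumPy; it is a minimal illustration of the general idea, not the authors' LoRAS, ProWRAS, or ConvGeN implementation, and the function name, neighbourhood size k, and Dirichlet weighting are assumptions made here for clarity.

```python
# Minimal sketch of convex-space oversampling (illustrative only).
import numpy as np

def convex_space_oversample(X_min, n_samples, k=5, seed=0):
    """Draw synthetic points as convex combinations of each seed point's
    k nearest minority-class neighbours (the seed itself included)."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n)
    # Pairwise Euclidean distances within the minority class (brute force).
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    nbrs = np.argsort(dists, axis=1)[:, :k]       # k nearest neighbours per point
    seeds = rng.integers(0, n, size=n_samples)    # one random seed per sample
    # Dirichlet weights are non-negative and sum to 1, so each output
    # is a convex combination and stays inside the minority convex space.
    w = rng.dirichlet(np.ones(k), size=n_samples)
    return np.einsum('sk,skd->sd', w, X_min[nbrs[seeds]])

# Toy usage: oversample a four-point 2-D minority class.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(convex_space_oversample(X_min, n_samples=10, k=3).shape)  # (10, 2)
```

ConvGeN, as the abstract describes, replaces the fixed random weights with coefficients learned by a neural generator trained jointly with a discriminator, so that the synthetic batch remains distinct enough from the majority class.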
Related papers
- Synthetic Tabular Data Generation for Class Imbalance and Fairness: A Comparative Study [4.420073761023326]
Due to their data-driven nature, Machine Learning (ML) models are susceptible to bias inherited from data.
Class imbalance (in the classification target) and group imbalance (in protected attributes like sex or race) can undermine both ML utility and fairness.
This paper conducts a comparative analysis to address class and group imbalances using state-of-the-art models.
arXiv Detail & Related papers (2024-09-08T20:08:09Z)
- Convex space learning for tabular synthetic data generation [0.0]
We introduce a deep learning architecture with generator and discriminator components that can generate synthetic samples.
Synthetic samples generated by NextConvGeN can better preserve classification and clustering performance across real and synthetic data.
arXiv Detail & Related papers (2024-07-13T07:07:35Z)
- Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance [16.047084318753377]
Imbalanced data and spurious correlations are common challenges in machine learning and data science.
Oversampling, which artificially increases the number of instances in the underrepresented classes, has been widely adopted to tackle these challenges.
We introduce OPAL, a systematic oversampling approach that leverages the capabilities of large language models to generate high-quality synthetic data for minority groups.
arXiv Detail & Related papers (2024-06-05T21:24:26Z)
- Improving SMOTE via Fusing Conditional VAE for Data-adaptive Noise Filtering [0.5735035463793009]
We introduce a framework to enhance the SMOTE algorithm using Variational Autoencoders (VAE).
Our approach systematically quantifies the density of data points in a low-dimensional latent space using the VAE, simultaneously incorporating information on class labels and classification difficulty.
Empirical studies on several imbalanced datasets show that this simple process improves on the conventional SMOTE algorithm and outperforms deep learning models.
arXiv Detail & Related papers (2024-05-30T07:06:02Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- VTAE: Variational Transformer Autoencoder with Manifolds Learning [144.0546653941249]
Deep generative models have demonstrated successful applications in learning non-linear data distributions through a number of latent variables.
The nonlinearity of the generator implies that the latent space provides an unsatisfactory projection of the data space, which results in poor representation learning.
We show that geodesics and accurate computation can substantially improve the performance of deep generative models.
arXiv Detail & Related papers (2023-04-03T13:13:19Z)
- Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
- Imbalanced Classification via a Tabular Translation GAN [4.864819846886142]
We present a model based on Generative Adversarial Networks which uses additional regularization losses to map majority samples to corresponding synthetic minority samples.
We show that the proposed method improves average precision when compared to alternative re-weighting and oversampling techniques.
arXiv Detail & Related papers (2022-04-19T06:02:53Z)
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)