Synthesising Multi-Modal Minority Samples for Tabular Data
- URL: http://arxiv.org/abs/2105.08204v1
- Date: Mon, 17 May 2021 23:54:08 GMT
- Title: Synthesising Multi-Modal Minority Samples for Tabular Data
- Authors: Sajad Darabi and Yotam Elor
- Abstract summary: Adding synthetic minority samples to the dataset before training is a popular technique to address class imbalance.
We propose a latent space framework which maps the multi-modal samples to a dense continuous latent space.
We show that our framework generates better synthetic data than the existing methods.
- Score: 3.7311680121118345
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world binary classification tasks are in many cases imbalanced, where
the minority class is much smaller than the majority class. This skewness is
challenging for machine learning algorithms as they tend to focus on the
majority and greatly misclassify the minority. Adding synthetic minority
samples to the dataset before training the model is a popular technique to
address this difficulty and is commonly achieved by interpolating minority
samples. Tabular datasets are often multi-modal and contain discrete
(categorical) features in addition to continuous ones, which makes
interpolation of samples non-trivial. To address this, we propose a latent space
interpolation framework which (1) maps the multi-modal samples to a dense
continuous latent space using an autoencoder; (2) applies oversampling by
interpolation in the latent space; and (3) maps the synthetic samples back to
the original feature space. We define metrics to directly evaluate the quality
of the generated minority data and show that our framework generates better
synthetic data than existing methods. Furthermore, the superior synthetic
data yields better prediction quality in downstream binary classification
tasks, as demonstrated in extensive experiments on 27 publicly available
real-world datasets.
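The three-step framework above maps cleanly onto a short implementation. Below is a minimal PyTorch sketch, not the authors' code: the layer sizes are arbitrary, categorical columns are assumed to be one-hot encoded so that each row is a single float vector, and the autoencoder is assumed to have already been trained with a reconstruction loss.

```python
import torch
import torch.nn as nn

class TabularAutoencoder(nn.Module):
    """Maps mixed continuous/one-hot tabular rows to a dense latent space."""
    def __init__(self, input_dim, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def oversample_minority(model, x_min, n_new, k=5):
    """SMOTE-style interpolation performed in the latent space:
    (1) encode the minority rows, (2) interpolate each sampled row with one
    of its k nearest latent neighbours, (3) decode back to feature space."""
    with torch.no_grad():
        z = model.encoder(x_min)                        # (n, latent_dim)
        dist = torch.cdist(z, z)                        # pairwise latent distances
        # topk index 0 is the point itself, so keep indices 1..k
        nn_idx = dist.topk(k + 1, largest=False).indices[:, 1:]
        base = torch.randint(len(z), (n_new,))          # random minority anchors
        nbr = nn_idx[base, torch.randint(k, (n_new,))]  # one random neighbour each
        lam = torch.rand(n_new, 1)                      # interpolation weights
        z_new = z[base] + lam * (z[nbr] - z[base])
        return model.decoder(z_new)                     # synthetic feature rows
```

After decoding, each one-hot block would typically be snapped back to a valid category via argmax before the synthetic rows are appended to the training set.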
Related papers
- AEMLO: AutoEncoder-Guided Multi-Label Oversampling [6.255095509216069]
AEMLO is an AutoEncoder-guided Oversampling technique for imbalanced multi-label data.
We show that AEMLO outperforms the existing state-of-the-art methods with extensive empirical studies.
arXiv Detail & Related papers (2024-08-23T14:01:33Z)
- Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z)
- Self-Guided Generation of Minority Samples Using Diffusion Models [57.319845580050924]
We present a novel approach for generating minority samples that live in low-density regions of a data manifold.
Our framework is built upon diffusion models, leveraging the principle of guided sampling.
Experiments on benchmark real datasets demonstrate that our approach can greatly improve the capability of creating realistic low-likelihood minority instances.
arXiv Detail & Related papers (2024-07-16T10:03:29Z)
- Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to perform well only on similar data, while underperforming on real-world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes (a sketch of this input-space mixing idea appears after this list).
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- Generative Oversampling for Imbalanced Data via Majority-Guided VAE [15.93867386081279]
We propose a novel over-sampling model, called Majority-Guided VAE (MGVAE), which generates new minority samples under the guidance of a majority-based prior.
In this way, the newly generated minority samples can inherit the diversity and richness of the majority ones, thus mitigating overfitting in downstream tasks.
arXiv Detail & Related papers (2023-02-14T06:35:23Z)
- Don't Play Favorites: Minority Guidance for Diffusion Models [59.75996752040651]
We present a novel framework that can make the generation process of diffusion models focus on minority samples.
We develop minority guidance, a sampling technique that can guide the generation process toward regions with desired likelihood levels.
arXiv Detail & Related papers (2023-01-29T03:08:47Z)
- Synthetic-to-Real Domain Generalized Semantic Segmentation for 3D Indoor Point Clouds [69.64240235315864]
This paper introduces the synthetic-to-real domain generalization setting to this task.
The domain gap between synthetic and real-world point cloud data mainly lies in the different layouts and point patterns.
Experiments on the synthetic-to-real benchmark demonstrate that both CINMix and multi-prototypes can narrow the distribution gap.
arXiv Detail & Related papers (2022-12-09T05:07:43Z)
- Undersampling is a Minimax Optimal Robustness Intervention in Nonparametric Classification [28.128464387420216]
We show that learning is fundamentally constrained by a lack of minority group samples.
In particular, in the case of label shift we show that there is always an undersampling algorithm that is minimax optimal.
arXiv Detail & Related papers (2022-05-26T00:35:11Z)
- Imbalanced Classification via a Tabular Translation GAN [4.864819846886142]
We present a model based on Generative Adversarial Networks which uses additional regularization losses to map majority samples to corresponding synthetic minority samples.
We show that the proposed method improves average precision when compared to alternative re-weighting and oversampling techniques.
arXiv Detail & Related papers (2022-04-19T06:02:53Z)
- A Synthetic Over-sampling Method with Minority and Majority Classes for Imbalance Problems [0.0]
We propose a new method to generate synthetic instances using Minority and Majority classes (SOMM).
SOMM generates synthetic instances diversely within the minority data space.
It updates the generated instances adaptively to the neighbourhood including both classes.
arXiv Detail & Related papers (2020-11-09T03:39:56Z)
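Several of the related papers above (e.g. Tackling Diverse Minorities, SOMM) generate synthetic minority rows by mixing samples directly in the input space rather than in a learned latent space. The sketch below illustrates that general family of techniques on purely continuous features; it is not taken from any of the cited papers, and the function name and the `max_lambda` parameter are illustrative assumptions.

```python
import numpy as np

def mixup_oversample(x_minority, x_majority, n_new, max_lambda=0.5, rng=None):
    """Generate synthetic minority rows by mixing each sampled minority row
    with a random majority row. Capping lambda at max_lambda keeps the
    synthetic point close to the minority class so its label stays plausible."""
    if rng is None:
        rng = np.random.default_rng()
    mi = rng.integers(len(x_minority), size=n_new)   # minority anchors
    ma = rng.integers(len(x_majority), size=n_new)   # majority partners
    lam = rng.uniform(0.0, max_lambda, size=(n_new, 1))
    return (1.0 - lam) * x_minority[mi] + lam * x_majority[ma]
```

Drawing the partner row from the minority class instead, and letting lambda range over [0, 1], recovers plain SMOTE-style interpolation between minority samples.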
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.