Synthesising Multi-Modal Minority Samples for Tabular Data
- URL: http://arxiv.org/abs/2105.08204v1
- Date: Mon, 17 May 2021 23:54:08 GMT
- Title: Synthesising Multi-Modal Minority Samples for Tabular Data
- Authors: Sajad Darabi and Yotam Elor
- Abstract summary: Adding synthetic minority samples to the dataset before training is a popular technique to address class imbalance.
We propose a latent space framework which maps the multi-modal samples to a dense continuous latent space.
We show that our framework generates better synthetic data than the existing methods.
- Score: 3.7311680121118345
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world binary classification tasks are in many cases imbalanced, where
the minority class is much smaller than the majority class. This skewness is
challenging for machine learning algorithms as they tend to focus on the
majority and greatly misclassify the minority. Adding synthetic minority
samples to the dataset before training the model is a popular technique to
address this difficulty and is commonly achieved by interpolating minority
samples. Tabular datasets are often multi-modal and contain discrete
(categorical) features in addition to continuous ones, which makes
interpolation of samples non-trivial. To address this, we propose a latent space
interpolation framework which (1) maps the multi-modal samples to a dense
continuous latent space using an autoencoder; (2) applies oversampling by
interpolation in the latent space; and (3) maps the synthetic samples back to
the original feature space. We define metrics to directly evaluate the quality
of the generated minority data and show that our framework generates better
synthetic data than existing methods. Furthermore, the superior synthetic
data yields better prediction quality in downstream binary classification
tasks, as demonstrated in extensive experiments on 27 publicly available
real-world datasets.
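The three-step framework above maps cleanly onto a short implementation. Below is a minimal PyTorch sketch, not the authors' code: the layer sizes are arbitrary, categorical columns are assumed to be one-hot encoded so that each row is a single float vector, and the autoencoder is assumed to have already been trained with a reconstruction loss.

```python
import torch
import torch.nn as nn

class TabularAutoencoder(nn.Module):
    """Maps mixed continuous/one-hot tabular rows to a dense latent space."""
    def __init__(self, input_dim, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def oversample_minority(model, x_min, n_new, k=5):
    """SMOTE-style interpolation performed in the latent space:
    (1) encode the minority rows, (2) interpolate each sampled row with one
    of its k nearest latent neighbours, (3) decode back to feature space."""
    with torch.no_grad():
        z = model.encoder(x_min)                        # (n, latent_dim)
        dist = torch.cdist(z, z)                        # pairwise latent distances
        # topk index 0 is the point itself, so keep indices 1..k
        nn_idx = dist.topk(k + 1, largest=False).indices[:, 1:]
        base = torch.randint(len(z), (n_new,))          # random minority anchors
        nbr = nn_idx[base, torch.randint(k, (n_new,))]  # one random neighbour each
        lam = torch.rand(n_new, 1)                      # interpolation weights
        z_new = z[base] + lam * (z[nbr] - z[base])
        return model.decoder(z_new)                     # synthetic feature rows
```

After decoding, each one-hot block would typically be snapped back to a valid category via argmax before the synthetic rows are appended to the training set.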
Related papers
- AEMLO: AutoEncoder-Guided Multi-Label Oversampling [6.255095509216069]
AEMLO is an AutoEncoder-guided Oversampling technique for imbalanced multi-label data.
We show that AEMLO outperforms the existing state-of-the-art methods with extensive empirical studies.
arXiv Detail & Related papers (2024-08-23T14:01:33Z)
- Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z)
- Self-Guided Generation of Minority Samples Using Diffusion Models [57.319845580050924]
We present a novel approach for generating minority samples that live in low-density regions of a data manifold.
Our framework is built upon diffusion models, leveraging the principle of guided sampling.
Experiments on benchmark real datasets demonstrate that our approach can greatly improve the capability of creating realistic low-likelihood minority instances.
arXiv Detail & Related papers (2024-07-16T10:03:29Z)
- Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to perform well only on similar data, while underperforming on real-world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes (a sketch of this input-space mixing idea appears after this list).
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- Generative Oversampling for Imbalanced Data via Majority-Guided VAE [15.93867386081279]
We propose a novel over-sampling model, called Majority-Guided VAE (MGVAE), which generates new minority samples under the guidance of a majority-based prior.
In this way, the newly generated minority samples can inherit the diversity and richness of the majority ones, thus mitigating overfitting in downstream tasks.
arXiv Detail & Related papers (2023-02-14T06:35:23Z)
- Don't Play Favorites: Minority Guidance for Diffusion Models [59.75996752040651]
We present a novel framework that can make the generation process of diffusion models focus on minority samples.
We develop minority guidance, a sampling technique that can guide the generation process toward regions with desired likelihood levels.
arXiv Detail & Related papers (2023-01-29T03:08:47Z)
- Synthetic-to-Real Domain Generalized Semantic Segmentation for 3D Indoor Point Clouds [69.64240235315864]
This paper introduces the synthetic-to-real domain generalization setting to this task.
The domain gap between synthetic and real-world point cloud data mainly lies in the different layouts and point patterns.
Experiments on the synthetic-to-real benchmark demonstrate that both CINMix and multi-prototypes can narrow the distribution gap.
arXiv Detail & Related papers (2022-12-09T05:07:43Z)
- Undersampling is a Minimax Optimal Robustness Intervention in Nonparametric Classification [28.128464387420216]
We show that learning is fundamentally constrained by a lack of minority group samples.
In particular, in the case of label shift we show that there is always an undersampling algorithm that is minimax optimal.
arXiv Detail & Related papers (2022-05-26T00:35:11Z)
- Imbalanced Classification via a Tabular Translation GAN [4.864819846886142]
We present a model based on Generative Adversarial Networks which uses additional regularization losses to map majority samples to corresponding synthetic minority samples.
We show that the proposed method improves average precision when compared to alternative re-weighting and oversampling techniques.
arXiv Detail & Related papers (2022-04-19T06:02:53Z)
- A Synthetic Over-sampling Method with Minority and Majority Classes for Imbalance Problems [0.0]
We propose a new method to generate synthetic instances using Minority and Majority classes (SOMM).
SOMM generates synthetic instances diversely within the minority data space.
It updates the generated instances adaptively to the neighbourhood including both classes.
arXiv Detail & Related papers (2020-11-09T03:39:56Z)
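Several of the related papers above (e.g. Tackling Diverse Minorities, SOMM) generate synthetic minority rows by mixing samples directly in the input space rather than in a learned latent space. The sketch below illustrates that general family of techniques on purely continuous features; it is not taken from any of the cited papers, and the function name and the `max_lambda` parameter are illustrative assumptions.

```python
import numpy as np

def mixup_oversample(x_minority, x_majority, n_new, max_lambda=0.5, rng=None):
    """Generate synthetic minority rows by mixing each sampled minority row
    with a random majority row. Capping lambda at max_lambda keeps the
    synthetic point close to the minority class so its label stays plausible."""
    if rng is None:
        rng = np.random.default_rng()
    mi = rng.integers(len(x_minority), size=n_new)   # minority anchors
    ma = rng.integers(len(x_majority), size=n_new)   # majority partners
    lam = rng.uniform(0.0, max_lambda, size=(n_new, 1))
    return (1.0 - lam) * x_minority[mi] + lam * x_majority[ma]
```

Drawing the partner row from the minority class instead, and letting lambda range over [0, 1], recovers plain SMOTE-style interpolation between minority samples.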
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.