SMOTE-ENC: A novel SMOTE-based method to generate synthetic data for
nominal and continuous features
- URL: http://arxiv.org/abs/2103.07612v1
- Date: Sat, 13 Mar 2021 04:16:17 GMT
- Title: SMOTE-ENC: A novel SMOTE-based method to generate synthetic data for
nominal and continuous features
- Authors: Mimi Mukherjee and Matloob Khushi
- Abstract summary: We present a novel minority over-sampling method, SMOTE-ENC (SMOTE - Encoded Nominal and Continuous).
Our experiments show that a classification model using the SMOTE-ENC method offers better predictions than one using SMOTE-NC.
Our proposed method addresses one of the major limitations of the SMOTE-NC algorithm.
- Score: 0.38073142980733
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world datasets are often heavily skewed, with some classes
significantly outnumbered by others. In these situations, machine learning
algorithms struggle to predict the under-represented instances accurately. To
address this problem, many variants of the synthetic minority over-sampling
technique (SMOTE) have been proposed to balance datasets with continuous
features. However, for datasets with
both nominal and continuous features, SMOTE-NC is the only SMOTE-based
over-sampling technique to balance the data. In this paper, we present a novel
minority over-sampling method, SMOTE-ENC (SMOTE - Encoded Nominal and
Continuous), in which nominal features are encoded as numeric values and the
difference between two such values reflects the amount of change in
association with the minority class. Our experiments show that a classification
model using the SMOTE-ENC method offers better predictions than one using SMOTE-NC
when the dataset has a substantial number of nominal features and also when
there is some association between the categorical features and the target
class. Additionally, our proposed method addresses one of the major limitations
of the SMOTE-NC algorithm: SMOTE-NC can be applied only to mixed datasets
containing both continuous and nominal features, and cannot function when all
features are nominal. Our method generalizes to both mixed and nominal-only
datasets.
The code is available from mkhushi.github.io
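The core encoding idea can be sketched as follows. This is a simplified illustration, not the authors' exact implementation; the paper's precise scaling of the encoded values may differ. The sketch encodes each nominal category by how far its observed minority-class count deviates from the count expected if the category were independent of the class:

```python
import numpy as np

def encode_nominal(categories, labels, minority_label):
    """Encode each nominal category as a number reflecting its association
    with the minority class (simplified illustration of the SMOTE-ENC idea;
    the scaling used in the paper may differ)."""
    categories = np.asarray(categories)
    labels = np.asarray(labels)
    ir = np.mean(labels == minority_label)  # overall minority-class ratio
    encoding = {}
    for c in np.unique(categories):
        mask = categories == c
        e = mask.sum() * ir                         # expected minority count
        o = np.sum(labels[mask] == minority_label)  # observed minority count
        encoding[c] = (o - e) / np.sqrt(e) if e > 0 else 0.0
    return np.array([encoding[c] for c in categories]), encoding
```

After this encoding, the usual SMOTE interpolation can operate on the now fully numeric feature space, since the distance between two encoded category values is meaningful with respect to the target class.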
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks but fail to capture the correct correlation between features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- Evaluating Graph Neural Networks for Link Prediction: Current Pitfalls and New Benchmarking [66.83273589348758]
Link prediction attempts to predict whether an unseen edge exists based on only a portion of edges of a graph.
A flurry of methods that make use of graph neural networks (GNNs) for this task has been introduced in recent years.
New and diverse datasets have also been created to better evaluate the effectiveness of these new models.
arXiv Detail & Related papers (2023-06-18T01:58:59Z)
- Towards Automated Imbalanced Learning with Deep Hierarchical Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class.
Over-sampling is an effective technique to tackle imbalanced learning through generating synthetic samples for the minority class.
We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
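The over-sampling step these methods build on, classic SMOTE interpolation, can be sketched as follows. This is a minimal illustration, not AutoSMOTE itself; the neighbour search and parameters are simplifying assumptions:

```python
import numpy as np

def smote_sample(minority, n_new, k=5, rng=None):
    """Generate synthetic minority samples by interpolating between a
    randomly chosen minority point and one of its k nearest minority-class
    neighbours (the classic SMOTE scheme, brute-force neighbour search)."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        # distances from point i to every other minority point
        d = np.linalg.norm(minority - minority[i], axis=1)
        d[i] = np.inf  # exclude the point itself
        neighbours = np.argsort(d)[:k]
        j = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)
```

Each synthetic point lies on the line segment between two real minority samples, which is why plain SMOTE only applies to continuous features.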
arXiv Detail & Related papers (2022-08-26T04:28:01Z)
- Solving the Class Imbalance Problem Using a Counterfactual Method for Data Augmentation [4.454557728745761]
Learning from class imbalanced datasets poses challenges for machine learning algorithms.
We advance a novel data augmentation method (adapted from eXplainable AI) that generates synthetic, counterfactual instances in the minority class.
Several experiments using four different classifiers and 25 datasets are reported, which show that this Counterfactual Augmentation method (CFA) generates useful synthetic data points in the minority class.
arXiv Detail & Related papers (2021-11-05T14:14:06Z)
- Gated recurrent units and temporal convolutional network for multilabel classification [122.84638446560663]
This work proposes a new ensemble method for managing multilabel classification.
The core of the proposed approach combines a set of gated recurrent units and temporal convolutional neural networks trained with variants of the Adam gradients optimization approach.
arXiv Detail & Related papers (2021-10-09T00:00:16Z)
- SMOTified-GAN for class imbalanced pattern classification problems [0.41998444721319217]
We propose a novel two-phase oversampling approach that combines the strengths of SMOTE and GAN.
The experimental results show that the sample quality of the minority class(es) is improved across a variety of benchmark datasets.
arXiv Detail & Related papers (2021-08-06T06:14:05Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span-selection task format, used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
- GMOTE: Gaussian based minority oversampling technique for imbalanced classification adapting tail probability of outliers [0.0]
Data-level approaches mainly use oversampling methods, such as the synthetic minority oversampling technique (SMOTE), to solve the problem.
In this paper, we propose a Gaussian-based minority oversampling technique (GMOTE) with a statistical perspective for imbalanced datasets.
When GMOTE is combined with a classification and regression tree (CART) or support vector machine (SVM), it shows better accuracy and F1-score.
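A minimal sketch of Gaussian-based oversampling, assuming synthetic points are drawn from a single Gaussian fitted to the minority class. GMOTE's actual method additionally adapts the tail probability of outliers, which this sketch omits:

```python
import numpy as np

def gaussian_oversample(minority, n_new, rng=None):
    """Draw synthetic minority samples from a multivariate Gaussian fitted
    to the minority class (simplified sketch; GMOTE further adjusts for the
    tail probability of outliers, which is not modelled here)."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    mean = minority.mean(axis=0)
    cov = np.cov(minority, rowvar=False)  # sample covariance of the features
    return rng.multivariate_normal(mean, cov, size=n_new)
```

Unlike interpolation-based SMOTE, sampling from a fitted distribution can place synthetic points outside the convex hull of the observed minority samples.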
arXiv Detail & Related papers (2021-05-09T07:04:37Z)
- Improving Calibration for Long-Tailed Recognition [68.32848696795519]
We propose two methods to improve calibration and performance in such scenarios.
For dataset bias due to different samplers, we propose shifted batch normalization.
Our proposed methods set new records on multiple popular long-tailed recognition benchmark datasets.
arXiv Detail & Related papers (2021-04-01T13:55:21Z)
- A Novel Resampling Technique for Imbalanced Dataset Optimization [1.0323063834827415]
Classification of rare events is a common problem in many domains, such as fraudulent transactions, malware traffic analysis and network intrusion detection.
We develop two versions of the Generative Silhouette Resampling 1-Nearest Neighbour (G1Nos) oversampling algorithm to deal with the class imbalance problem.
arXiv Detail & Related papers (2020-12-30T17:17:08Z)
- Deep Synthetic Minority Over-Sampling Technique [3.3707422585608953]
We adapt the SMOTE idea in deep learning architecture.
Deep SMOTE can outperform traditional SMOTE in terms of precision, F1 score and Area Under Curve (AUC) in the majority of test cases.
arXiv Detail & Related papers (2020-03-22T02:44:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.