A multi-schematic classifier-independent oversampling approach for
imbalanced datasets
- URL: http://arxiv.org/abs/2107.07349v1
- Date: Thu, 15 Jul 2021 14:03:24 GMT
- Title: A multi-schematic classifier-independent oversampling approach for
imbalanced datasets
- Authors: Saptarshi Bej, Kristian Schultz, Prashant Srivastava, Markus Wolfien,
Olaf Wolkenhauer
- Abstract summary: Previous studies show that different oversampling algorithms have different degrees of efficiency with different classifiers.
Here, we overcome this problem with a multi-schematic and classifier-independent oversampling approach: ProWRAS.
ProWRAS integrates the Localized Random Affine Shadowsampling (LoRAS) algorithm and the Proximity Weighted Synthetic oversampling (ProWSyn) algorithm.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Over 85 oversampling algorithms, mostly extensions of the SMOTE
algorithm, have been developed over the past two decades to address the problem
of imbalanced datasets. However, previous studies have shown that different
oversampling algorithms have different degrees of efficiency with different
classifiers. With numerous algorithms available, it is difficult to decide on
an oversampling algorithm for a chosen classifier. Here, we overcome this
problem with a multi-schematic and classifier-independent oversampling
approach: ProWRAS (Proximity Weighted Random Affine Shadowsampling). ProWRAS
integrates the Localized Random Affine Shadowsampling (LoRAS) algorithm and the
Proximity Weighted Synthetic oversampling (ProWSyn) algorithm. By controlling
the variance of the synthetic samples and by clustering the minority class data
with proximity weighting, the ProWRAS algorithm improves performance compared
to algorithms that generate synthetic samples by modelling high-dimensional
convex spaces of the minority class. ProWRAS has four oversampling schemes,
each of which models the variance of the generated data in its own way. Most
importantly, with a proper choice of oversampling scheme, the performance of
ProWRAS is independent of the classifier used. We benchmarked our newly
developed ProWRAS algorithm against five state-of-the-art oversampling models
and four different classifiers on 20 publicly available datasets. ProWRAS
outperforms the other oversampling algorithms in a statistically significant
way, in terms of both F1-score and Kappa-score. Moreover, we introduce a novel
measure of classifier independence, the I-score, and show quantitatively that
ProWRAS performs better independently of the classifier used. In practice,
ProWRAS customizes synthetic sample generation according to a classifier of
choice and thereby reduces benchmarking efforts.
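
To make the mechanism concrete, the following is a minimal Python sketch of
proximity-weighted random affine oversampling in the spirit of the abstract.
It is not the authors' ProWRAS implementation: the function name, the
neighborhood size k, the number of points per convex combination n_conv, and
the inverse-distance proximity weighting are illustrative assumptions.

    import numpy as np

    def prowras_like_oversample(X_min, X_maj, n_samples, k=5, n_conv=3, rng=None):
        """Illustrative sketch only; NOT the authors' ProWRAS implementation."""
        rng = np.random.default_rng(rng)

        # Proximity weighting: minority points closer to the majority class
        # (i.e. near the decision boundary) seed synthetic samples more often.
        dists = np.linalg.norm(X_min[:, None, :] - X_maj[None, :, :], axis=2)
        w = 1.0 / (dists.min(axis=1) + 1e-12)
        w /= w.sum()

        synthetic = []
        for _ in range(n_samples):
            i = rng.choice(len(X_min), p=w)    # proximity-weighted seed point
            d = np.linalg.norm(X_min - X_min[i], axis=1)
            nbrs = np.argsort(d)[:k]           # small local minority neighborhood
            chosen = rng.choice(nbrs, size=n_conv, replace=False)
            # Random convex combination of a few nearby minority points; the
            # Dirichlet weights are non-negative and sum to one.
            alphas = rng.dirichlet(np.ones(n_conv))
            synthetic.append(alphas @ X_min[chosen])
        return np.vstack(synthetic)

Restricting each synthetic point to a convex combination of a few nearby
minority samples keeps the variance of the generated data low and local, which
is the contrast the abstract draws with methods that model high-dimensional
convex spaces of the whole minority class. For example,
prowras_like_oversample(X_min, X_maj, n_samples=len(X_maj) - len(X_min)) would
balance a two-class dataset under these assumptions.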
Related papers
- Scaling LLM Inference with Optimized Sample Compute Allocation [56.524278187351925]
We propose OSCA, an algorithm to find an optimal mix of different inference configurations.
Our experiments show that with our learned mixed allocation, we can achieve accuracy better than the best single configuration.
OSCA is also shown to be effective in agentic workflows beyond single-turn tasks, achieving better accuracy on SWE-Bench with 3x less compute than the default configuration.
arXiv Detail & Related papers (2024-10-29T19:17:55Z)
- INGB: Informed Nonlinear Granular Ball Oversampling Framework for Noisy Imbalanced Classification [23.9207014576848]
In classification problems, datasets are usually imbalanced, noisy, or complex.
This paper proposes an informed nonlinear oversampling framework based on granular balls (INGB) as a new direction for oversampling.
arXiv Detail & Related papers (2023-07-03T01:55:20Z)
- Towards Automated Imbalanced Learning with Deep Hierarchical Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class.
Over-sampling is an effective technique to tackle imbalanced learning by generating synthetic samples for the minority class (a generic SMOTE-style interpolation is sketched after this list).
We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
arXiv Detail & Related papers (2022-08-26T04:28:01Z)
- Does Adversarial Oversampling Help us? [10.210871872870737]
We propose a three-player adversarial game-based end-to-end method to handle class imbalance in datasets.
Rather than adversarial minority oversampling, we propose an adversarial oversampling (AO) and a data-space oversampling (DO) approach.
The effectiveness of our proposed method has been validated with high-dimensional, highly imbalanced and large-scale multi-class datasets.
arXiv Detail & Related papers (2021-08-20T05:43:17Z)
- Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z)
- A Method for Handling Multi-class Imbalanced Data by Geometry based Information Sampling and Class Prioritized Synthetic Data Generation (GICaPS) [15.433936272310952]
This paper addresses the problem of handling imbalanced data in multi-label classification.
Two novel methods are proposed that exploit the geometric relationship between the feature vectors.
The efficacy of the proposed methods is analyzed by solving a generic multi-class recognition problem.
arXiv Detail & Related papers (2020-10-11T04:04:26Z)
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
- A Systematic Characterization of Sampling Algorithms for Open-ended Language Generation [71.31905141672529]
We study the widely adopted ancestral sampling algorithms for auto-regressive language models.
We identify three key properties that are shared among them: entropy reduction, order preservation, and slope preservation.
We find that the set of sampling algorithms that satisfies these properties performs on par with the existing sampling algorithms.
arXiv Detail & Related papers (2020-09-15T17:28:42Z)
- A Comparison of Synthetic Oversampling Methods for Multi-class Text Classification [2.28438857884398]
The authors compare oversampling methods for the problem of multi-class topic classification.
The SMOTE algorithm underlies one of the most popular oversampling methods.
The authors conclude that, for this task, the quality of the KNN and SVM algorithms is more influenced by class imbalance than that of neural networks.
arXiv Detail & Related papers (2020-08-11T11:41:53Z)
- Non-Adaptive Adaptive Sampling on Turnstile Streams [57.619901304728366]
We give the first relative-error algorithms for column subset selection, subspace approximation, projective clustering, and volume on turnstile streams that use space sublinear in $n$.
Our adaptive sampling procedure has a number of applications to various data summarization problems that either improve state-of-the-art or have only been previously studied in the more relaxed row-arrival model.
arXiv Detail & Related papers (2020-04-23T05:00:21Z)
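
As referenced in the AutoSMOTE summary above, several of the listed papers
build on SMOTE-style synthetic oversampling. The following is a minimal,
generic sketch of that idea, not the implementation of any paper listed here;
the neighborhood size k and the linear interpolation scheme are the standard
SMOTE assumptions.

    import numpy as np

    def smote_like_oversample(X_min, n_samples, k=5, rng=None):
        """Generic SMOTE-style interpolation; a sketch, not any paper's code."""
        rng = np.random.default_rng(rng)
        synthetic = []
        for _ in range(n_samples):
            i = rng.integers(len(X_min))                  # random minority seed
            d = np.linalg.norm(X_min - X_min[i], axis=1)
            nbrs = np.argsort(d)[1:k + 1]                 # k nearest neighbors, seed excluded
            j = rng.choice(nbrs)
            lam = rng.random()                            # random position on the segment
            synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
        return np.vstack(synthetic)

Each synthetic point lies on the line segment between a minority sample and
one of its k nearest minority neighbors; the extensions surveyed above mostly
change how seeds and neighbors are chosen.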
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.