A multi-schematic classifier-independent oversampling approach for
imbalanced datasets
- URL: http://arxiv.org/abs/2107.07349v1
- Date: Thu, 15 Jul 2021 14:03:24 GMT
- Title: A multi-schematic classifier-independent oversampling approach for
imbalanced datasets
- Authors: Saptarshi Bej, Kristian Schultz, Prashant Srivastava, Markus Wolfien,
Olaf Wolkenhauer
- Abstract summary: Previous studies show that different oversampling algorithms have different degrees of efficiency with different classifiers.
Here, we overcome this problem with a multi-schematic and classifier-independent oversampling approach: ProWRAS.
ProWRAS integrates the Localized Random Affine Shadowsampling (LoRAS) algorithm and the Proximity Weighted Synthetic oversampling (ProWSyn) algorithm.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Over 85 oversampling algorithms, mostly extensions of the SMOTE
algorithm, have been developed over the past two decades to address the problem
of imbalanced datasets. However, previous studies have shown that different
oversampling algorithms have different degrees of efficiency with different
classifiers. With numerous algorithms available, it is difficult to decide on
an oversampling algorithm for a chosen classifier. Here, we overcome this
problem with a multi-schematic and classifier-independent oversampling
approach: ProWRAS (Proximity Weighted Random Affine Shadowsampling). ProWRAS
integrates the Localized Random Affine Shadowsampling (LoRAS) algorithm and the
Proximity Weighted Synthetic oversampling (ProWSyn) algorithm. By controlling
the variance of the synthetic samples and by clustering the minority class data
with proximity weighting, the ProWRAS algorithm improves performance compared
to algorithms that generate synthetic samples by modelling high-dimensional
convex spaces of the minority class. ProWRAS has four oversampling schemes,
each of which models the variance of the generated data in its own way. Most
importantly, with a proper choice of oversampling scheme, the performance of
ProWRAS is independent of the classifier used. We benchmarked our newly
developed ProWRAS algorithm against five state-of-the-art oversampling models
and four different classifiers on 20 publicly available datasets. ProWRAS
outperforms the other oversampling algorithms in a statistically significant
way, in terms of both F1-score and Kappa-score. Moreover, we introduce a novel
measure of classifier independence, the I-score, and show quantitatively that
ProWRAS performs better independently of the classifier used. In practice,
ProWRAS customizes synthetic sample generation according to a classifier of
choice and thereby reduces benchmarking efforts.
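
To make the mechanism concrete, the following is a minimal Python sketch of
proximity-weighted random affine oversampling in the spirit of the abstract.
It is not the authors' ProWRAS implementation: the function name, the
neighborhood size k, the number of points per convex combination n_conv, and
the inverse-distance proximity weighting are illustrative assumptions.

    import numpy as np

    def prowras_like_oversample(X_min, X_maj, n_samples, k=5, n_conv=3, rng=None):
        """Illustrative sketch only; NOT the authors' ProWRAS implementation."""
        rng = np.random.default_rng(rng)

        # Proximity weighting: minority points closer to the majority class
        # (i.e. near the decision boundary) seed synthetic samples more often.
        dists = np.linalg.norm(X_min[:, None, :] - X_maj[None, :, :], axis=2)
        w = 1.0 / (dists.min(axis=1) + 1e-12)
        w /= w.sum()

        synthetic = []
        for _ in range(n_samples):
            i = rng.choice(len(X_min), p=w)    # proximity-weighted seed point
            d = np.linalg.norm(X_min - X_min[i], axis=1)
            nbrs = np.argsort(d)[:k]           # small local minority neighborhood
            chosen = rng.choice(nbrs, size=n_conv, replace=False)
            # Random convex combination of a few nearby minority points; the
            # Dirichlet weights are non-negative and sum to one.
            alphas = rng.dirichlet(np.ones(n_conv))
            synthetic.append(alphas @ X_min[chosen])
        return np.vstack(synthetic)

Restricting each synthetic point to a convex combination of a few nearby
minority samples keeps the variance of the generated data low and local, which
is the contrast the abstract draws with methods that model high-dimensional
convex spaces of the whole minority class. For example,
prowras_like_oversample(X_min, X_maj, n_samples=len(X_maj) - len(X_min)) would
balance a two-class dataset under these assumptions.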
Related papers
- Scaling LLM Inference with Optimized Sample Compute Allocation [56.524278187351925]
We propose OSCA, an algorithm to find an optimal mix of different inference configurations.
Our experiments show that with our learned mixed allocation, we can achieve accuracy better than the best single configuration.
OSCA is also shown to be effective in agentic workflows beyond single-turn tasks, achieving better accuracy on SWE-Bench with 3x less compute than the default configuration.
arXiv Detail & Related papers (2024-10-29T19:17:55Z)
- INGB: Informed Nonlinear Granular Ball Oversampling Framework for Noisy Imbalanced Classification [23.9207014576848]
In classification problems, datasets are usually imbalanced, noisy, or complex.
This paper proposes an informed nonlinear oversampling framework based on granular balls (INGB) as a new direction for oversampling.
arXiv Detail & Related papers (2023-07-03T01:55:20Z)
- Towards Automated Imbalanced Learning with Deep Hierarchical Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class.
Over-sampling is an effective technique to tackle imbalanced learning by generating synthetic samples for the minority class (a generic SMOTE-style interpolation is sketched after this list).
We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
arXiv Detail & Related papers (2022-08-26T04:28:01Z)
- Does Adversarial Oversampling Help us? [10.210871872870737]
We propose a three-player adversarial game-based end-to-end method to handle class imbalance in datasets.
Rather than adversarial minority oversampling, we propose an adversarial oversampling (AO) and a data-space oversampling (DO) approach.
The effectiveness of our proposed method has been validated with high-dimensional, highly imbalanced and large-scale multi-class datasets.
arXiv Detail & Related papers (2021-08-20T05:43:17Z)
- Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z)
- A Method for Handling Multi-class Imbalanced Data by Geometry based Information Sampling and Class Prioritized Synthetic Data Generation (GICaPS) [15.433936272310952]
This paper addresses the problem of handling imbalanced data in multi-label classification.
Two novel methods are proposed that exploit the geometric relationship between the feature vectors.
The efficacy of the proposed methods is analyzed by solving a generic multi-class recognition problem.
arXiv Detail & Related papers (2020-10-11T04:04:26Z)
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
- A Systematic Characterization of Sampling Algorithms for Open-ended Language Generation [71.31905141672529]
We study the widely adopted ancestral sampling algorithms for auto-regressive language models.
We identify three key properties that are shared among them: entropy reduction, order preservation, and slope preservation.
We find that the set of sampling algorithms that satisfies these properties performs on par with the existing sampling algorithms.
arXiv Detail & Related papers (2020-09-15T17:28:42Z)
- A Comparison of Synthetic Oversampling Methods for Multi-class Text Classification [2.28438857884398]
The authors compare oversampling methods for the problem of multi-class topic classification.
The SMOTE algorithm underlies one of the most popular oversampling methods.
The authors conclude that, for this task, the quality of the KNN and SVM algorithms is more influenced by class imbalance than that of neural networks.
arXiv Detail & Related papers (2020-08-11T11:41:53Z)
- Non-Adaptive Adaptive Sampling on Turnstile Streams [57.619901304728366]
We give the first relative-error algorithms for column subset selection, subspace approximation, projective clustering, and volume on turnstile streams that use space sublinear in $n$.
Our adaptive sampling procedure has a number of applications to various data summarization problems that either improve state-of-the-art or have only been previously studied in the more relaxed row-arrival model.
arXiv Detail & Related papers (2020-04-23T05:00:21Z)
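
As referenced in the AutoSMOTE summary above, several of the listed papers
build on SMOTE-style synthetic oversampling. The following is a minimal,
generic sketch of that idea, not the implementation of any paper listed here;
the neighborhood size k and the linear interpolation scheme are the standard
SMOTE assumptions.

    import numpy as np

    def smote_like_oversample(X_min, n_samples, k=5, rng=None):
        """Generic SMOTE-style interpolation; a sketch, not any paper's code."""
        rng = np.random.default_rng(rng)
        synthetic = []
        for _ in range(n_samples):
            i = rng.integers(len(X_min))                  # random minority seed
            d = np.linalg.norm(X_min - X_min[i], axis=1)
            nbrs = np.argsort(d)[1:k + 1]                 # k nearest neighbors, seed excluded
            j = rng.choice(nbrs)
            lam = rng.random()                            # random position on the segment
            synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
        return np.vstack(synthetic)

Each synthetic point lies on the line segment between a minority sample and
one of its k nearest minority neighbors; the extensions surveyed above mostly
change how seeds and neighbors are chosen.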
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.