A Method for Handling Multi-class Imbalanced Data by Geometry based
Information Sampling and Class Prioritized Synthetic Data Generation (GICaPS)
- URL: http://arxiv.org/abs/2010.05155v1
- Date: Sun, 11 Oct 2020 04:04:26 GMT
- Authors: Anima Majumder, Samrat Dutta, Swagat Kumar, Laxmidhar Behera
- Abstract summary: This paper looks into the problem of handling imbalanced data in multi-label classification.
Two novel methods are proposed that exploit the geometric relationship between the feature vectors.
The efficacy of the proposed methods is analyzed by solving a generic multi-class recognition problem.
- Score: 15.433936272310952
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper looks into the problem of handling imbalanced data in
multi-label classification. The problem is solved by proposing two
novel methods that primarily exploit the geometric relationship between the
feature vectors. The first one is an undersampling algorithm that uses the
angle between feature vectors to select more informative samples while rejecting the
less informative ones. A suitable criterion is proposed to define the
informativeness of a given sample. The second one is an oversampling algorithm
that uses a generative algorithm to create new synthetic data that respects all
class boundaries. This is achieved by finding \emph{no man's land} based on
Euclidean distance between the feature vectors. The efficacy of the proposed
methods is analyzed by solving a generic multi-class recognition problem based
on mixture of Gaussians. The superiority of the proposed algorithms is
established through comparison with other state-of-the-art methods, including
SMOTE and ADASYN, over ten different publicly available datasets exhibiting
high-to-extreme data imbalance. These two methods are combined into a single
data processing framework labeled ``GICaPS'' to highlight the role of
geometry-based information (GI) sampling and Class-Prioritized Synthesis (CaPS)
in dealing with the multi-class data imbalance problem, thereby making a novel
contribution in this field.
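The abstract's two components can be pictured with a minimal, illustrative NumPy sketch. This is not the authors' actual criteria: `angle_undersample` keeps a sample only when its direction differs enough from already-kept samples (a stand-in for the angle-based informativeness criterion), and `no_mans_land_point` places a synthetic point midway between a minority sample and its Euclidean-nearest majority neighbour (a crude stand-in for the paper's "no man's land" construction). All function names and thresholds here are hypothetical.

```python
import numpy as np

def angle_undersample(X, cos_threshold=0.95):
    """Greedy angle-based undersampling (illustrative only): keep a
    sample unless its feature vector is nearly parallel to one that is
    already kept, i.e. unless it carries little new directional
    information."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    kept = [0]
    for i in range(1, len(X)):
        cos = Xn[kept] @ Xn[i]            # cosines to all kept vectors
        if np.max(np.abs(cos)) < cos_threshold:
            kept.append(i)
    return X[kept]

def no_mans_land_point(x_minority, X_majority):
    """Synthesize a point midway between a minority sample and its
    Euclidean-nearest majority sample -- a crude stand-in for placing
    synthetic data in the gap between class boundaries."""
    d = np.linalg.norm(X_majority - x_minority, axis=1)
    nearest = X_majority[np.argmin(d)]
    return 0.5 * (x_minority + nearest)
```

In this toy reading, near-parallel feature vectors are treated as redundant and rejected, while synthesis stays clear of both classes by targeting the midpoint of the inter-class gap.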
Related papers
- A Bilevel Optimization Framework for Imbalanced Data Classification [1.6385815610837167]
We propose a new undersampling approach that avoids the pitfalls of noise and overlap caused by synthetic data.
Instead of undersampling majority data randomly, our method undersamples datapoints based on their ability to improve model loss.
Using improved model loss as a proxy measurement for classification performance, our technique assesses a datapoint's impact on loss and rejects those unable to improve it.
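One way to picture the loss-impact idea is a toy proxy in logistic-regression terms (not the paper's bilevel formulation; the function and its parameters are hypothetical): score each majority point by the norm of its per-sample loss gradient at the current weights and drop near-zero-gradient points, which cannot move the loss.

```python
import numpy as np

def select_by_loss_impact(X_maj, y_maj, w, keep_ratio=0.5):
    """Toy proxy for loss-impact undersampling: keep the majority
    points whose per-sample logistic-loss gradient at weights w is
    largest; points with a ~0 gradient cannot improve the loss and are
    dropped first."""
    z = X_maj @ w
    p = 1.0 / (1.0 + np.exp(-z))          # predicted P(y = 1)
    grads = (p - y_maj)[:, None] * X_maj  # per-sample loss gradient
    scores = np.linalg.norm(grads, axis=1)
    k = max(1, int(keep_ratio * len(X_maj)))
    keep = np.argsort(scores)[-k:]        # highest-impact samples
    return X_maj[keep], y_maj[keep]
```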
arXiv Detail & Related papers (2024-10-15T01:17:23Z)
- Projection based fuzzy least squares twin support vector machine for class imbalance problems [0.9668407688201361]
We propose a novel fuzzy-based approach to deal with class-imbalanced as well as noisy datasets.
The proposed algorithms are evaluated on several benchmark and synthetic datasets.
arXiv Detail & Related papers (2023-09-27T14:28:48Z)
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct a distance matrix between data points using a Butterworth filter.
To well exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- Towards Automated Imbalanced Learning with Deep Hierarchical Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class.
Over-sampling is an effective technique for tackling imbalanced learning by generating synthetic samples for the minority class.
We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
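The classical SMOTE step (Chawla et al., 2002) that this line of work builds on interpolates between a minority point and one of its minority-class nearest neighbours; a minimal sketch, with hypothetical function name and defaults:

```python
import numpy as np

def smote_sample(X_min, k=2, rng=None):
    """One SMOTE-style synthetic sample: pick a minority point, pick
    one of its k nearest minority neighbours, and interpolate at a
    random position on the segment between them."""
    if rng is None:
        rng = np.random.default_rng(0)
    i = rng.integers(len(X_min))
    d = np.linalg.norm(X_min - X_min[i], axis=1)
    nbrs = np.argsort(d)[1:k + 1]     # exclude the point itself
    j = rng.choice(nbrs)
    lam = rng.random()                # interpolation factor in [0, 1)
    return X_min[i] + lam * (X_min[j] - X_min[i])
```

AutoSMOTE's contribution, per the summary, is learning such sampling decisions jointly with hierarchical reinforcement learning rather than fixing them by hand.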
arXiv Detail & Related papers (2022-08-26T04:28:01Z)
- Riemannian classification of EEG signals with missing values [67.90148548467762]
This paper proposes two strategies to handle missing data for the classification of electroencephalograms.
The first approach estimates the covariance from imputed data with the $k$-nearest neighbors algorithm; the second relies on the observed data by leveraging the observed-data likelihood within an expectation-maximization algorithm.
As results show, the proposed strategies perform better than classification based on observed data alone and maintain high accuracy even as the missing-data ratio increases.
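A minimal sketch of the first strategy's idea, under stated assumptions (numpy-only, Euclidean distance over the row's observed coordinates, neighbour rows fully observed; not the paper's exact pipeline):

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill each NaN with the mean of that feature over the k fully
    observed rows nearest in the row's observed coordinates
    (illustrative sketch; assumes at least k complete rows exist)."""
    X = X.copy()
    complete = ~np.isnan(X).any(axis=1)
    for i in range(len(X)):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        obs = ~miss
        d = np.full(len(X), np.inf)
        d[complete] = np.linalg.norm(X[complete][:, obs] - X[i, obs], axis=1)
        nbrs = np.argsort(d)[:k]          # k nearest complete rows
        X[i, miss] = X[nbrs][:, miss].mean(axis=0)
    return X
```

A covariance estimate would then follow from the imputed data, e.g. `np.cov(knn_impute(X), rowvar=False)`.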
arXiv Detail & Related papers (2021-10-19T14:24:50Z)
- Auto-weighted Multi-view Feature Selection with Graph Optimization [90.26124046530319]
We propose a novel unsupervised multi-view feature selection model based on graph learning.
The contributions are threefold: (1) during the feature selection procedure, the consensus similarity graph shared by different views is learned.
Experiments on various datasets demonstrate the superiority of the proposed method compared with the state-of-the-art methods.
arXiv Detail & Related papers (2021-04-11T03:25:25Z)
- Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method by combining reconstruction error with $l_{2,p}$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
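For reference, the $l_{2,p}$ norm used in this line of work is standardly defined on a weight matrix $W \in \mathbb{R}^{d \times m}$ with rows $w^i$ as:

```latex
\|W\|_{2,p} = \left( \sum_{i=1}^{d} \|w^i\|_2^p \right)^{1/p}
            = \left( \sum_{i=1}^{d} \Big( \sum_{j=1}^{m} w_{ij}^2 \Big)^{p/2} \right)^{1/p},
\qquad 0 < p \le 1,
```

where choosing $p < 1$ drives whole rows of $W$ to zero, which is what makes the regularizer suitable for selecting features.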
arXiv Detail & Related papers (2020-12-29T04:08:38Z)
- Handling Imbalanced Data: A Case Study for Binary Class Problems [0.0]
A major issue in solving classification problems is imbalanced data.
This paper focuses on synthetic oversampling techniques and manually computes synthetic data points to make the algorithms easier to follow.
We analyze the application of these synthetic oversampling techniques to binary classification problems with different imbalance ratios and sample sizes.
arXiv Detail & Related papers (2020-10-09T02:04:14Z)
- The Integrity of Machine Learning Algorithms against Software Defect Prediction [0.0]
This report analyses the performance of the Online Sequential Extreme Learning Machine (OS-ELM) proposed by Liang et al.
OS-ELM trains faster than conventional deep neural networks and always converges to the globally optimal solution.
The analysis is carried out on three NASA projects: KC1, PC4 and PC3.
arXiv Detail & Related papers (2020-09-05T17:26:56Z)
- A Comparison of Synthetic Oversampling Methods for Multi-class Text Classification [2.28438857884398]
The authors compare oversampling methods for the problem of multi-class topic classification.
The SMOTE algorithm underlies one of the most popular oversampling methods.
The authors conclude that for this task, the KNN and SVM algorithms are more affected by class imbalance than neural networks are.
arXiv Detail & Related papers (2020-08-11T11:41:53Z)
- Learning while Respecting Privacy and Robustness to Distributional Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model.
The objective is to endow the trained model with robustness against adversarially manipulated input data.
Proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.