Multi-granularity Relabeled Under-sampling Algorithm for Imbalanced Data
- URL: http://arxiv.org/abs/2201.03957v1
- Date: Tue, 11 Jan 2022 14:07:55 GMT
- Title: Multi-granularity Relabeled Under-sampling Algorithm for Imbalanced Data
- Authors: Qi Dai, Jian-wei Liu, Yang Liu
- Abstract summary: The imbalanced classification problem is one of the most important and challenging problems in data mining and machine learning.
The Tomek-Link sampling algorithm can effectively reduce class overlap in the data, remove majority instances that are difficult to distinguish, and improve classification accuracy.
However, the Tomek-Links under-sampling algorithm only considers the boundary instances that are the nearest neighbors to each other globally and ignores the potential local overlapping instances.
This paper proposes a multi-granularity relabeled under-sampling algorithm (MGRU) which fully considers the local information of the data set in the
- Score: 15.030895782548576
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The imbalanced classification problem is one of the most important
and challenging problems in data mining and machine learning. The performance
of traditional classifiers is severely affected by many data problems,
such as class imbalance, class overlap, and noise. The Tomek-Link
algorithm was originally proposed only for data cleaning. In recent years,
there have been reports of combining the Tomek-Link algorithm with sampling
techniques. The Tomek-Link sampling algorithm can effectively reduce class
overlap in the data, remove majority instances that are difficult to
distinguish, and improve classification accuracy. However, the
Tomek-Links under-sampling algorithm only considers the boundary instances that
are the nearest neighbors to each other globally and ignores the potential
local overlapping instances. When the number of minority instances is small,
the under-sampling effect is not satisfactory, and the performance improvement
of the classification model is not obvious. Therefore, on the basis of
Tomek-Link, a multi-granularity relabeled under-sampling algorithm (MGRU) is
proposed. This algorithm fully considers the local information of the data set
in the local granularity subspace, and detects the local potential overlapping
instances in the data set. Then, the overlapped majority instances are
eliminated according to the global relabeled index value, which effectively
expands the detection range of Tomek-Links. The simulation results show that
when we select the optimal global relabeled index value for under-sampling, the
classification accuracy and generalization performance of the proposed
under-sampling algorithm are significantly better than those of the baseline
algorithms.
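For orientation, the classic Tomek-Link under-sampling step that the abstract builds on can be sketched as follows. This is a minimal illustration of the baseline only, not the MGRU algorithm proposed in the paper: two instances form a Tomek link when they are each other's nearest neighbor and carry different class labels, and the majority member of each link is removed. The function name `tomek_link_undersample` and the `majority_label` parameter are hypothetical names chosen for this sketch, and the brute-force distance matrix assumes a small data set.

```python
import numpy as np

def tomek_link_undersample(X, y, majority_label):
    """Remove majority instances that take part in a Tomek link.

    Hypothetical helper for illustration: classic Tomek-Link
    under-sampling, not the MGRU method described in the abstract.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Pairwise squared Euclidean distances (brute force, O(n^2) memory).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(dist, np.inf)  # an instance is not its own neighbor

    # Index of the nearest neighbor of every instance.
    nn = dist.argmin(axis=1)
    idx = np.arange(len(X))

    # Tomek link: mutual nearest neighbors with different labels.
    in_link = (nn[nn] == idx) & (y != y[nn])

    # Drop only the majority-class member of each link.
    remove = in_link & (y == majority_label)
    keep = ~remove
    return X[keep], y[keep], np.flatnonzero(remove)
```

On a one-dimensional toy set with points 0.0, 1.0, 3.0, 3.1 in the majority class and 1.2 in the minority class, only the majority point at 1.0 is removed, because it and the minority point at 1.2 are mutual nearest neighbors. Majority instances that overlap the minority class without being mutual nearest neighbors are left untouched, which is the limitation the abstract attributes to Tomek-Links.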
Related papers
- Large-scale Fully-Unsupervised Re-Identification [78.47108158030213]
We propose two strategies to learn from large-scale unlabeled data.
The first strategy performs a local neighborhood sampling to reduce the dataset size in each iteration without violating neighborhood relationships.
A second strategy leverages a novel Re-Ranking technique, which has a lower upper bound on time complexity and reduces the memory complexity from O(n^2) to O(kn) with k ≪ n.
arXiv Detail & Related papers (2023-07-26T16:19:19Z) - Unsupervised anomaly detection algorithms on real-world data: how many
do we need? [1.4610038284393165]
This study is the largest comparison of unsupervised anomaly detection algorithms to date.
On the local datasets the $k$NN ($k$-nearest neighbor) algorithm comes out on top.
On the global datasets the EIF (extended isolation forest) algorithm performs the best.
arXiv Detail & Related papers (2023-05-01T09:27:42Z) - Towards Automated Imbalanced Learning with Deep Hierarchical
Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class.
Over-sampling is an effective technique to tackle imbalanced learning through generating synthetic samples for the minority class.
We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
arXiv Detail & Related papers (2022-08-26T04:28:01Z) - SSDBCODI: Semi-Supervised Density-Based Clustering with Outliers
Detection Integrated [1.8444322599555096]
Clustering analysis is one of the critical tasks in machine learning.
Because the performance of clustering can be significantly eroded by outliers, algorithms try to incorporate the process of outlier detection.
We have proposed SSDBCODI, a semi-supervised detection element.
arXiv Detail & Related papers (2022-08-10T21:06:38Z) - Undersampling is a Minimax Optimal Robustness Intervention in
Nonparametric Classification [28.128464387420216]
We show that learning is fundamentally constrained by a lack of minority group samples.
In particular, in the case of label shift we show that there is always an undersampling algorithm that is minimax optimal.
arXiv Detail & Related papers (2022-05-26T00:35:11Z) - Meta Clustering Learning for Large-scale Unsupervised Person
Re-identification [124.54749810371986]
We propose a "small data for big task" paradigm dubbed Meta Clustering Learning (MCL).
MCL only pseudo-labels a subset of the entire unlabeled data via clustering to save computing for the first-phase training.
Our method significantly saves computational cost while achieving a comparable or even better performance compared to prior works.
arXiv Detail & Related papers (2021-11-19T04:10:18Z) - A Novel Resampling Technique for Imbalanced Dataset Optimization [1.0323063834827415]
Classification of rare events is a common problem in many domains, such as fraudulent transactions, malware traffic analysis and network intrusion detection.
We develop two versions of Generative Silhouette Resampling 1-Nearest Neighbour (G1Nos) oversampling algorithms for dealing with the class imbalance problem.
arXiv Detail & Related papers (2020-12-30T17:17:08Z) - Clustering of Big Data with Mixed Features [3.3504365823045044]
We develop a new clustering algorithm for large data of mixed type.
The algorithm is capable of detecting outliers and clusters of relatively lower density values.
We present experimental results to verify that our algorithm works well in practice.
arXiv Detail & Related papers (2020-11-11T19:54:38Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Differentially Private Clustering: Tight Approximation Ratios [57.89473217052714]
We give efficient differentially private algorithms for basic clustering problems.
Our results imply an improved algorithm for the Sample and Aggregate privacy framework.
One of the tools used in our 1-Cluster algorithm can be employed to get a faster quantum algorithm for ClosestPair in a moderate number of dimensions.
arXiv Detail & Related papers (2020-08-18T16:22:06Z) - Improving Face Recognition by Clustering Unlabeled Faces in the Wild [77.48677160252198]
We propose a novel identity separation method based on extreme value theory.
It greatly reduces the problems caused by overlapping-identity label noise.
Experiments on both controlled and real settings demonstrate our method's consistent improvements.
arXiv Detail & Related papers (2020-07-14T12:26:50Z)