A Novel Resampling Technique for Imbalanced Dataset Optimization
- URL: http://arxiv.org/abs/2012.15231v1
- Date: Wed, 30 Dec 2020 17:17:08 GMT
- Title: A Novel Resampling Technique for Imbalanced Dataset Optimization
- Authors: Ivan Letteri, Antonio Di Cecco, Abeer Dyoub, Giuseppe Della Penna
- Abstract summary: Classification of rare events is a common problem in many domains, such as fraudulent transactions, malware traffic analysis and network intrusion detection.
We develop two versions of Generative Silhouette Resampling 1-Nearest Neighbour (G1Nos) oversampling algorithms for dealing with the class imbalance problem.
- Score: 1.0323063834827415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the enormous amount of data, particular events of interest can still
be quite rare. Classification of rare events is a common problem in many
domains, such as fraudulent transactions, malware traffic analysis and network
intrusion detection. Many studies have been developed for malware detection
using machine learning approaches on various datasets, but as far as we know
only the MTA-KDD'19 dataset has the peculiarity of updating the representative
set of malicious traffic on a daily basis. This daily updating is the added
value of the dataset, but it also exposes the RRw-Optimized MTA-KDD'19 dataset
to a potential class imbalance problem. We capture
difficulties of class distribution in real datasets by considering four types
of minority class examples: safe, borderline, rare and outliers. In this work,
we developed two versions of Generative Silhouette Resampling 1-Nearest
Neighbour (G1Nos) oversampling algorithms for dealing with the class imbalance
problem. The first module of the G1Nos algorithms performs a
silhouette-coefficient-based instance selection, identifying the critical
threshold of Imbalance Degree (ID); the second module generates synthetic
samples using a SMOTE-like
oversampling algorithm. Our G1Nos algorithms balance the classes,
re-establishing the proportions between the two classes of the dataset used.
The experimental results show that our oversampling algorithms work better
than the other two SOTA methodologies across all the metrics considered.
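The abstract describes a two-stage pipeline: silhouette-based selection of minority seed instances, followed by SMOTE-like interpolation between them. A minimal sketch of that general idea is given below; the function name, threshold, fallbacks and use of scikit-learn are assumptions for illustration, not the authors' G1Nos implementation.

```python
import numpy as np
from sklearn.metrics import silhouette_samples
from sklearn.neighbors import NearestNeighbors

def g1nos_like_oversample(X, y, minority_label, sil_threshold=0.0, k=1, seed=0):
    """Hypothetical two-stage oversampler: (1) keep minority instances whose
    silhouette coefficient exceeds a threshold, (2) interpolate SMOTE-style
    between each chosen seed and one of its k nearest seed neighbours."""
    rng = np.random.default_rng(seed)

    # Stage 1: silhouette-coefficient-based instance selection.
    sil = silhouette_samples(X, y)
    minority_mask = (y == minority_label)
    seeds = X[minority_mask & (sil > sil_threshold)]
    if len(seeds) < 2:                      # fall back to all minority points
        seeds = X[minority_mask]

    # Number of synthetic samples needed to re-establish a 1:1 class ratio.
    n_needed = int((~minority_mask).sum() - minority_mask.sum())

    # Stage 2: SMOTE-like interpolation between seeds and their neighbours.
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(seeds))).fit(seeds)
    synthetic = []
    for _ in range(max(n_needed, 0)):
        i = rng.integers(len(seeds))
        _, idx = nn.kneighbors(seeds[i:i + 1])
        j = rng.choice(idx[0][1:]) if len(idx[0]) > 1 else idx[0][0]
        lam = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append(seeds[i] + lam * (seeds[j] - seeds[i]))

    X_new = np.vstack([X, np.array(synthetic)]) if synthetic else X
    y_new = np.concatenate([y, np.full(len(synthetic), minority_label)])
    return X_new, y_new
```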
Related papers
- ROG$_{PL}$: Robust Open-Set Graph Learning via Region-Based Prototype
Learning [52.60434474638983]
We propose a unified framework named ROG$_PL$ to achieve robust open-set learning on complex noisy graph data.
The framework consists of two modules, i.e., denoising via label propagation and open-set prototype learning via regions.
To the best of our knowledge, the proposed ROG$_PL$ is the first robust open-set node classification method for graph data with complex noise.
arXiv Detail & Related papers (2024-02-28T17:25:06Z) - Large-scale Fully-Unsupervised Re-Identification [78.47108158030213]
We propose two strategies to learn from large-scale unlabeled data.
The first strategy performs a local neighborhood sampling to reduce the dataset size in each iteration without violating neighborhood relationships.
A second strategy leverages a novel Re-Ranking technique, which has a lower time upper-bound complexity and reduces the memory complexity from O(n^2) to O(kn) with k << n.
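The O(kn) memory figure is easy to picture: keep only each sample's top-k neighbours instead of the full n x n pairwise distance matrix. The sketch below illustrates only that general idea (scikit-learn usage and sizes are assumed; this is not the paper's Re-Ranking method).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_graph(features: np.ndarray, k: int = 20):
    """Keep only the top-k neighbours per sample: O(k*n) memory instead of
    the O(n^2) full pairwise distance matrix."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
    dist, idx = nn.kneighbors(features)     # shapes (n, k + 1)
    return dist[:, 1:], idx[:, 1:]          # drop the self-match in column 0

# Example: 10k 128-dim embeddings -> 10k x 20 indices instead of a 10k x 10k matrix.
emb = np.random.default_rng(0).standard_normal((10_000, 128)).astype(np.float32)
nbr_dist, nbr_idx = knn_graph(emb, k=20)
```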
arXiv Detail & Related papers (2023-07-26T16:19:19Z) - Evaluating Graph Neural Networks for Link Prediction: Current Pitfalls
and New Benchmarking [66.83273589348758]
Link prediction attempts to predict whether an unseen edge exists based on only a portion of edges of a graph.
A flurry of methods have been introduced in recent years that attempt to make use of graph neural networks (GNNs) for this task.
New and diverse datasets have also been created to better evaluate the effectiveness of these new models.
arXiv Detail & Related papers (2023-06-18T01:58:59Z) - AnoRand: A Semi Supervised Deep Learning Anomaly Detection Method by
Random Labeling [0.0]
Anomaly detection, or more generally outlier detection, is one of the most popular and challenging subjects in theoretical and applied machine learning.
We present a new semi-supervised anomaly detection method called AnoRand, which combines a deep learning architecture with random synthetic label generation.
arXiv Detail & Related papers (2023-05-28T10:53:34Z) - Intra-class Adaptive Augmentation with Neighbor Correction for Deep
Metric Learning [99.14132861655223]
We propose a novel intra-class adaptive augmentation (IAA) framework for deep metric learning.
We reasonably estimate intra-class variations for every class and generate adaptive synthetic samples to support hard samples mining.
Our method significantly outperforms the state-of-the-art methods, improving retrieval performance by 3%-6%.
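As a rough illustration of "estimate intra-class variation, then generate synthetic samples scaled to it", the sketch below uses a per-class Gaussian noise model; the noise model and scale factor are assumptions, not the IAA framework itself.

```python
import numpy as np

def adaptive_augment(X, y, n_per_class=10, scale=0.5, seed=0):
    """Estimate each class's spread and draw synthetic points around random
    class members, with noise proportional to that class's own variation."""
    rng = np.random.default_rng(seed)
    X_aug, y_aug = [X], [y]
    for c in np.unique(y):
        Xc = X[y == c]
        std = Xc.std(axis=0)                               # intra-class variation
        anchors = Xc[rng.integers(len(Xc), size=n_per_class)]
        noise = rng.normal(0.0, scale * std, size=anchors.shape)
        X_aug.append(anchors + noise)
        y_aug.append(np.full(n_per_class, c))
    return np.vstack(X_aug), np.concatenate(y_aug)
```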
arXiv Detail & Related papers (2022-11-29T14:52:38Z) - Towards Automated Imbalanced Learning with Deep Hierarchical
Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class.
Over-sampling is an effective technique to tackle imbalanced learning through generating synthetic samples for the minority class.
We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
arXiv Detail & Related papers (2022-08-26T04:28:01Z) - A Novel Hybrid Sampling Framework for Imbalanced Learning [0.0]
"SMOTE-RUS-NC" has been compared with other state-of-the-art sampling techniques.
Rigorous experimentation has been conducted on 26 imbalanced datasets.
arXiv Detail & Related papers (2022-08-20T07:04:00Z) - Multi-granularity Relabeled Under-sampling Algorithm for Imbalanced Data [15.030895782548576]
The imbalanced classification problem turns out to be one of the important and challenging problems in data mining and machine learning.
The Tomek-Link sampling algorithm can effectively reduce the class overlap on data, remove the majority instances that are difficult to distinguish, and improve the algorithm classification accuracy.
However, the Tomek-Links under-sampling algorithm only considers the boundary instances that are the nearest neighbors to each other globally and ignores the potential local overlapping instances.
This paper proposes a multi-granularity relabeled under-sampling algorithm (MGRU) that fully considers the local information of the data set.
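The Tomek-link rule these methods build on is simple to state in code; below is a minimal global (not multi-granularity) version, assuming NumPy arrays and a scikit-learn nearest-neighbour search.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def remove_tomek_links(X, y, majority_label):
    """A pair (i, j) with different labels that are each other's 1-nearest
    neighbour forms a Tomek link; drop the majority-class member of each pair."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)
    nearest = idx[:, 1]                           # column 0 is the point itself
    to_drop = set()
    for i, j in enumerate(nearest):
        if y[i] != y[j] and nearest[j] == i:      # mutual nearest neighbours
            to_drop.add(i if y[i] == majority_label else int(j))
    keep = np.array([i for i in range(len(X)) if i not in to_drop])
    return X[keep], y[keep]
```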
arXiv Detail & Related papers (2022-01-11T14:07:55Z) - SreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm [60.61943386819384]
Existing implementations of KRR require that all the data is stored in the main memory.
We propose StreaMRAK - a streaming version of KRR.
We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum.
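For context on the memory constraint: classical kernel ridge regression keeps the full n x n kernel matrix in memory, as in the textbook closed form sketched below. This is the in-memory baseline that a streaming variant avoids, not StreaMRAK itself.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise squared distances, then a Gaussian (RBF) kernel.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit_predict(X_train, y_train, X_test, lam=1e-3, gamma=1.0):
    """Closed-form kernel ridge regression: alpha = (K + lam*I)^-1 y.
    The n x n kernel matrix K is why all training data must sit in memory."""
    K = rbf_kernel(X_train, X_train, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)
    return rbf_kernel(X_test, X_train, gamma) @ alpha
```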
arXiv Detail & Related papers (2021-08-23T21:03:09Z) - A Method for Handling Multi-class Imbalanced Data by Geometry based
Information Sampling and Class Prioritized Synthetic Data Generation (GICaPS) [15.433936272310952]
This paper looks into the problem of handling imbalanced data in a multi-label classification problem.
Two novel methods are proposed that exploit the geometric relationship between the feature vectors.
The efficacy of the proposed methods is analyzed by solving a generic multi-class recognition problem.
arXiv Detail & Related papers (2020-10-11T04:04:26Z) - The Integrity of Machine Learning Algorithms against Software Defect
Prediction [0.0]
This report analyses the performance of the Online Sequential Extreme Learning Machine (OS-ELM) proposed by Liang et al.
OS-ELM trains faster than conventional deep neural networks and it always converges to the globally optimal solution.
The analysis is carried out on three projects from the NASA group: KC1, PC4 and PC3.
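Roughly, an ELM fixes random hidden-layer weights and solves only the output weights by least squares; a minimal batch sketch (not Liang et al.'s online/sequential OS-ELM update) might look like the following.

```python
import numpy as np

class TinyELM:
    """Minimal batch Extreme Learning Machine: random hidden weights,
    output weights solved by least squares."""
    def __init__(self, n_hidden=64, seed=0):
        self.n_hidden, self.rng = n_hidden, np.random.default_rng(seed)

    def fit(self, X, y):
        d = X.shape[1]
        self.W = self.rng.standard_normal((d, self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        H = np.tanh(X @ self.W + self.b)                    # random hidden features
        self.beta, *_ = np.linalg.lstsq(H, y, rcond=None)   # output weights only
        return self

    def predict(self, X):
        return np.tanh(X @ self.W + self.b) @ self.beta
```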
arXiv Detail & Related papers (2020-09-05T17:26:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.