A Novel Resampling Technique for Imbalanced Dataset Optimization
- URL: http://arxiv.org/abs/2012.15231v1
- Date: Wed, 30 Dec 2020 17:17:08 GMT
- Title: A Novel Resampling Technique for Imbalanced Dataset Optimization
- Authors: Ivan Letteri, Antonio Di Cecco, Abeer Dyoub, Giuseppe Della Penna
- Abstract summary: Classification of rare events is a common problem in many domains, such as fraudulent transactions, malware traffic analysis and network intrusion detection.
We develop two versions of Generative Silhouette Resampling 1-Nearest Neighbour (G1Nos) oversampling algorithms for dealing with the class imbalance problem.
- Score: 1.0323063834827415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the enormous amount of data, particular events of interest can still
be quite rare. Classification of rare events is a common problem in many
domains, such as fraudulent transactions, malware traffic analysis and network
intrusion detection. Many studies have been developed for malware detection
using machine learning approaches on various datasets, but as far as we know
only the MTA-KDD'19 dataset has the peculiarity of updating the representative
set of malicious traffic on a daily basis. This daily updating is the added
value of the dataset, but it also exposes the RRw-Optimized MTA-KDD'19 dataset
to a potential class imbalance problem. We capture
difficulties of class distribution in real datasets by considering four types
of minority class examples: safe, borderline, rare and outliers. In this work,
we developed two versions of Generative Silhouette Resampling 1-Nearest
Neighbour (G1Nos) oversampling algorithms for dealing with the class imbalance
problem. The first module of the G1Nos algorithms performs a
silhouette-coefficient-based instance selection, identifying the critical
threshold of Imbalance Degree (ID); the second module generates synthetic
samples using a SMOTE-like
oversampling algorithm. Our G1Nos algorithms balance the classes,
re-establishing the proportions between the two classes of the dataset used.
The experimental results show that our oversampling algorithms work better
than the other two SOTA methodologies across all the metrics considered.
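The abstract describes a two-stage pipeline: silhouette-based selection of minority seed instances, followed by SMOTE-like interpolation between them. A minimal sketch of that general idea is given below; the function name, threshold, fallbacks and use of scikit-learn are assumptions for illustration, not the authors' G1Nos implementation.

```python
import numpy as np
from sklearn.metrics import silhouette_samples
from sklearn.neighbors import NearestNeighbors

def g1nos_like_oversample(X, y, minority_label, sil_threshold=0.0, k=1, seed=0):
    """Hypothetical two-stage oversampler: (1) keep minority instances whose
    silhouette coefficient exceeds a threshold, (2) interpolate SMOTE-style
    between each chosen seed and one of its k nearest seed neighbours."""
    rng = np.random.default_rng(seed)

    # Stage 1: silhouette-coefficient-based instance selection.
    sil = silhouette_samples(X, y)
    minority_mask = (y == minority_label)
    seeds = X[minority_mask & (sil > sil_threshold)]
    if len(seeds) < 2:                      # fall back to all minority points
        seeds = X[minority_mask]

    # Number of synthetic samples needed to re-establish a 1:1 class ratio.
    n_needed = int((~minority_mask).sum() - minority_mask.sum())

    # Stage 2: SMOTE-like interpolation between seeds and their neighbours.
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(seeds))).fit(seeds)
    synthetic = []
    for _ in range(max(n_needed, 0)):
        i = rng.integers(len(seeds))
        _, idx = nn.kneighbors(seeds[i:i + 1])
        j = rng.choice(idx[0][1:]) if len(idx[0]) > 1 else idx[0][0]
        lam = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append(seeds[i] + lam * (seeds[j] - seeds[i]))

    X_new = np.vstack([X, np.array(synthetic)]) if synthetic else X
    y_new = np.concatenate([y, np.full(len(synthetic), minority_label)])
    return X_new, y_new
```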
Related papers
- ROG$_{PL}$: Robust Open-Set Graph Learning via Region-Based Prototype
Learning [52.60434474638983]
We propose a unified framework named ROG$_PL$ to achieve robust open-set learning on complex noisy graph data.
The framework consists of two modules, i.e., denoising via label propagation and open-set prototype learning via regions.
To the best of our knowledge, the proposed ROG$_PL$ is the first robust open-set node classification method for graph data with complex noise.
arXiv Detail & Related papers (2024-02-28T17:25:06Z) - Large-scale Fully-Unsupervised Re-Identification [78.47108158030213]
We propose two strategies to learn from large-scale unlabeled data.
The first strategy performs a local neighborhood sampling to reduce the dataset size in each iteration without violating neighborhood relationships.
A second strategy leverages a novel Re-Ranking technique, which has a lower time upper-bound complexity and reduces the memory complexity from O(n^2) to O(kn) with k << n.
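The O(kn) memory figure is easy to picture: keep only each sample's top-k neighbours instead of the full n x n pairwise distance matrix. The sketch below illustrates only that general idea (scikit-learn usage and sizes are assumed; this is not the paper's Re-Ranking method).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_graph(features: np.ndarray, k: int = 20):
    """Keep only the top-k neighbours per sample: O(k*n) memory instead of
    the O(n^2) full pairwise distance matrix."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
    dist, idx = nn.kneighbors(features)     # shapes (n, k + 1)
    return dist[:, 1:], idx[:, 1:]          # drop the self-match in column 0

# Example: 10k 128-dim embeddings -> 10k x 20 indices instead of a 10k x 10k matrix.
emb = np.random.default_rng(0).standard_normal((10_000, 128)).astype(np.float32)
nbr_dist, nbr_idx = knn_graph(emb, k=20)
```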
arXiv Detail & Related papers (2023-07-26T16:19:19Z) - Evaluating Graph Neural Networks for Link Prediction: Current Pitfalls
and New Benchmarking [66.83273589348758]
Link prediction attempts to predict whether an unseen edge exists based on only a portion of edges of a graph.
A flurry of methods have been introduced in recent years that attempt to make use of graph neural networks (GNNs) for this task.
New and diverse datasets have also been created to better evaluate the effectiveness of these new models.
arXiv Detail & Related papers (2023-06-18T01:58:59Z) - AnoRand: A Semi Supervised Deep Learning Anomaly Detection Method by
Random Labeling [0.0]
Anomaly detection, or more generally outlier detection, is one of the most popular and challenging subjects in theoretical and applied machine learning.
We present a new semi-supervised anomaly detection method called AnoRand, which combines a deep learning architecture with random synthetic label generation.
arXiv Detail & Related papers (2023-05-28T10:53:34Z) - Intra-class Adaptive Augmentation with Neighbor Correction for Deep
Metric Learning [99.14132861655223]
We propose a novel intra-class adaptive augmentation (IAA) framework for deep metric learning.
We reasonably estimate intra-class variations for every class and generate adaptive synthetic samples to support hard samples mining.
Our method significantly outperforms the state-of-the-art methods, improving retrieval performance by 3%-6%.
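As a rough illustration of "estimate intra-class variation, then generate synthetic samples scaled to it", the sketch below uses a per-class Gaussian noise model; the noise model and scale factor are assumptions, not the IAA framework itself.

```python
import numpy as np

def adaptive_augment(X, y, n_per_class=10, scale=0.5, seed=0):
    """Estimate each class's spread and draw synthetic points around random
    class members, with noise proportional to that class's own variation."""
    rng = np.random.default_rng(seed)
    X_aug, y_aug = [X], [y]
    for c in np.unique(y):
        Xc = X[y == c]
        std = Xc.std(axis=0)                               # intra-class variation
        anchors = Xc[rng.integers(len(Xc), size=n_per_class)]
        noise = rng.normal(0.0, scale * std, size=anchors.shape)
        X_aug.append(anchors + noise)
        y_aug.append(np.full(n_per_class, c))
    return np.vstack(X_aug), np.concatenate(y_aug)
```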
arXiv Detail & Related papers (2022-11-29T14:52:38Z) - Towards Automated Imbalanced Learning with Deep Hierarchical
Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class.
Over-sampling is an effective technique to tackle imbalanced learning through generating synthetic samples for the minority class.
We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
arXiv Detail & Related papers (2022-08-26T04:28:01Z) - A Novel Hybrid Sampling Framework for Imbalanced Learning [0.0]
"SMOTE-RUS-NC" has been compared with other state-of-the-art sampling techniques.
Rigorous experimentation has been conducted on 26 imbalanced datasets.
arXiv Detail & Related papers (2022-08-20T07:04:00Z) - Multi-granularity Relabeled Under-sampling Algorithm for Imbalanced Data [15.030895782548576]
The imbalanced classification problem turns out to be one of the important and challenging problems in data mining and machine learning.
The Tomek-Link sampling algorithm can effectively reduce the class overlap on data, remove the majority instances that are difficult to distinguish, and improve the algorithm classification accuracy.
However, the Tomek-Links under-sampling algorithm only considers the boundary instances that are the nearest neighbors to each other globally and ignores the potential local overlapping instances.
This paper proposes a multi-granularity relabeled under-sampling algorithm (MGRU) that fully considers the local information of the data set.
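The Tomek-link rule these methods build on is simple to state in code; below is a minimal global (not multi-granularity) version, assuming NumPy arrays and a scikit-learn nearest-neighbour search.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def remove_tomek_links(X, y, majority_label):
    """A pair (i, j) with different labels that are each other's 1-nearest
    neighbour forms a Tomek link; drop the majority-class member of each pair."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)
    nearest = idx[:, 1]                           # column 0 is the point itself
    to_drop = set()
    for i, j in enumerate(nearest):
        if y[i] != y[j] and nearest[j] == i:      # mutual nearest neighbours
            to_drop.add(i if y[i] == majority_label else int(j))
    keep = np.array([i for i in range(len(X)) if i not in to_drop])
    return X[keep], y[keep]
```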
arXiv Detail & Related papers (2022-01-11T14:07:55Z) - SreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm [60.61943386819384]
Existing implementations of KRR require that all the data is stored in the main memory.
We propose StreaMRAK - a streaming version of KRR.
We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum.
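For context on the memory constraint: classical kernel ridge regression keeps the full n x n kernel matrix in memory, as in the textbook closed form sketched below. This is the in-memory baseline that a streaming variant avoids, not StreaMRAK itself.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise squared distances, then a Gaussian (RBF) kernel.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit_predict(X_train, y_train, X_test, lam=1e-3, gamma=1.0):
    """Closed-form kernel ridge regression: alpha = (K + lam*I)^-1 y.
    The n x n kernel matrix K is why all training data must sit in memory."""
    K = rbf_kernel(X_train, X_train, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)
    return rbf_kernel(X_test, X_train, gamma) @ alpha
```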
arXiv Detail & Related papers (2021-08-23T21:03:09Z) - A Method for Handling Multi-class Imbalanced Data by Geometry based
Information Sampling and Class Prioritized Synthetic Data Generation (GICaPS) [15.433936272310952]
This paper looks into the problem of handling imbalanced data in a multi-label classification problem.
Two novel methods are proposed that exploit the geometric relationship between the feature vectors.
The efficacy of the proposed methods is analyzed by solving a generic multi-class recognition problem.
arXiv Detail & Related papers (2020-10-11T04:04:26Z) - The Integrity of Machine Learning Algorithms against Software Defect
Prediction [0.0]
This report analyses the performance of the Online Sequential Extreme Learning Machine (OS-ELM) proposed by Liang et al.
OS-ELM trains faster than conventional deep neural networks and it always converges to the globally optimal solution.
The analysis is carried out on three projects from the NASA group: KC1, PC4 and PC3.
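Roughly, an ELM fixes random hidden-layer weights and solves only the output weights by least squares; a minimal batch sketch (not Liang et al.'s online/sequential OS-ELM update) might look like the following.

```python
import numpy as np

class TinyELM:
    """Minimal batch Extreme Learning Machine: random hidden weights,
    output weights solved by least squares."""
    def __init__(self, n_hidden=64, seed=0):
        self.n_hidden, self.rng = n_hidden, np.random.default_rng(seed)

    def fit(self, X, y):
        d = X.shape[1]
        self.W = self.rng.standard_normal((d, self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        H = np.tanh(X @ self.W + self.b)                    # random hidden features
        self.beta, *_ = np.linalg.lstsq(H, y, rcond=None)   # output weights only
        return self

    def predict(self, X):
        return np.tanh(X @ self.W + self.b) @ self.beta
```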
arXiv Detail & Related papers (2020-09-05T17:26:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.