Imbalanced Big Data Oversampling: Taxonomy, Algorithms, Software,
Guidelines and Future Directions
- URL: http://arxiv.org/abs/2107.11508v1
- Date: Sat, 24 Jul 2021 01:49:46 GMT
- Title: Imbalanced Big Data Oversampling: Taxonomy, Algorithms, Software,
Guidelines and Future Directions
- Authors: William C. Sleeman IV and Bartosz Krawczyk
- Abstract summary: We propose a holistic look at oversampling algorithms for imbalanced big data.
We introduce a Spark library with 14 state-of-the-art oversampling algorithms.
We evaluate the trade-off between accuracy and time complexity of oversampling algorithms.
- Score: 6.436899373275926
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Learning from imbalanced data is among the most challenging areas in
contemporary machine learning. This becomes even more difficult when considered in
the context of big data that calls for dedicated architectures capable of
high-performance processing. Apache Spark is a highly efficient and popular
architecture, but it poses specific challenges for algorithms to be implemented
for it. While oversampling algorithms are an effective way of handling class
imbalance, they have not been designed for distributed environments. In this
paper, we propose a holistic look at oversampling algorithms for imbalanced big
data. We discuss the taxonomy of oversampling algorithms and their mechanisms
used to handle skewed class distributions. We introduce a Spark library with 14
state-of-the-art oversampling algorithms implemented and evaluate their
efficacy via an extensive experimental study. Using binary and multi-class massive
data sets, we analyze the effectiveness of oversampling algorithms and their
relationships with different types of classifiers. We evaluate the trade-off
between accuracy and time complexity of oversampling algorithms, as well as
their scalability when increasing the size of data. This allows us to gain
insight into the usefulness of specific components of oversampling algorithms
for big data, as well as formulate guidelines and recommendations for designing
future resampling approaches for massive imbalanced data. Our library can be
downloaded from https://github.com/fsleeman/spark-class-balancing.git.
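To make the idea of oversampling concrete, the following is a minimal single-machine sketch of SMOTE-style interpolation, one well-known oversampling technique. It is not the API of the authors' Spark library and is not taken from the paper; the function names `smote_sketch` and `oversample_to_balance`, the parameter `k`, and the toy data are all assumptions made for illustration. Each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by SMOTE-style interpolation.

    minority: list of equal-length numeric feature lists from one class.
    Hypothetical helper for illustration only, not the paper's library.
    """
    if n_new == 0:
        return []
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Brute-force search for the k nearest minority neighbours of base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Interpolate: a random point on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def oversample_to_balance(majority, minority, k=5, seed=0):
    """Return the minority samples plus enough synthetic points to match the majority size."""
    deficit = max(0, len(majority) - len(minority))
    return list(minority) + smote_sketch(minority, deficit, k=k, seed=seed)


if __name__ == "__main__":
    majority = [[float(i), float(i % 3)] for i in range(10)]
    minority = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]]
    balanced = oversample_to_balance(majority, minority, k=2)
    print(len(balanced))  # 10: the 3 original points plus 7 synthetic ones
```

The neighbour search in this sketch compares each chosen point against every other minority sample held in memory. When the data is partitioned across a cluster, those neighbours may sit on different partitions, which gives one intuition for why oversampling needs dedicated treatment in a distributed setting.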
Related papers
- A Mirror Descent-Based Algorithm for Corruption-Tolerant Distributed Gradient Descent [57.64826450787237]
We show how to analyze the behavior of distributed gradient descent algorithms in the presence of adversarial corruptions.
We show how to use ideas from (lazy) mirror descent to design a corruption-tolerant distributed optimization algorithm.
Experiments based on linear regression, support vector classification, and softmax classification on the MNIST dataset corroborate our theoretical findings.
arXiv Detail & Related papers (2024-07-19T08:29:12Z)
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
- Performance Evaluation and Comparison of a New Regression Algorithm [4.125187280299247]
We compare the performance of a newly proposed regression algorithm against four conventional machine learning algorithms.
The reader is free to replicate our results since we have provided the source code in a GitHub repository.
arXiv Detail & Related papers (2023-06-15T13:01:16Z)
- Improving and Benchmarking Offline Reinforcement Learning Algorithms [87.67996706673674]
This work aims to bridge the gaps caused by low-level choices and datasets.
We empirically investigate 20 implementation choices using three representative algorithms.
We find two variants CRR+ and CQL+ achieving new state-of-the-art on D4RL.
arXiv Detail & Related papers (2023-06-01T17:58:46Z)
- ParlayANN: Scalable and Deterministic Parallel Graph-Based Approximate
Nearest Neighbor Search Algorithms [5.478671305092084]
We introduce ParlayANN, a library of deterministic and parallel graph-based approximate nearest neighbor search algorithms.
We develop novel parallel implementations for four state-of-the-art graph-based ANNS algorithms that scale to billion-scale datasets.
arXiv Detail & Related papers (2023-05-07T19:28:23Z)
- Towards Automated Imbalanced Learning with Deep Hierarchical
Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class.
Over-sampling is an effective technique to tackle imbalanced learning through generating synthetic samples for the minority class.
We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
arXiv Detail & Related papers (2022-08-26T04:28:01Z)
- Multi-granularity Relabeled Under-sampling Algorithm for Imbalanced Data [15.030895782548576]
The imbalanced classification problem turns out to be one of the important and challenging problems in data mining and machine learning.
The Tomek-Link sampling algorithm can effectively reduce the class overlap on data, remove the majority instances that are difficult to distinguish, and improve the algorithm classification accuracy.
However, the Tomek-Links under-sampling algorithm only considers the boundary instances that are the nearest neighbors to each other globally and ignores the potential local overlapping instances.
This paper proposes a multi-granularity relabeled under-sampling algorithm (MGRU) which fully considers the local information of the data set in the
arXiv Detail & Related papers (2022-01-11T14:07:55Z)
- StreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm [60.61943386819384]
Existing implementations of KRR require that all the data is stored in the main memory.
We propose StreaMRAK - a streaming version of KRR.
We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum.
arXiv Detail & Related papers (2021-08-23T21:03:09Z)
- A Method for Handling Multi-class Imbalanced Data by Geometry based
Information Sampling and Class Prioritized Synthetic Data Generation (GICaPS) [15.433936272310952]
This paper looks into the problem of handling imbalanced data in a multi-label classification problem.
Two novel methods are proposed that exploit the geometric relationship between the feature vectors.
The efficacy of the proposed methods is analyzed by solving a generic multi-class recognition problem.
arXiv Detail & Related papers (2020-10-11T04:04:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.