Smart Data driven Decision Trees Ensemble Methodology for Imbalanced Big Data
- URL: http://arxiv.org/abs/2001.05759v3
- Date: Fri, 3 Sep 2021 10:23:55 GMT
- Title: Smart Data driven Decision Trees Ensemble Methodology for Imbalanced Big Data
- Authors: Diego García-Gil, Salvador García, Ning Xiong, Francisco Herrera
- Abstract summary: Split data strategies and the lack of data in the minority class due to the use of the MapReduce paradigm have posed new challenges for tackling imbalanced data problems.
Smart Data refers to data of sufficient quality to achieve high-performance models.
We propose a novel Smart Data driven Decision Trees Ensemble methodology for addressing the imbalanced classification problem in Big Data domains.
- Score: 11.117880929232575
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Differences in data size per class, also known as imbalanced data
distribution, have become a common problem affecting data quality. Big Data
scenarios pose a new challenge to traditional imbalanced classification
algorithms, since they are not prepared to work with such amounts of data.
Split data strategies and the lack of data in the minority class due to the use
of the MapReduce paradigm have posed new challenges for tackling the imbalance
between classes in Big Data scenarios. Ensembles have been shown to
successfully address imbalanced data problems. Smart Data refers to data of
sufficient quality to achieve high-performance models. The combination of
ensembles and Smart Data, achieved through Big Data preprocessing, should yield
a strong synergy. In this paper, we propose a novel Smart Data driven Decision
Trees Ensemble methodology for addressing the imbalanced classification problem
in Big Data domains, namely the SD_DeTE methodology. This methodology is based
on learning different decision trees using distributed quality data for the
ensemble process. This quality data is obtained by fusing Random
Discretization, Principal Component Analysis, and clustering-based Random
Oversampling to produce different Smart Data versions of the original data.
Experiments carried out on 21 binary adapted datasets have shown that our
methodology outperforms Random Forest.
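The three preprocessing steps named in the abstract can be combined into an ensemble along the following lines. This is a minimal, single-machine sketch for illustration only, not the authors' distributed implementation; the class name `SmartDataTreeEnsemble` and all parameters (`n_trees`, `n_bins`, cluster count) are hypothetical choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier


def make_random_cuts(X, n_bins, rng):
    """Random Discretization: draw random sorted cut points per feature."""
    return [np.sort(rng.uniform(X[:, j].min(), X[:, j].max(), n_bins - 1))
            for j in range(X.shape[1])]


def apply_cuts(X, cuts):
    """Map each feature to the index of its bin."""
    return np.column_stack([np.digitize(X[:, j], c)
                            for j, c in enumerate(cuts)]).astype(float)


def cluster_random_oversample(X, y, rng, n_clusters=3):
    """Balance classes by replicating random minority samples per KMeans cluster."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    need = counts.max() - counts.min()
    if need == 0:
        return X, y
    Xmin = X[y == minority]
    k = min(n_clusters, len(Xmin))
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xmin)
    extra = []
    for c in range(k):
        pool = Xmin[labels == c]
        take = need // k + (1 if c < need % k else 0)
        if len(pool) and take:
            extra.append(pool[rng.integers(0, len(pool), size=take)])
    Xnew = np.vstack(extra)
    return (np.vstack([X, Xnew]),
            np.concatenate([y, np.full(len(Xnew), minority)]))


class SmartDataTreeEnsemble:
    """Each member tree trains on its own 'Smart Data' view of the dataset."""

    def __init__(self, n_trees=10, n_bins=5, seed=0):
        self.n_trees, self.n_bins, self.seed = n_trees, n_bins, seed

    def fit(self, X, y):
        rng = np.random.default_rng(self.seed)
        self.members_ = []
        for _ in range(self.n_trees):
            cuts = make_random_cuts(X, self.n_bins, rng)   # 1) Random Discretization
            Xd = apply_cuts(X, cuts)
            pca = PCA(n_components=min(2, X.shape[1])).fit(Xd)  # 2) PCA view
            Xs, ys = cluster_random_oversample(pca.transform(Xd), y, rng)  # 3) balance
            tree = DecisionTreeClassifier(random_state=0).fit(Xs, ys)
            self.members_.append((cuts, pca, tree))
        return self

    def predict(self, X):
        votes = np.stack([t.predict(p.transform(apply_cuts(X, c)))
                          for c, p, t in self.members_])
        # majority vote over the ensemble members
        return np.array([np.bincount(col.astype(int)).argmax()
                         for col in votes.T])
```

Each member stores its own cut points and PCA projection so that test data passes through the same transformations it was trained on; oversampling is applied only at fit time.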
Related papers
- Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods are slow and computationally expensive.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z)
- A Survey of Methods for Handling Disk Data Imbalance [10.261915886145214]
This paper provides a comprehensive overview of research in the field of imbalanced data classification.
The Backblaze dataset, a widely used dataset related to hard discs, has a small amount of failure data and a large amount of health data, which exhibits a serious class imbalance.
arXiv Detail & Related papers (2023-10-13T05:35:13Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
- Rethinking Data Heterogeneity in Federated Learning: Introducing a New Notion and Standard Benchmarks [65.34113135080105]
We show that data heterogeneity in current setups is not necessarily a problem; in fact, it can be beneficial for the FL participants.
Our observations are intuitive.
Our code is available at https://github.com/MMorafah/FL-SC-NIID.
arXiv Detail & Related papers (2022-09-30T17:15:19Z)
- Effective Class-Imbalance learning based on SMOTE and Convolutional Neural Networks [0.1074267520911262]
Imbalanced Data (ID) is a problem that prevents Machine Learning (ML) models from achieving satisfactory results.
In this paper, we investigate the effectiveness of methods based on Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs).
In order to achieve reliable results, we conducted our experiments 100 times with randomly shuffled data distributions.
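The SMOTE technique referenced above generates synthetic minority samples by interpolating between a minority point and one of its nearest minority-class neighbours. A minimal NumPy sketch of that idea (a generic illustration of classic SMOTE, not the authors' implementation; the function name and parameters are chosen for this example):

```python
import numpy as np


def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating each
    chosen sample toward one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # pairwise distances within the minority class, excluding self-matches
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours per point
    base = rng.integers(0, len(X_min), n_new)  # random base samples
    neigh = nn[base, rng.integers(0, min(k, len(X_min) - 1), n_new)]
    gap = rng.random((n_new, 1))               # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])
```

Because every synthetic point lies on a segment between two real minority points, the generated samples stay inside the minority class's convex hull rather than duplicating existing rows.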
arXiv Detail & Related papers (2022-09-01T07:42:16Z)
- Information FOMO: The unhealthy fear of missing out on information. A method for removing misleading data for healthier models [0.0]
Misleading or unnecessary data can have out-sized impacts on the health or accuracy of Machine Learning (ML) models.
We present a sequential selection method that identifies critically important information within a dataset.
We find these instabilities are a result of the complexity of the underlying map and linked to extreme events and heavy tails.
arXiv Detail & Related papers (2022-08-27T19:43:53Z)
- Foundations of data imbalance and solutions for a data democracy [0.0]
Dealing with imbalanced data is a prevalent problem when performing classification on datasets.
Two essential statistical elements are resolved: the degree of class imbalance and the complexity of the concept.
Measures which are appropriate in these scenarios are discussed and implemented on a real-life dataset.
arXiv Detail & Related papers (2021-07-30T20:37:23Z)
- Towards Stable Imbalanced Data Classification via Virtual Big Data Projection [3.3707422585608953]
We investigate the capability of VBD to address deep autoencoder training and imbalanced data classification.
First, we prove that VBD can significantly decrease the validation loss of autoencoders by providing them with a large, diversified body of training data.
Second, we propose the first projection-based method called cross-concatenation to balance the skewed class distributions without over-sampling.
arXiv Detail & Related papers (2020-08-23T04:01:51Z)
- Learning while Respecting Privacy and Robustness to Distributional Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model.
The objective is to endow the trained model with robustness against adversarially manipulated input data.
Proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z)
- Long-Tailed Recognition Using Class-Balanced Experts [128.73438243408393]
We propose an ensemble of class-balanced experts that combines the strength of diverse classifiers.
Our ensemble of class-balanced experts reaches results close to state-of-the-art and an extended ensemble establishes a new state-of-the-art on two benchmarks for long-tailed recognition.
arXiv Detail & Related papers (2020-04-07T20:57:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.