Smart Data driven Decision Trees Ensemble Methodology for Imbalanced Big Data
- URL: http://arxiv.org/abs/2001.05759v3
- Date: Fri, 3 Sep 2021 10:23:55 GMT
- Title: Smart Data driven Decision Trees Ensemble Methodology for Imbalanced Big Data
- Authors: Diego García-Gil, Salvador García, Ning Xiong, Francisco Herrera
- Abstract summary: Split data strategies and the lack of data in the minority class due to the use of the MapReduce paradigm have posed new challenges for tackling imbalanced data problems.
Smart Data refers to data of sufficient quality to achieve high-performance models.
We propose a novel Smart Data driven Decision Trees Ensemble methodology for addressing the imbalanced classification problem in Big Data domains.
- Score: 11.117880929232575
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Differences in data size per class, also known as imbalanced data
distribution, have become a common problem affecting data quality. Big Data
scenarios pose a new challenge to traditional imbalanced classification
algorithms, since they are not prepared to work with such amounts of data.
Split data strategies and the lack of data in the minority class due to the use
of the MapReduce paradigm have posed new challenges for tackling the imbalance
between classes in Big Data scenarios. Ensembles have been shown to
successfully address imbalanced data problems. Smart Data refers to data of
sufficient quality to achieve high-performance models. The combination of
ensembles and Smart Data, achieved through Big Data preprocessing, should yield
a strong synergy. In this paper, we propose a novel Smart Data driven Decision
Trees Ensemble methodology for addressing the imbalanced classification problem
in Big Data domains, namely the SD_DeTE methodology. This methodology is based
on learning different decision trees using distributed quality data for the
ensemble process. This quality data is obtained by fusing Random
Discretization, Principal Component Analysis, and clustering-based Random
Oversampling to produce different Smart Data versions of the original data.
Experiments carried out on 21 binary adapted datasets have shown that our
methodology outperforms Random Forest.
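The three preprocessing steps named in the abstract can be combined into an ensemble along the following lines. This is a minimal, single-machine sketch for illustration only, not the authors' distributed implementation; the class name `SmartDataTreeEnsemble` and all parameters (`n_trees`, `n_bins`, cluster count) are hypothetical choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier


def make_random_cuts(X, n_bins, rng):
    """Random Discretization: draw random sorted cut points per feature."""
    return [np.sort(rng.uniform(X[:, j].min(), X[:, j].max(), n_bins - 1))
            for j in range(X.shape[1])]


def apply_cuts(X, cuts):
    """Map each feature to the index of its bin."""
    return np.column_stack([np.digitize(X[:, j], c)
                            for j, c in enumerate(cuts)]).astype(float)


def cluster_random_oversample(X, y, rng, n_clusters=3):
    """Balance classes by replicating random minority samples per KMeans cluster."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    need = counts.max() - counts.min()
    if need == 0:
        return X, y
    Xmin = X[y == minority]
    k = min(n_clusters, len(Xmin))
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xmin)
    extra = []
    for c in range(k):
        pool = Xmin[labels == c]
        take = need // k + (1 if c < need % k else 0)
        if len(pool) and take:
            extra.append(pool[rng.integers(0, len(pool), size=take)])
    Xnew = np.vstack(extra)
    return (np.vstack([X, Xnew]),
            np.concatenate([y, np.full(len(Xnew), minority)]))


class SmartDataTreeEnsemble:
    """Each member tree trains on its own 'Smart Data' view of the dataset."""

    def __init__(self, n_trees=10, n_bins=5, seed=0):
        self.n_trees, self.n_bins, self.seed = n_trees, n_bins, seed

    def fit(self, X, y):
        rng = np.random.default_rng(self.seed)
        self.members_ = []
        for _ in range(self.n_trees):
            cuts = make_random_cuts(X, self.n_bins, rng)   # 1) Random Discretization
            Xd = apply_cuts(X, cuts)
            pca = PCA(n_components=min(2, X.shape[1])).fit(Xd)  # 2) PCA view
            Xs, ys = cluster_random_oversample(pca.transform(Xd), y, rng)  # 3) balance
            tree = DecisionTreeClassifier(random_state=0).fit(Xs, ys)
            self.members_.append((cuts, pca, tree))
        return self

    def predict(self, X):
        votes = np.stack([t.predict(p.transform(apply_cuts(X, c)))
                          for c, p, t in self.members_])
        # majority vote over the ensemble members
        return np.array([np.bincount(col.astype(int)).argmax()
                         for col in votes.T])
```

Each member stores its own cut points and PCA projection so that test data passes through the same transformations it was trained on; oversampling is applied only at fit time.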
Related papers
- Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods are slow and computationally expensive.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z)
- A Survey of Methods for Handling Disk Data Imbalance [10.261915886145214]
This paper provides a comprehensive overview of research in the field of imbalanced data classification.
The Backblaze dataset, a widely used dataset related to hard discs, has a small amount of failure data and a large amount of health data, which exhibits a serious class imbalance.
arXiv Detail & Related papers (2023-10-13T05:35:13Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
- Rethinking Data Heterogeneity in Federated Learning: Introducing a New Notion and Standard Benchmarks [65.34113135080105]
We show that data heterogeneity in current setups is not necessarily a problem; in fact, it can be beneficial for the FL participants.
Our observations are intuitive.
Our code is available at https://github.com/MMorafah/FL-SC-NIID.
arXiv Detail & Related papers (2022-09-30T17:15:19Z)
- Effective Class-Imbalance learning based on SMOTE and Convolutional Neural Networks [0.1074267520911262]
Imbalanced Data (ID) is a problem that prevents Machine Learning (ML) models from achieving satisfactory results.
In this paper, we investigate the effectiveness of methods based on Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs).
In order to achieve reliable results, we conducted our experiments 100 times with randomly shuffled data distributions.
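The SMOTE technique referenced above generates synthetic minority samples by interpolating between a minority point and one of its nearest minority-class neighbours. A minimal NumPy sketch of that idea (a generic illustration of classic SMOTE, not the authors' implementation; the function name and parameters are chosen for this example):

```python
import numpy as np


def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating each
    chosen sample toward one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # pairwise distances within the minority class, excluding self-matches
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours per point
    base = rng.integers(0, len(X_min), n_new)  # random base samples
    neigh = nn[base, rng.integers(0, min(k, len(X_min) - 1), n_new)]
    gap = rng.random((n_new, 1))               # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])
```

Because every synthetic point lies on a segment between two real minority points, the generated samples stay inside the minority class's convex hull rather than duplicating existing rows.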
arXiv Detail & Related papers (2022-09-01T07:42:16Z)
- Information FOMO: The unhealthy fear of missing out on information. A method for removing misleading data for healthier models [0.0]
Misleading or unnecessary data can have out-sized impacts on the health or accuracy of Machine Learning (ML) models.
We present a sequential selection method that identifies critically important information within a dataset.
We find these instabilities are a result of the complexity of the underlying map and linked to extreme events and heavy tails.
arXiv Detail & Related papers (2022-08-27T19:43:53Z)
- Foundations of data imbalance and solutions for a data democracy [0.0]
Dealing with imbalanced data is a prevalent problem when performing classification on datasets.
Two essential statistical elements are resolved: the degree of class imbalance and the complexity of the concept.
Measures which are appropriate in these scenarios are discussed and implemented on a real-life dataset.
arXiv Detail & Related papers (2021-07-30T20:37:23Z)
- Towards Stable Imbalanced Data Classification via Virtual Big Data Projection [3.3707422585608953]
We investigate the capability of VBD to address deep autoencoder training and imbalanced data classification.
First, we prove that VBD can significantly decrease the validation loss of autoencoders by providing them with a large, diversified body of training data.
Second, we propose the first projection-based method called cross-concatenation to balance the skewed class distributions without over-sampling.
arXiv Detail & Related papers (2020-08-23T04:01:51Z)
- Learning while Respecting Privacy and Robustness to Distributional Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model.
The objective is to endow the trained model with robustness against adversarially manipulated input data.
Proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z)
- Long-Tailed Recognition Using Class-Balanced Experts [128.73438243408393]
We propose an ensemble of class-balanced experts that combines the strength of diverse classifiers.
Our ensemble of class-balanced experts reaches results close to state-of-the-art and an extended ensemble establishes a new state-of-the-art on two benchmarks for long-tailed recognition.
arXiv Detail & Related papers (2020-04-07T20:57:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.