Imbalanced Data Stream Classification using Dynamic Ensemble Selection
- URL: http://arxiv.org/abs/2309.09175v2
- Date: Thu, 28 Sep 2023 17:56:39 GMT
- Title: Imbalanced Data Stream Classification using Dynamic Ensemble Selection
- Authors: Priya.S and Haribharathi Sivakumar and Vijay Arvind.R
- Abstract summary: This work proposes a novel framework for integrating data pre-processing and dynamic ensemble selection.
The proposed framework was evaluated using six artificially generated data streams with differing imbalance ratios.
According to experimental results, data pre-processing combined with Dynamic Ensemble Selection techniques delivers significantly higher accuracy.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Modern streaming data categorization faces significant challenges from
concept drift and class-imbalanced data, both of which degrade classifier
output and lead to incorrect classifications. Other factors, such as overlap
between multiple classes, further limit classification accuracy. This work
proposes a novel framework that integrates data pre-processing with dynamic
ensemble selection, formulating a classification framework for nonstationary,
drifting, imbalanced data streams.
techniques. The proposed framework was evaluated using six artificially
generated data streams with differing imbalance ratios in combination with two
different types of concept drifts. Each stream is composed of 200 chunks of 500
objects described by eight features and contains five concept drifts. Seven
pre-processing techniques and two dynamic ensemble selection methods were
considered. According to the experimental results, combining data pre-processing
with dynamic ensemble selection techniques delivers significantly higher accuracy
when dealing with imbalanced data streams.
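The pipeline described above can be sketched with off-the-shelf components. The sketch below is an illustrative assumption rather than the paper's exact configuration: it pairs SMOTE (imbalanced-learn) as the pre-processing step with KNORA-Union (DESlib) as the dynamic ensemble selection method, evaluated test-then-train on a stream-learn generator matching the stated 200 chunks of 500 objects, eight features, and five drifts.

```python
# Hedged sketch: test-then-train evaluation of pre-processing + dynamic
# ensemble selection on an imbalanced, drifting stream. SMOTE and KNORA-U
# stand in for the paper's seven pre-processing and two DES techniques.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import balanced_accuracy_score
from imblearn.over_sampling import SMOTE
from deslib.des import KNORAU
from strlearn.streams import StreamGenerator

# One of six streams: 200 chunks x 500 objects, 8 features, 5 drifts;
# the 9:1 imbalance ratio is an assumed example value.
stream = StreamGenerator(n_chunks=200, chunk_size=500, n_features=8,
                         n_drifts=5, weights=[0.9, 0.1], random_state=42)

scores, des = [], None
for _ in range(stream.n_chunks):
    X, y = stream.get_chunk()
    if des is not None:                        # test on the incoming chunk first
        scores.append(balanced_accuracy_score(y, des.predict(X)))
    X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)  # rebalance chunk
    pool = BaggingClassifier(GaussianNB(), n_estimators=10,
                             random_state=42).fit(X_res, y_res)
    des = KNORAU(pool).fit(X_res, y_res)       # select competent members per query

print(f"mean balanced accuracy: {np.mean(scores):.3f}")
```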
Related papers
- Few-shot learning for COVID-19 Chest X-Ray Classification with
Imbalanced Data: An Inter vs. Intra Domain Study [49.5374512525016]
Medical image datasets are essential for training models used in computer-aided diagnosis, treatment planning, and medical research.
Some challenges are associated with these datasets, including variability in data distribution, data scarcity, and transfer learning issues when using models pre-trained from generic images.
We propose a methodology based on Siamese neural networks in which a series of techniques are integrated to mitigate the effects of data scarcity and distribution imbalance.
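As a rough illustration of the general approach (not the paper's architecture), a Siamese network runs two inputs through one shared encoder and is trained with a contrastive loss that pulls same-class pairs together and pushes different-class pairs apart. The encoder, input size, and margin below are assumptions.

```python
# Minimal Siamese sketch with a contrastive loss (PyTorch); all layer sizes
# are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseNet(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        # Small CNN encoder shared by both branches (weights are tied).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.LazyLinear(embed_dim),
        )

    def forward(self, x1, x2):
        return self.encoder(x1), self.encoder(x2)

def contrastive_loss(z1, z2, same_class, margin=1.0):
    # Pull same-class pairs together; push different-class pairs beyond margin.
    d = F.pairwise_distance(z1, z2)
    return (same_class * d.pow(2) +
            (1 - same_class) * F.relu(margin - d).pow(2)).mean()

# Dummy grayscale inputs (assumed 1x64x64); label 1 = same class, 0 = different.
net = SiameseNet()
x1, x2 = torch.randn(8, 1, 64, 64), torch.randn(8, 1, 64, 64)
labels = torch.randint(0, 2, (8,)).float()
loss = contrastive_loss(*net(x1, x2), labels)
loss.backward()
```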
arXiv Detail & Related papers (2024-01-18T16:59:27Z) - Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods tend to be slow and computationally expensive.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
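For intuition, online data mixing can be framed as a bandit problem: sample each batch's domain from a learned distribution and update the mixing weights multiplicatively from an observed reward (e.g., a loss-based signal). The EXP3-style sketch below is a generic illustration, not the paper's ODM algorithm.

```python
# Generic EXP3-style online data mixing sketch; the reward here is a random
# stand-in for a loss-based signal, and nothing below is the paper's method.
import numpy as np

rng = np.random.default_rng(0)
n_domains, gamma = 4, 0.1
weights = np.ones(n_domains)

for step in range(1000):
    # Mix in uniform exploration so every domain keeps being sampled.
    probs = (1 - gamma) * weights / weights.sum() + gamma / n_domains
    d = rng.choice(n_domains, p=probs)        # pick a domain for this batch
    reward = rng.random()                     # stand-in for a loss-based signal
    # Importance-weighted multiplicative update favors informative domains.
    weights[d] *= np.exp(gamma * reward / (probs[d] * n_domains))
    weights /= weights.max()                  # keep weights numerically stable

print("final mixture:", np.round(weights / weights.sum(), 3))
```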
arXiv Detail & Related papers (2023-12-05T00:42:35Z) - Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
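A minimal sketch of the idea: generate synthetic minority points by convexly interpolating minority and majority samples, mixup-style. The sampling scheme and lambda distribution below are illustrative assumptions, not the paper's exact procedure.

```python
# Mixup-style minority oversampling sketch (illustrative, not the paper's
# exact method): interpolate random minority/majority pairs, biasing the
# mixing coefficient toward the minority endpoint.
import numpy as np

def mix_minority(X_min, X_maj, n_new, alpha=0.75, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    i = rng.integers(0, len(X_min), n_new)
    j = rng.integers(0, len(X_maj), n_new)
    # Beta(alpha, alpha) lambdas, clipped so samples stay nearer the minority.
    lam = np.maximum(rng.beta(alpha, alpha, n_new), 0.5)[:, None]
    return lam * X_min[i] + (1 - lam) * X_maj[j]

rng = np.random.default_rng(0)
X_min = rng.normal(0, 1, (30, 8))    # scarce minority class
X_maj = rng.normal(3, 1, (500, 8))   # abundant majority class
print(mix_minority(X_min, X_maj, n_new=200, rng=rng).shape)  # (200, 8)
```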
arXiv Detail & Related papers (2023-08-28T18:48:34Z) - DynED: Dynamic Ensemble Diversification in Data Stream Classification [2.990411348977783]
We present a novel ensemble construction and maintenance approach based on MMR (Maximal Marginal Relevance).
Experimental results on four real and 11 synthetic datasets demonstrate that the proposed approach achieves higher average accuracy than five state-of-the-art baselines.
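A generic MMR selection rule, applied to ensemble pruning, scores each candidate by a weighted difference between its relevance (here, accuracy) and its maximum similarity to members already selected. The sketch below illustrates that rule only; DynED's exact scoring may differ.

```python
# Generic MMR (Maximal Marginal Relevance) selection over candidate ensemble
# members; an illustrative sketch, not DynED's exact procedure.
import numpy as np

def mmr_select(acc, preds, k, lam=0.7):
    """acc: (n,) candidate accuracies; preds: (n, m) predictions on m samples."""
    selected = [int(np.argmax(acc))]          # seed with the most accurate member
    while len(selected) < k:
        best, best_score = None, -np.inf
        for c in range(len(acc)):
            if c in selected:
                continue
            # Redundancy = highest fraction of matching predictions so far.
            sim = max((preds[c] == preds[s]).mean() for s in selected)
            score = lam * acc[c] - (1 - lam) * sim
            if score > best_score:
                best, best_score = c, score
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
preds = rng.integers(0, 2, (10, 100))         # 10 candidates, 100 test points
acc = rng.uniform(0.6, 0.9, 10)               # toy accuracy stand-ins
print(mmr_select(acc, preds, k=4))
```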
arXiv Detail & Related papers (2023-08-21T15:56:05Z) - On the Trade-off of Intra-/Inter-class Diversity for Supervised
Pre-training [72.8087629914444]
We study the impact of the trade-off between the intra-class diversity (the number of samples per class) and the inter-class diversity (the number of classes) of a supervised pre-training dataset.
With the size of the pre-training dataset fixed, the best downstream performance comes with a balance on the intra-/inter-class diversity.
arXiv Detail & Related papers (2023-05-20T16:23:50Z) - Revisiting Long-tailed Image Classification: Survey and Benchmarks with
New Evaluation Metrics [88.39382177059747]
A suite of metrics is designed to measure the accuracy, robustness, and bounds of algorithms for learning with long-tailed distributions.
Based on our benchmarks, we re-evaluate the performance of existing methods on CIFAR10 and CIFAR100 datasets.
arXiv Detail & Related papers (2023-02-03T02:40:54Z) - Continual Learning with Optimal Transport based Mixture Model [17.398605698033656]
We propose an online mixture model learning approach based on properties of optimal transport theory (OT-MM).
Our proposed method can significantly outperform the current state-of-the-art baselines.
arXiv Detail & Related papers (2022-11-30T06:40:29Z) - Semi-supervised Long-tailed Recognition using Alternate Sampling [95.93760490301395]
Main challenges in long-tailed recognition come from the imbalanced data distribution and sample scarcity in its tail classes.
We propose a new recognition setting, namely semi-supervised long-tailed recognition.
We demonstrate significant accuracy improvements over other competitive methods on two datasets.
arXiv Detail & Related papers (2021-05-01T00:43:38Z) - Data augmentation and feature selection for automatic model
recommendation in computational physics [0.0]
This article introduces two algorithms to address the lack of training data, their high dimensionality, and the non-applicability of common data augmentation techniques to physics data.
When combined with a stacking ensemble made of six multilayer perceptrons and a ridge logistic regression, they enable reaching an accuracy of 90% on our classification problem for nonlinear structural mechanics.
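This ensemble maps directly onto scikit-learn's StackingClassifier: six multilayer perceptrons as base learners and an L2-regularized ("ridge") logistic regression as the meta-learner. The hidden sizes, regularization strength, and synthetic data below are assumptions, not the article's configuration.

```python
# Stacking sketch matching the described ensemble shape; hyperparameters
# and data are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

base_learners = [
    (f"mlp{i}", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                              random_state=i))
    for i in range(6)
]
clf = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(penalty="l2", C=1.0),  # "ridge" logistic
)

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
print(clf.fit(X[:500], y[:500]).score(X[500:], y[500:]))
```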
arXiv Detail & Related papers (2021-01-12T15:09:11Z) - Posterior Re-calibration for Imbalanced Datasets [33.379680556475314]
Neural Networks can perform poorly when the training label distribution is heavily imbalanced.
We derive a post-training prior rebalancing technique that can be solved through a KL-divergence based optimization.
Our results on six different datasets and five different architectures show state-of-the-art accuracy.
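A simple closed-form instance of post-training prior rebalancing reweights predicted posteriors by the ratio of a target prior to the (imbalanced) training prior and renormalizes; the paper's KL-divergence formulation is more general, so the sketch below is only illustrative.

```python
# Post-hoc prior rebalancing sketch: p(y|x) is proportional to p(x|y) p(y),
# so swapping the training prior for a target prior is a per-class reweighting.
import numpy as np

def rebalance_posteriors(probs, train_prior, target_prior=None):
    if target_prior is None:                   # default: uniform target prior
        target_prior = np.full(probs.shape[1], 1.0 / probs.shape[1])
    adjusted = probs * (target_prior / train_prior)
    return adjusted / adjusted.sum(axis=1, keepdims=True)

train_prior = np.array([0.9, 0.1])             # heavily imbalanced training set
probs = np.array([[0.70, 0.30], [0.95, 0.05]]) # raw model posteriors
print(rebalance_posteriors(probs, train_prior))
```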
arXiv Detail & Related papers (2020-10-22T15:57:14Z) - stream-learn -- open-source Python library for difficult data stream
batch analysis [0.0]
stream-learn is compatible with scikit-learn and was developed for drifting and imbalanced data stream analysis.
Its main component is a stream generator, which can produce synthetic data streams.
In addition, estimators adapted for data stream classification have been implemented.
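A minimal usage sketch, assuming the current stream-learn API (a StreamGenerator plus the TestThenTrain evaluator):

```python
# Generate a drifting, imbalanced synthetic stream and evaluate an
# incremental scikit-learn classifier chunk by chunk (test-then-train).
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import balanced_accuracy_score
from strlearn.streams import StreamGenerator
from strlearn.evaluators import TestThenTrain

stream = StreamGenerator(n_chunks=200, chunk_size=500, n_features=8,
                         n_drifts=5, weights=[0.9, 0.1], random_state=42)

evaluator = TestThenTrain(metrics=(balanced_accuracy_score,))
evaluator.process(stream, GaussianNB())
print(evaluator.scores.shape)   # expected: (1 classifier, 199 chunks, 1 metric)
```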
arXiv Detail & Related papers (2020-01-29T20:15:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.