An Instance Selection Algorithm for Big Data in High imbalanced datasets
based on LSH
- URL: http://arxiv.org/abs/2210.04310v1
- Date: Sun, 9 Oct 2022 17:38:41 GMT
- Title: An Instance Selection Algorithm for Big Data in High imbalanced datasets
based on LSH
- Authors: Germán E. Melo-Acosta and Freddy Duitama-Muñoz and Julián D.
Arias-Londoño
- Abstract summary: Training Machine Learning models in real contexts often deals with big data sets and imbalanced samples where the class of interest is underrepresented.
This work proposes three new methods for instance selection (IS) able to deal with large and imbalanced data sets.
Algorithms were developed in the Apache Spark framework, guaranteeing their scalability.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training of Machine Learning (ML) models in real contexts often deals with
big data sets and highly class-imbalanced samples where the class of interest is
underrepresented (the minority class). Practical solutions using classical ML models
address the problem of large data sets using parallel/distributed
implementations of training algorithms, approximate model-based solutions, or
applying instance selection (IS) algorithms to eliminate redundant information.
However, the combined problem of big and highly imbalanced datasets has been
less addressed. This work proposes three new methods for IS able to deal with
large and imbalanced data sets. The proposed methods use Locality Sensitive
Hashing (LSH) as a base clustering technique, and then three different sampling
methods are applied on top of the clusters (or buckets) generated by LSH. The
algorithms were developed in the Apache Spark framework, guaranteeing their
scalability. The experiments carried out on three different datasets suggest
that the proposed IS methods can improve the performance of a base ML model
between 5% and 19% in terms of the geometric mean.
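The pipeline the abstract describes — LSH as a base clustering step, then sampling applied within each bucket — can be sketched in plain Python. This is a minimal illustration, not the paper's method: the actual algorithms run on Apache Spark, and the random-hyperplane hash family, the per-bucket majority cap, and all function names below are assumptions made for the example.

```python
import random

random.seed(0)

def lsh_signature(point, hyperplanes):
    # Random-hyperplane LSH: each bit records which side of a
    # hyperplane the point falls on; nearby points tend to collide.
    return tuple(1 if sum(w * x for w, x in zip(h, point)) >= 0 else 0
                 for h in hyperplanes)

def lsh_bucket_undersample(data, labels, n_planes=4, dim=2, minority=1):
    """Group instances into LSH buckets, then keep every minority
    instance and at most a matching number of majority instances
    per bucket (one illustrative sampling rule among many)."""
    hyperplanes = [[random.gauss(0, 1) for _ in range(dim)]
                   for _ in range(n_planes)]
    buckets = {}
    for x, y in zip(data, labels):
        buckets.setdefault(lsh_signature(x, hyperplanes), []).append((x, y))
    selected = []
    for items in buckets.values():
        mino = [p for p in items if p[1] == minority]
        majo = [p for p in items if p[1] != minority]
        # Keep all minority instances; cap the majority at the bucket's
        # minority count (at least 1, so purely-majority buckets survive).
        keep = max(len(mino), 1)
        selected.extend(mino)
        selected.extend(random.sample(majo, min(keep, len(majo))))
    return selected
```

In a distributed setting the same idea maps naturally onto Spark: the signature becomes the key of a key-value pair, and the per-bucket sampling becomes a grouped or keyed sampling operation, which is what makes the approach scale.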
Related papers
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
- Clustering based on Mixtures of Sparse Gaussian Processes [6.939768185086753]
How to cluster data using their low dimensional embedded space is still a challenging problem in machine learning.
In this article, we focus on proposing a joint formulation for both clustering and dimensionality reduction.
Our algorithm is based on a mixture of sparse Gaussian processes, which is called Sparse Gaussian Process Mixture Clustering (SGP-MIC).
arXiv Detail & Related papers (2023-03-23T20:44:36Z)
- Research on Efficient Fuzzy Clustering Method Based on Local Fuzzy Granular balls [67.33923111887933]
In this paper, the data is fuzzy iterated using granular-balls, and the membership degree of data only considers the two granular-balls where it is located.
The formed fuzzy granular-balls set can use more processing methods in the face of different data scenarios.
arXiv Detail & Related papers (2023-03-07T01:52:55Z)
- Unified Multi-View Orthonormal Non-Negative Graph Based Clustering Framework [74.25493157757943]
We formulate a novel clustering model, which exploits the non-negative feature property and incorporates the multi-view information into a unified joint learning framework.
We also explore, for the first time, the multi-model non-negative graph-based approach to clustering data based on deep features.
arXiv Detail & Related papers (2022-11-03T08:18:27Z)
- A Hybrid Approach for Binary Classification of Imbalanced Data [0.0]
We propose HADR, a hybrid approach with dimension reduction that consists of data block construction, dimensionality reduction, and ensemble learning.
We evaluate the performance on eight imbalanced public datasets in terms of recall, G-mean, and AUC.
arXiv Detail & Related papers (2022-07-06T15:18:41Z)
- Envelope Imbalance Learning Algorithm based on Multilayer Fuzzy C-means Clustering and Minimum Interlayer discrepancy [14.339674126923903]
This paper proposes a deep instance envelope network-based imbalanced learning algorithm with the multilayer fuzzy c-means (MlFCM) and a minimum interlayer discrepancy mechanism based on the maximum mean discrepancy (MIDMD).
This algorithm can guarantee high quality balanced instances using a deep instance envelope network in the absence of prior knowledge.
arXiv Detail & Related papers (2021-11-02T04:59:57Z)
- A Deep Learning Object Detection Method for an Efficient Clusters Initialization [6.365889364810239]
Clustering has been used in numerous applications such as banking customers profiling, document retrieval, image segmentation, and e-commerce recommendation engines.
Existing clustering techniques present significant limitations, among them the dependence of their stability on the initialization parameters.
This paper proposes a solution that can provide near-optimal clustering parameters with low computational and resources overhead.
arXiv Detail & Related papers (2021-04-28T08:34:25Z)
- An Online Method for A Class of Distributionally Robust Optimization with Non-Convex Objectives [54.29001037565384]
We propose a practical online method for solving a class of online distributionally robust optimization (DRO) problems.
Our studies demonstrate important applications in machine learning for improving the robustness of networks.
arXiv Detail & Related papers (2020-06-17T20:19:25Z)
- Learnable Subspace Clustering [76.2352740039615]
We develop a learnable subspace clustering paradigm to efficiently solve the large-scale subspace clustering problem.
The key idea is to learn a parametric function to partition the high-dimensional subspaces into their underlying low-dimensional subspaces.
To the best of our knowledge, this paper is the first work to efficiently cluster millions of data points among the subspace clustering methods.
arXiv Detail & Related papers (2020-04-09T12:53:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.