Scalable Feature Subset Selection for Big Data using Parallel Hybrid
Evolutionary Algorithm based Wrapper in Apache Spark
- URL: http://arxiv.org/abs/2106.14007v3
- Date: Tue, 25 Jan 2022 05:35:46 GMT
- Title: Scalable Feature Subset Selection for Big Data using Parallel Hybrid
Evolutionary Algorithm based Wrapper in Apache Spark
- Authors: Yelleti Vivek, Vadlamani Ravi and Pisipati Radhakrishna
- Abstract summary: We propose a wrapper for feature subset selection (FSS) based on parallel and distributed hybrid evolutionary algorithms (EAs) under the Apache Spark environment.
The effectiveness of the proposed algorithms is tested over the five large datasets of varying feature space dimension, taken from cyber security and biology domains.
- Score: 4.241208172557663
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Owing to the emergence of large datasets, applying current sequential
wrapper-based feature subset selection (FSS) algorithms increases the
complexity. This limitation motivated us to propose a wrapper for feature
subset selection (FSS) based on parallel and distributed hybrid evolutionary
algorithms (EAs) under the Apache Spark environment. The hybrid EAs are based
on the BDE and Binary Threshold Accepting (BTA), a point-based EA, which is
invoked to enhance the search capability and avoid premature convergence of the
PB-DE. Thus, we designed the hybrid variants (i) parallel binary differential
evolution and threshold accepting (PB-DETA), where DE and TA work in tandem in
every iteration, and (ii) parallel binary threshold accepting and differential
evolution (PB-TADE), where TA and DE work in tandem in every iteration under
the Apache Spark environment. Both PB-DETA and PB-TADE are compared with the
baseline, viz., the parallel version of the binary differential evolution
(PB-DE). All three proposed approaches use logistic regression (LR) to compute
the fitness function, namely, the area under ROC curve (AUC). The effectiveness
of the proposed algorithms is tested over the five large datasets of varying
feature space dimension, taken from cyber security and biology domains. It is
noteworthy that the PB-TADE turned out to be statistically significant compared
to PB-DE and PB-DETA. We reported the speedup analysis, average AUC obtained by
the most repeated feature subset, feature subset with high AUC and least
cardinality.
Related papers
- Adaptive Knowledge-based Multi-Objective Evolutionary Algorithm for Hybrid Flow Shop Scheduling Problems with Multiple Parallel Batch Processing Stages [5.851739146497829]
This study generalizes the problem model, in which users can arbitrarily set certain stages as parallel batch processing stages.
An Adaptive Knowledge-based Multi-Objective Evolutionary Algorithm (AMOEA/D) is designed to simultaneously optimize both makespan and Total Energy Consumption.
The experimental results show that the AMOEA/D is superior to the comparison algorithms in solving the PBHFSP.
arXiv Detail & Related papers (2024-09-27T08:05:56Z) - Evolving a Multi-Population Evolutionary-QAOA on Distributed QPUs [0.0]
We demonstrate that our Evolutionary-QAOA (E-QAOA) pairing performs on par or better than a COBYLA-based QAOA.
We also present a novel approach by presenting a multi-population EA distributed on two QPUs.
arXiv Detail & Related papers (2024-09-16T21:16:51Z) - Poisson Process for Bayesian Optimization [126.51200593377739]
We propose a ranking-based surrogate model based on the Poisson process and introduce an efficient BO framework, namely Poisson Process Bayesian Optimization (PoPBO)
Compared to the classic GP-BO method, our PoPBO has lower costs and better robustness to noise, which is verified by abundant experiments.
arXiv Detail & Related papers (2024-02-05T02:54:50Z) - An Efficient High-Dimensional Gene Selection Approach based on Binary
Horse Herd Optimization Algorithm for Biological Data Classification [1.1510009152620668]
The Horse Herd Optimization Algorithm (HOA) is a new meta-heuristic algorithm based on the behaviors of horses at different ages.
This paper proposes a binary version of the HOA in order to solve discrete problems and select prominent feature subsets.
The proposed hybrid method (MRMR-BHOA) demonstrates superior performance in terms of accuracy and minimum selected features.
arXiv Detail & Related papers (2023-08-18T19:40:59Z) - Massively Parallel Genetic Optimization through Asynchronous Propagation
of Populations [50.591267188664666]
Propulate is an evolutionary optimization algorithm and software package for global optimization.
We provide an MPI-based implementation of our algorithm, which features variants of selection, mutation, crossover, and migration.
We find that Propulate is up to three orders of magnitude faster without sacrificing solution accuracy.
arXiv Detail & Related papers (2023-01-20T18:17:34Z) - Sparse high-dimensional linear regression with a partitioned empirical
Bayes ECM algorithm [62.997667081978825]
We propose a computationally efficient and powerful Bayesian approach for sparse high-dimensional linear regression.
Minimal prior assumptions on the parameters are used through the use of plug-in empirical Bayes estimates.
The proposed approach is implemented in the R package probe.
arXiv Detail & Related papers (2022-09-16T19:15:50Z) - Feature subset selection for Big Data via Chaotic Binary Differential
Evolution under Apache Spark [4.241208172557663]
We propose a novel multiplicative single objective function involving cardinality and AUC.
We embed Logistic and Tent chaotic maps into the Binary Differential Evolution (BDE) and named it as Chaotic Binary Differential Evolution (CBDE)
The results empirically show that the proposed parallel Chaotic Binary Differential Evolution (P-CBDE-iS) is able to find the better quality feature subsets.
arXiv Detail & Related papers (2022-02-08T11:39:40Z) - ES-Based Jacobian Enables Faster Bilevel Optimization [53.675623215542515]
Bilevel optimization (BO) has arisen as a powerful tool for solving many modern machine learning problems.
Existing gradient-based methods require second-order derivative approximations via Jacobian- or/and Hessian-vector computations.
We propose a novel BO algorithm, which adopts Evolution Strategies (ES) based method to approximate the response Jacobian matrix in the hypergradient of BO.
arXiv Detail & Related papers (2021-10-13T19:36:50Z) - Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study optimization of areas under precision-recall curves (AUPRC), which is widely used for imbalanced tasks.
We develop novel momentum methods with a better iteration of $O (1/epsilon4)$ for finding an $epsilon$stationary solution.
We also design a novel family of adaptive methods with the same complexity of $O (1/epsilon4)$, which enjoy faster convergence in practice.
arXiv Detail & Related papers (2021-07-02T16:21:52Z) - Bilevel Optimization: Convergence Analysis and Enhanced Design [63.64636047748605]
Bilevel optimization is a tool for many machine learning problems.
We propose a novel stoc-efficientgradient estimator named stoc-BiO.
arXiv Detail & Related papers (2020-10-15T18:09:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.