Filter Methods for Feature Selection in Supervised Machine Learning
Applications -- Review and Benchmark
- URL: http://arxiv.org/abs/2111.12140v1
- Date: Tue, 23 Nov 2021 20:20:24 GMT
- Title: Filter Methods for Feature Selection in Supervised Machine Learning
Applications -- Review and Benchmark
- Authors: Konstantin Hopf, Sascha Reifenrath
- Abstract summary: This review synthesizes the literature on feature selection benchmarking and evaluates the performance of 58 methods in the widely used R environment.
We consider four typical dataset scenarios that are challenging for ML models.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The amount of data for machine learning (ML) applications is constantly
growing. Not only the number of observations but especially the number of measured
variables (features) increases with ongoing digitization. Selecting the most
appropriate features for predictive modeling is an important lever for the
success of ML applications in business and research. Many feature selection methods
(FSMs) that are independent of a particular ML algorithm, so-called filter methods,
have been proposed, but little guidance exists to help researchers and
quantitative modelers choose appropriate approaches for typical ML
problems. This review synthesizes the substantial literature on feature
selection benchmarking and evaluates the performance of 58 methods in the
widely used R environment. For concrete guidance, we consider four typical
dataset scenarios that are challenging for ML models (noisy, redundant,
imbalanced data and cases with more features than observations). Drawing on the
experience of earlier benchmarks, which have considered far fewer FSMs, we
compare the performance of the methods according to four criteria (predictive
performance, number of relevant features selected, stability of the feature
sets and runtime). We found that methods relying on the random forest approach, the
double input symmetrical relevance filter (DISR) and the joint impurity filter
(JIM) were well-performing candidate methods for the given dataset scenarios.
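To make the idea of a filter method and the stability criterion concrete, the following is a minimal sketch in Python and not code from the paper, whose benchmark was run in R. It assumes a simple absolute Pearson correlation filter as a stand-in (it is not claimed to be one of the benchmarked methods), uses hypothetical toy data from a made-up `make_data` helper, and measures stability as the Jaccard similarity of the feature sets selected on two random subsamples.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def filter_top_k(X, y, k):
    """Score each feature by |correlation with y|, independent of any ML model,
    and return the indices of the k highest-scoring features."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard similarity of two feature sets (1.0 means identical selections)."""
    return len(a & b) / len(a | b) if (a | b) else 1.0


def make_data(n, seed):
    """Hypothetical toy data: features 0 and 1 drive y, features 2..5 are noise."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(n)]
    y = [row[0] + row[1] + rng.gauss(0, 0.1) for row in X]
    return X, y


X, y = make_data(200, seed=0)
selected = filter_top_k(X, y, k=2)

# Stability: repeat the selection on two random subsamples and compare the sets.
rng = random.Random(1)
idx_a = rng.sample(range(200), 100)
idx_b = rng.sample(range(200), 100)
sel_a = filter_top_k([X[i] for i in idx_a], [y[i] for i in idx_a], k=2)
sel_b = filter_top_k([X[i] for i in idx_b], [y[i] for i in idx_b], k=2)
stability = jaccard(sel_a, sel_b)
print(selected, stability)
```

Because the scoring step never trains a predictive model, the same ranking can be reused with any downstream learner, which is what distinguishes filter methods from wrapper or embedded approaches.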
Related papers
- An incremental preference elicitation-based approach to learning potentially non-monotonic preferences in multi-criteria sorting [53.36437745983783]
We first construct a max-margin optimization-based model to model potentially non-monotonic preferences.
We devise information amount measurement methods and question selection strategies to pinpoint the most informative alternative in each iteration.
Two incremental preference elicitation-based algorithms are developed to learn potentially non-monotonic preferences.
arXiv Detail & Related papers (2024-09-04T14:36:20Z)
- LLM-Select: Feature Selection with Large Language Models [64.5099482021597]
Large language models (LLMs) are capable of selecting the most predictive features, with performance rivaling the standard tools of data science.
Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place.
arXiv Detail & Related papers (2024-07-02T22:23:40Z)
- DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
- Binary Feature Mask Optimization for Feature Selection [0.0]
We introduce a novel framework that selects features considering the predictions of the model.
Our framework innovates by using a novel feature masking approach to eliminate the features during the selection process.
We demonstrate significant performance improvements on real-life datasets using LightGBM and Multi-Layer Perceptron as our ML models.
arXiv Detail & Related papers (2024-01-23T10:54:13Z)
- A model-free feature selection technique of feature screening and random forest based recursive feature elimination [0.0]
We propose a model-free feature selection method for ultra-high dimensional data with mass features.
We show that the proposed method is selection consistent and $L$ consistent under weak regularity conditions.
arXiv Detail & Related papers (2023-02-15T03:39:16Z)
- Variational Factorization Machines for Preference Elicitation in Large-Scale Recommender Systems [17.050774091903552]
We propose a variational formulation of factorization machines (FMs) that can be easily optimized using standard mini-batch gradient descent.
Our algorithm learns an approximate posterior distribution over the user and item parameters, which leads to confidence intervals over the predictions.
We show, using several datasets, that it has comparable or better performance than existing methods in terms of prediction accuracy.
arXiv Detail & Related papers (2022-12-20T00:06:28Z)
- Compactness Score: A Fast Filter Method for Unsupervised Feature Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select desired features.
Our proposed algorithm seems to be more accurate and efficient compared with existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z)
- RoMA: Robust Model Adaptation for Offline Model-based Optimization [115.02677045518692]
We consider the problem of searching for an input that maximizes a black-box objective function, given a static dataset of input-output queries.
A popular approach to solving this problem is maintaining a proxy model that approximates the true objective function.
Here, the main challenge is how to avoid adversarially optimized inputs during the search.
arXiv Detail & Related papers (2021-10-27T05:37:12Z)
- Efficient Data-specific Model Search for Collaborative Filtering [56.60519991956558]
Collaborative filtering (CF) is a fundamental approach for recommender systems.
In this paper, motivated by the recent advances in automated machine learning (AutoML), we propose to design a data-specific CF model.
Key here is a new framework that unifies state-of-the-art (SOTA) CF methods and splits them into disjoint stages of input encoding, embedding function, interaction and prediction function.
arXiv Detail & Related papers (2021-06-14T14:30:32Z)
- Robusta: Robust AutoML for Feature Selection via Reinforcement Learning [24.24652530951966]
We propose the first robust AutoML framework, Robusta, based on reinforcement learning (RL).
We show that the framework is able to improve the model robustness by up to 22% while maintaining competitive accuracy on benign samples.
arXiv Detail & Related papers (2021-01-15T03:12:29Z)
- Feature Selection for Huge Data via Minipatch Learning [0.0]
We propose Stable Minipatch Selection (STAMPS) and Adaptive STAMPS.
STAMPS are meta-algorithms that build ensembles of selection events of base feature selectors trained on tiny, (adaptively-chosen) random subsets of both the observations and features of the data.
Our approaches are general and can be employed with a variety of existing feature selection strategies and machine learning techniques.
arXiv Detail & Related papers (2020-10-16T17:41:08Z)