Related papers: On the (In)Significance of Feature Selection in High-Dimensional Datasets

Related papers

Optimised Feature Subset Selection via Simulated Annealing [39.58317527488534]
We introduce SA-FDR, a novel algorithm for $\ell_0$-norm feature selection. We show that SA-FDR consistently selects more compact feature subsets while achieving high predictive accuracy. As a result, SA-FDR provides a flexible and effective solution for designing interpretable models in high-dimensional settings.
arXiv Detail & Related papers (2025-07-31T13:57:38Z)
Permutation-based multi-objective evolutionary feature selection for high-dimensional data [43.18726655647964]
We propose a novel feature selection method for high-dimensional data, based on the well-known permutation feature importance approach. The proposed method employs a multi-objective evolutionary algorithm to search for candidate feature subsets. The effectiveness of our method has been validated on a set of 24 publicly available high-dimensional datasets.
arXiv Detail & Related papers (2025-01-24T08:11:28Z)
TAROT: Targeted Data Selection via Optimal Transport [64.56083922130269]
TAROT is a targeted data selection framework grounded in optimal transport theory. Previous targeted data selection methods rely on influence-based greedy heuristics to enhance domain-specific performance. We evaluate TAROT across multiple tasks, including semantic segmentation, motion prediction, and instruction tuning.
arXiv Detail & Related papers (2024-11-30T10:19:51Z)
Large-scale Multi-objective Feature Selection: A Multi-phase Search Space Shrinking Approach [0.27624021966289597]
Feature selection is a crucial step in machine learning, especially for high-dimensional datasets. This paper proposes a novel large-scale multi-objective evolutionary algorithm based on the search space shrinking, termed LMSSS. The effectiveness of the proposed algorithm is demonstrated through comprehensive experiments on 15 large-scale datasets.
arXiv Detail & Related papers (2024-10-13T23:06:10Z)
LLM-Select: Feature Selection with Large Language Models [64.5099482021597]
Large language models (LLMs) are capable of selecting the most predictive features, with performance rivaling the standard tools of data science. Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place.
arXiv Detail & Related papers (2024-07-02T22:23:40Z)
Feature Selection as Deep Sequential Generative Learning [50.00973409680637]
We develop a deep variational transformer model over a joint of sequential reconstruction, variational, and performance evaluator losses. Our model can distill feature selection knowledge and learn a continuous embedding space to map feature selection decision sequences into embedding vectors associated with utility scores.
arXiv Detail & Related papers (2024-03-06T16:31:56Z)
LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection. We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
A Performance-Driven Benchmark for Feature Selection in Tabular Deep Learning [131.2910403490434]
Data scientists typically collect as many features as possible into their datasets, and even engineer new features from existing ones. Existing benchmarks for tabular feature selection consider classical downstream models, toy synthetic datasets, or do not evaluate feature selectors on the basis of downstream performance. We construct a challenging feature selection benchmark evaluated on downstream neural networks including transformers. We also propose an input-gradient-based analogue of Lasso for neural networks that outperforms classical feature selection methods on challenging problems.
arXiv Detail & Related papers (2023-11-10T05:26:10Z)
Utilizing Semantic Textual Similarity for Clinical Survey Data Feature Selection [4.5574502769585745]
Machine learning models that attempt to predict outcomes from survey data can overfit and result in poor generalizability. One remedy to this issue is feature selection, which attempts to select an optimal subset of features to learn upon. The relationships between feature names and target names can be evaluated using language models (LMs) to produce semantic textual similarity (STS) scores. We examine the performance using STS to select features directly and in the minimal-redundancy-maximal-relevance (mRMR) algorithm.
arXiv Detail & Related papers (2023-08-19T03:10:51Z)
Data Selection for Language Models via Importance Resampling [90.9263039747723]
We formalize the problem of selecting a subset of a large raw unlabeled dataset to match a desired target distribution. We extend the classic importance resampling approach used in low dimensions to LM data selection. We instantiate the DSIR framework with hashed n-gram features for efficiency, enabling the selection of 100M documents in 4.5 hours.
arXiv Detail & Related papers (2023-02-06T23:57:56Z)
A-SFS: Semi-supervised Feature Selection based on Multi-task Self-supervision [1.3190581566723918]
We introduce a deep learning-based self-supervised mechanism into feature selection problems. A batch-attention mechanism is designed to generate feature weights according to batch-based feature selection patterns. Experimental results show that A-SFS achieves the highest accuracy in most datasets.
arXiv Detail & Related papers (2022-07-19T04:22:27Z)
Parallel feature selection based on the trace ratio criterion [4.30274561163157]
This work presents a novel parallel feature selection approach for classification, namely Parallel Feature Selection using Trace criterion (PFST). Our method uses the trace criterion, a measure of class separability used in Fisher's Discriminant Analysis, to evaluate feature usefulness. The experiments show that our method can produce a small set of features in a fraction of the time required by the other methods under comparison.
arXiv Detail & Related papers (2022-03-03T10:50:33Z)
Compactness Score: A Fast Filter Method for Unsupervised Feature Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select the desired features. Our proposed algorithm seems to be more accurate and efficient than existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z)
Sparse Centroid-Encoder: A Nonlinear Model for Feature Selection [1.2487990897680423]
We develop a sparse implementation of the centroid-encoder for nonlinear data reduction and visualization called Sparse Centroid-Encoder. We also provide a feature selection framework that first ranks each feature by its occurrence, and the optimal number of features is chosen using a validation set. The algorithm is applied to a wide variety of data sets, including single-cell biological data, high-dimensional infectious disease data, hyperspectral data, image data, and speech data.
arXiv Detail & Related papers (2022-01-30T20:46:24Z)
Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class. This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples. Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
Feature Selection Using Reinforcement Learning [0.0]
The space of variables or features that can be used to characterize a particular predictor of interest continues to grow exponentially. Identifying the most characterizing features that minimize the variance without jeopardizing the bias of our models is critical to successfully training a machine learning model.
arXiv Detail & Related papers (2021-01-23T09:24:37Z)
Elastic Net based Feature Ranking and Selection [9.289190508925875]
An intuitive idea is applied after multiple rounds of data splitting and elastic-net-based feature selection. It tracks how often each feature is selected and uses that frequency as an indicator of feature importance. It achieves performance competitive with or superior to elastic net while consistently selecting fewer features.
arXiv Detail & Related papers (2020-12-30T00:08:36Z)
Joint Adaptive Graph and Structured Sparsity Regularization for Unsupervised Feature Selection [6.41804410246642]
We propose a joint adaptive graph and structured sparsity regularization unsupervised feature selection (JASFS) method. A subset of optimal features will be selected in groups, and the number of selected features will be determined automatically. Experimental results on eight benchmarks demonstrate the effectiveness and efficiency of the proposed method.
arXiv Detail & Related papers (2020-10-09T08:17:04Z)
Infinite Feature Selection: A Graph-based Feature Filtering Approach [78.63188057505012]
We propose a filtering feature selection framework that considers subsets of features as paths in a graph. Going to infinity allows the computational complexity of the selection process to be constrained. We show that Inf-FS behaves better in almost any situation, that is, when the number of features to keep is fixed a priori.
arXiv Detail & Related papers (2020-06-15T07:20:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.