Selecting Features by their Resilience to the Curse of Dimensionality
- URL: http://arxiv.org/abs/2304.02455v2
- Date: Mon, 17 Apr 2023 11:56:50 GMT
- Title: Selecting Features by their Resilience to the Curse of Dimensionality
- Authors: Maximilian Stubbemann, Tobias Hille, Tom Hanika
- Abstract summary: Real-world datasets are often of high dimension and affected by the curse of dimensionality.
Here we step in with a novel method that identifies the features that allow for discriminating data subsets of different sizes.
Our experiments show that our method is competitive and commonly outperforms established feature selection methods.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-world datasets are often of high dimension and affected by the curse of
dimensionality. This hinders their comprehensibility and interpretability. To
reduce this complexity, feature selection aims to identify features that are
crucial for learning from said data. While measures of relevance and pairwise
similarities are commonly used, the curse of dimensionality is rarely
incorporated into the process of selecting features. Here we step in with a
novel method that identifies the features that allow for discriminating data
subsets of different sizes. By adapting recent work on computing intrinsic
dimensionalities, our method is able to select the features that can
discriminate data and thus weaken the curse of dimensionality. Our experiments
show that our method is competitive and commonly outperforms established
feature selection methods. Furthermore, we propose an approximation that allows
our method to scale to datasets consisting of millions of data points. Our
findings suggest that features that discriminate data and are connected to a
low intrinsic dimensionality are meaningful for learning procedures.
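The abstract only outlines the approach, so the following is a minimal illustrative sketch of the underlying idea: greedily pick features whose joint projection keeps the estimated intrinsic dimension low, here using the TwoNN estimator of Facco et al. as a stand-in for the paper's intrinsic dimensionality computation. The function names, the greedy loop, and the choice of estimator are our own assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' implementation): greedy forward
# feature selection guided by a TwoNN intrinsic-dimension estimate.
# Names such as `two_nn_id` and `select_features_by_id` are hypothetical.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def two_nn_id(X: np.ndarray) -> float:
    """TwoNN estimate (Facco et al., 2017) of the intrinsic dimension of X."""
    nn = NearestNeighbors(n_neighbors=3).fit(X)
    dist, _ = nn.kneighbors(X)            # dist[:, 0] is the point itself
    r1, r2 = dist[:, 1], dist[:, 2]
    mask = r1 > 0                         # drop duplicate points
    mu = r2[mask] / r1[mask]
    return mask.sum() / np.sum(np.log(mu))


def select_features_by_id(X: np.ndarray, k: int) -> list[int]:
    """Greedily pick k features whose joint projection keeps the ID low."""
    selected: list[int] = []
    remaining = list(range(X.shape[1]))
    while len(selected) < k and remaining:
        scores = {j: two_nn_id(X[:, selected + [j]]) for j in remaining}
        best = min(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=500)   # redundant feature
    print(select_features_by_id(X, k=3))
```

Scoring feature subsets rather than single features matters here, since the intrinsic dimension of a one-dimensional projection is trivially close to one.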
Related papers
- Automatic feature selection and weighting using Differentiable Information Imbalance [41.452380773977154]
We introduce the Differentiable Information Imbalance (DII), an automatic data analysis method to rank information content between sets of features.
Based on the nearest neighbors according to distances in the ground truth feature space, the method finds a low-dimensional subset of the input features.
By employing the Differentiable Information Imbalance as a loss function, the relative feature weights of the inputs are optimized, simultaneously performing unit alignment and relative importance scaling (a plain, non-differentiable version of the information imbalance is sketched after this list).
arXiv Detail & Related papers (2024-10-30T11:19:10Z) - Feature Selection from Differentially Private Correlations [35.187113265093615]
High-dimensional regression can leak information about individual datapoints in a dataset.
We employ a correlations-based order statistic to choose important features from a dataset and privatize them.
We find that our method significantly outperforms the established baseline for private feature selection on many datasets.
arXiv Detail & Related papers (2024-08-20T13:54:07Z) - LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the reasoning skills needed for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z) - A Contrast Based Feature Selection Algorithm for High-dimensional Data
set in Machine Learning [9.596923373834093]
We propose a novel filter feature selection method, ContrastFS, which selects discriminative features based on the discrepancies features show between different classes (a rough sketch of such a contrast score appears after this list).
We validate the effectiveness and efficiency of our approach on several widely studied benchmark datasets; the results show that the new method performs favorably with negligible computation.
arXiv Detail & Related papers (2024-01-15T05:32:35Z) - Causal Feature Selection via Transfer Entropy [59.999594949050596]
Causal discovery aims to identify causal relationships between features with observational data.
We introduce a new causal feature selection approach that relies on the forward and backward feature selection procedures.
We provide theoretical guarantees on the regression and classification errors for both the exact and the finite-sample cases.
arXiv Detail & Related papers (2023-10-17T08:04:45Z) - Relative intrinsic dimensionality is intrinsic to learning [49.5738281105287]
We introduce a new notion of the intrinsic dimension of a data distribution, which precisely captures the separability properties of the data.
For this intrinsic dimension, the rule of thumb above becomes a law: high intrinsic dimension guarantees highly separable data.
We show that this relative intrinsic dimension provides both upper and lower bounds on the probability of successfully learning and generalising in a binary classification problem.
arXiv Detail & Related papers (2023-10-10T10:41:45Z) - Intrinsic dimension estimation for discrete metrics [65.5438227932088]
In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces.
We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting.
This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of the sequence space.
arXiv Detail & Related papers (2022-07-20T06:38:36Z) - Compactness Score: A Fast Filter Method for Unsupervised Feature
Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select desired features.
Our proposed algorithm seems to be more accurate and efficient compared with existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z) - Feature Selection Using Reinforcement Learning [0.0]
The space of variables or features that can be used to characterize a particular predictor of interest continues to grow exponentially.
Identifying the most characterizing features that minimize the variance without jeopardizing the bias of our models is critical to successfully training a machine learning model.
arXiv Detail & Related papers (2021-01-23T09:24:37Z) - Evaluating representations by the complexity of learning low-loss
predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z) - Review of Swarm Intelligence-based Feature Selection Methods [3.8848561367220276]
Data mining applications with high dimensional datasets require high speed and accuracy.
One of the dimensionality reduction approaches is feature selection that can increase the accuracy of the data mining task.
State-of-the-art swarm intelligence algorithms are studied, and the recent feature selection methods based on these algorithms are reviewed.
arXiv Detail & Related papers (2020-08-07T05:18:58Z)
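As referenced in the Differentiable Information Imbalance entry above, here is a minimal sketch of the plain, non-differentiable information imbalance Delta(A -> B): the average rank, in space B, of each point's nearest neighbor in space A, rescaled to roughly [0, 1]. The differentiable, weight-optimizing version of DII is not reproduced here, and the function names are hypothetical.

```python
# Sketch of the plain information imbalance Delta(A -> B); names are ours.
import numpy as np
from scipy.spatial.distance import cdist


def information_imbalance(X_a: np.ndarray, X_b: np.ndarray) -> float:
    """How well nearest neighbors in space A predict neighbor ranks in space B."""
    n = X_a.shape[0]
    d_a = cdist(X_a, X_a)
    d_b = cdist(X_b, X_b)
    np.fill_diagonal(d_a, np.inf)                    # exclude self-matches in A
    nn_a = d_a.argmin(axis=1)                        # nearest neighbor of each point in A
    ranks_b = d_b.argsort(axis=1).argsort(axis=1)    # per-row ranks in B (self has rank 0)
    r = ranks_b[np.arange(n), nn_a]                  # rank in B of the A-nearest neighbor
    return 2.0 * r.mean() / n


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 5))
    ground_truth = X                  # full feature space as reference
    subset = X[:, [0, 1]]             # candidate low-dimensional feature subset
    print(information_imbalance(subset, ground_truth))
```

A value near 0 means the candidate subset preserves the neighborhood structure of the ground-truth space; a value near 1 means it carries little information about it.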
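For the ContrastFS entry, the following is a rough, hypothetical stand-in for a contrast-style filter score: features are ranked by how far their class-conditional means spread apart relative to their overall variation. It is meant only to illustrate the "discrepancy between classes" idea, not the published algorithm.

```python
# Crude illustration of a contrast-style filter score (not ContrastFS itself):
# rank features by the spread of their class-wise means relative to their
# overall standard deviation.
import numpy as np


def contrast_scores(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    classes = np.unique(y)
    class_means = np.stack([X[y == c].mean(axis=0) for c in classes])
    spread = class_means.max(axis=0) - class_means.min(axis=0)
    return spread / (X.std(axis=0) + 1e-12)


def top_k_features(X: np.ndarray, y: np.ndarray, k: int) -> np.ndarray:
    return np.argsort(contrast_scores(X, y))[::-1][:k]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    y = rng.integers(0, 3, size=600)
    X = rng.normal(size=(600, 20))
    X[:, 7] += 2.0 * y                     # make feature 7 class-discriminative
    print(top_k_features(X, y, k=5))       # feature 7 should rank first
```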