Model-free feature selection to facilitate automatic discovery of
divergent subgroups in tabular data
- URL: http://arxiv.org/abs/2203.04386v1
- Date: Tue, 8 Mar 2022 20:42:56 GMT
- Title: Model-free feature selection to facilitate automatic discovery of
divergent subgroups in tabular data
- Authors: Girmaw Abebe Tadesse, William Ogallo, Celia Cintas, Skyler Speakman
- Abstract summary: We propose a model-free and sparsity-based automatic feature selection (SAFS) framework to facilitate automatic discovery of divergent subgroups.
We validated SAFS across two publicly available datasets (MIMIC-III and Allstate Claims) and compared it with six existing feature selection methods.
- Score: 4.551615447454768
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data-centric AI emphasizes the need to clean and understand data in order to achieve trustworthy AI. Existing technologies, such as AutoML, make it easier to design and train models automatically, but a comparable level of capability for extracting data-centric insights is still missing. Manual stratification of tabular data by a single feature (e.g., gender) does not scale to higher feature dimensions, a limitation that automatic discovery of divergent subgroups can address. Nonetheless, these automatic discovery techniques often search across a potentially exponential number of feature combinations, a search that a preceding feature selection step can simplify. Existing feature selection techniques for tabular data often fit a particular model in order to select important features, but such model-based selection is prone to model bias and spurious correlations, in addition to requiring extra resources to design, fine-tune, and train a model. In this paper, we propose a model-free and sparsity-based automatic feature selection (SAFS) framework to facilitate the automatic discovery of divergent subgroups. Unlike filter-based selection techniques, we exploit the sparsity of objective measures across feature values to rank and select features. We validated SAFS on two publicly available datasets (MIMIC-III and Allstate Claims) and compared it with six existing feature selection methods. SAFS reduces feature selection time by factors of 81x and 104x, averaged across the existing methods, on the MIMIC-III and Claims datasets respectively. The SAFS-selected features also achieve competitive detection performance: for example, the 18.3% of features selected by SAFS on the Claims dataset detected divergent samples similar to those detected using all features (Jaccard similarity of 0.95) with a 16x reduction in detection time.
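To make the ranking idea concrete, below is a minimal sketch of sparsity-based feature scoring in the spirit of SAFS. The per-value objective measure (deviation of each group's outcome rate from the global rate) and the Gini-style sparsity score are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of sparsity-based feature ranking in the spirit of SAFS.
# Illustrative assumptions: the per-value objective measure is the deviation
# of each group's outcome rate from the global rate, and "sparsity" is
# scored with a Gini coefficient over those measures.
import numpy as np
import pandas as pd

def gini(values: np.ndarray) -> float:
    """Gini coefficient of |values|; approaches 1 when mass is concentrated."""
    v = np.sort(np.abs(values))
    n = len(v)
    if n == 0 or v.sum() == 0:
        return 0.0
    cum = np.cumsum(v)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

def safs_rank(df: pd.DataFrame, features: list, outcome: str) -> pd.Series:
    """Rank categorical features by sparsity of per-value outcome deviations."""
    base = df[outcome].mean()
    scores = {f: gini((df.groupby(f)[outcome].mean() - base).to_numpy())
              for f in features}
    return pd.Series(scores).sort_values(ascending=False)

# Toy usage: the outcome diverges only within one value of 'unit'.
rng = np.random.default_rng(0)
df = pd.DataFrame({"gender": rng.choice(["F", "M"], 1000),
                   "unit": rng.choice(list("ABCD"), 1000)})
df["died"] = (rng.random(1000) < 0.1 + 0.2 * (df["unit"] == "A")).astype(int)
print(safs_rank(df, ["gender", "unit"], "died"))  # 'unit' should rank first
```

The sketch mirrors the abstract's intuition: a feature whose divergence concentrates in a few values is a better candidate for subgroup discovery than one whose values diverge uniformly.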
Related papers
- Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for Lifelong Instruction Tuning.
We construct pseudo-skill clusters by grouping gradient-based sample vectors.
We select the best-performing data selector for each skill cluster from a pool of selector experts.
arXiv Detail & Related papers (2024-10-14T15:48:09Z)
- Feature Selection as Deep Sequential Generative Learning [50.00973409680637]
We develop a deep variational transformer model trained jointly with sequential reconstruction, variational, and performance evaluator losses.
Our model can distill feature selection knowledge and learn a continuous embedding space to map feature selection decision sequences into embedding vectors associated with utility scores.
arXiv Detail & Related papers (2024-03-06T16:31:56Z)
- Unified View Imputation and Feature Selection Learning for Incomplete Multi-view Data [13.079847265195127]
Multi-view unsupervised feature selection (MUFS) is an effective technique for reducing dimensionality in machine learning.
Existing methods cannot directly deal with incomplete multi-view data where some samples are missing in certain views.
UNIFIER explores the local structure of multi-view data by adaptively learning similarity-induced graphs from both the sample and feature spaces.
arXiv Detail & Related papers (2024-01-19T08:26:44Z)
- Automated Model Selection for Tabular Data [0.1797555376258229]
R's mixed-effects linear model library allows users to specify interacting feature combinations in the model design.
We aim to automate the model selection process for predictions on datasets incorporating feature interactions.
The framework includes two distinct approaches for feature selection: a Priority-based Random Grid Search and a Greedy Search method; a rough sketch of the greedy variant follows this entry.
arXiv Detail & Related papers (2024-01-01T21:41:20Z)
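As a rough illustration of the Greedy Search idea above, this sketch greedily grows a set of main-effect and pairwise-interaction terms while cross-validated R^2 improves. The candidate set, scoring model, and stopping rule are assumptions for illustration, not the paper's exact procedure.

```python
# Rough sketch of a greedy forward search over main effects and pairwise
# interaction terms; the candidate set, CV scoring, and stopping rule are
# illustrative assumptions rather than the paper's exact procedure.
import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def greedy_interaction_search(X, y, names, max_terms=5):
    """Add the term (main effect or product) that most improves CV R^2."""
    candidates = {n: X[:, i] for i, n in enumerate(names)}
    for (i, a), (j, b) in combinations(enumerate(names), 2):
        candidates[f"{a}*{b}"] = X[:, i] * X[:, j]
    chosen, best = [], -np.inf
    while len(chosen) < max_terms:
        step_term = None
        for term, col in candidates.items():
            if term in chosen:
                continue
            design = np.column_stack([candidates[t] for t in chosen] + [col])
            score = cross_val_score(LinearRegression(), design, y, cv=5).mean()
            if score > best:
                best, step_term = score, term
        if step_term is None:
            break  # no remaining term improves the CV score
        chosen.append(step_term)
    return chosen, best

# Toy usage: the target depends on an interaction, not on any main effect.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=300)
print(greedy_interaction_search(X, y, ["f0", "f1", "f2", "f3"]))
```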
- A Performance-Driven Benchmark for Feature Selection in Tabular Deep Learning [131.2910403490434]
Data scientists typically collect as many features as possible into their datasets, and even engineer new features from existing ones.
Existing benchmarks for tabular feature selection consider classical downstream models, toy synthetic datasets, or do not evaluate feature selectors on the basis of downstream performance.
We construct a challenging feature selection benchmark evaluated on downstream neural networks including transformers.
We also propose an input-gradient-based analogue of Lasso for neural networks that outperforms classical feature selection methods on challenging problems; the attribution step is sketched after this entry.
arXiv Detail & Related papers (2023-11-10T05:26:10Z)
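A hedged sketch of the input-gradient attribution underlying the Lasso analogue mentioned above; the Lasso-style sparsity mechanism itself is omitted, so this shows only how per-feature saliency can be read off a trained network.

```python
# Rough sketch of input-gradient feature attribution for a trained network;
# the paper's selector adds a Lasso-style mechanism on top of such gradients,
# which is omitted here, so treat this as the attribution step only.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 8)
y = (X[:, 0] - 2 * X[:, 3]).unsqueeze(1) + 0.1 * torch.randn(512, 1)

model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(300):                          # quick fit on the toy regression
    opt.zero_grad()
    nn.functional.mse_loss(model(X), y).backward()
    opt.step()

Xg = X.clone().requires_grad_(True)           # differentiate w.r.t. the inputs
model(Xg).sum().backward()                    # d(output)/d(input) per sample
saliency = Xg.grad.abs().mean(dim=0)          # per-feature importance
print(saliency.argsort(descending=True))      # features 3 and 0 should lead
```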
- Causal Feature Selection via Transfer Entropy [59.999594949050596]
Causal discovery aims to identify causal relationships between features with observational data.
We introduce a new causal feature selection approach that relies on the forward and backward feature selection procedures.
We provide theoretical guarantees on the regression and classification errors for both the exact and the finite-sample cases; a toy transfer entropy estimate is sketched after this entry.
arXiv Detail & Related papers (2023-10-17T08:04:45Z)
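As a toy illustration of the quantity driving the forward and backward procedures above, this sketch computes a plug-in transfer entropy estimate for binary sequences; the selection wrappers and the paper's actual estimators are not reproduced.

```python
# Toy plug-in estimate of transfer entropy TE(X->Y) = I(Y_t ; X_{t-1} | Y_{t-1})
# for binary sequences; the paper wraps such estimates in forward/backward
# selection procedures, which are not reproduced here.
import numpy as np
from collections import Counter

def transfer_entropy(x: np.ndarray, y: np.ndarray) -> float:
    """TE from x to y in bits, using one-step lags and empirical counts."""
    yt, yp, xp = y[1:], y[:-1], x[:-1]        # y_t, y_{t-1}, x_{t-1}
    n = len(yt)
    c_full = Counter(zip(yt, yp, xp))
    c_cond = Counter(zip(yp, xp))
    c_pair = Counter(zip(yt, yp))
    c_prev = Counter(yp)
    te = 0.0
    for (a, b, d), k in c_full.items():
        p_full = k / c_cond[(b, d)]           # p(y_t | y_{t-1}, x_{t-1})
        p_red = c_pair[(a, b)] / c_prev[b]    # p(y_t | y_{t-1})
        te += (k / n) * np.log2(p_full / p_red)
    return te

# Toy usage: y copies the previous x with 10% bit flips, so TE(x->y) >> TE(y->x).
rng = np.random.default_rng(0)
x = rng.integers(0, 2, 5000)
y = np.roll(x, 1) ^ (rng.random(5000) < 0.1)
print(transfer_entropy(x, y), transfer_entropy(y, x))
```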
- Towards Free Data Selection with General-Purpose Models [71.92151210413374]
A desirable data selection algorithm can efficiently choose the most informative samples to maximize the utility of limited annotation budgets.
Current approaches, represented by active learning methods, typically follow a cumbersome pipeline that iterates the time-consuming model training and batch data selection repeatedly.
FreeSel bypasses the heavy batch selection process, achieving a significant efficiency improvement and running 530x faster than existing active learning methods.
arXiv Detail & Related papers (2023-09-29T15:50:14Z)
- Graph-Based Automatic Feature Selection for Multi-Class Classification via Mean Simplified Silhouette [4.786337974720721]
This paper introduces a novel graph-based filter method for automatic feature selection (abbreviated as GB-AFS).
The method determines the minimum combination of features required to sustain prediction performance.
It does not require any user-defined parameters such as the number of features to select.
arXiv Detail & Related papers (2023-09-05T14:37:31Z)
- Learning to Maximize Mutual Information for Dynamic Feature Selection [13.821253491768168]
We consider the dynamic feature selection (DFS) problem where a model sequentially queries features based on the presently available information.
We explore a simpler approach of greedily selecting features based on their conditional mutual information.
The proposed method is shown to recover the greedy policy when trained to optimality, and it outperforms numerous existing feature selection methods in our experiments; the greedy policy is sketched after this entry.
arXiv Detail & Related papers (2023-01-02T08:31:56Z)
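A small sketch of the greedy conditional-mutual-information policy described above, using plug-in estimates on discrete data; the paper's contribution is a learned model that amortizes this policy, which is not reproduced here.

```python
# Rough sketch of greedy feature querying by conditional mutual information,
# with plug-in estimates on discrete data; the paper learns a network that
# amortizes this greedy policy rather than recomputing counts at test time.
import numpy as np
from collections import Counter

def cmi(x, y, z_cols):
    """Plug-in I(x; y | z) in bits; z is the tuple of already-queried columns."""
    n = len(x)
    z = list(zip(*z_cols)) if z_cols else [()] * n
    c_xyz, c_xz = Counter(zip(x, y, z)), Counter(zip(x, z))
    c_yz, c_z = Counter(zip(y, z)), Counter(z)
    return sum((k / n) * np.log2(k * c_z[c] / (c_xz[(a, c)] * c_yz[(b, c)]))
               for (a, b, c), k in c_xyz.items())

def greedy_select(X, y, budget=3):
    """Repeatedly query the feature with the highest CMI given those seen."""
    chosen = []
    while len(chosen) < budget:
        rest = [j for j in range(X.shape[1]) if j not in chosen]
        chosen.append(max(rest, key=lambda j: cmi(X[:, j], y,
                                                  [X[:, c] for c in chosen])))
    return chosen

# Toy usage: only features 1 and 4 carry signal about the label.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(2000, 5))
y = ((X[:, 1] + X[:, 4]) >= 1).astype(int)
print(greedy_select(X, y))  # expect 1 and 4 to be picked first
```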
- Efficient Data-specific Model Search for Collaborative Filtering [56.60519991956558]
Collaborative filtering (CF) is a fundamental approach for recommender systems.
In this paper, motivated by the recent advances in automated machine learning (AutoML), we propose to design a data-specific CF model.
The key is a new framework that unifies state-of-the-art (SOTA) CF methods and splits them into disjoint stages of input encoding, embedding function, interaction, and prediction function.
arXiv Detail & Related papers (2021-06-14T14:30:32Z)
- Joint Adaptive Graph and Structured Sparsity Regularization for Unsupervised Feature Selection [6.41804410246642]
We propose a joint adaptive graph and structured sparsity regularization unsupervised feature selection (JASFS) method.
A subset of optimal features will be selected in groups, and the number of selected features will be determined automatically.
Experimental results on eight benchmarks demonstrate the effectiveness and efficiency of the proposed method.
arXiv Detail & Related papers (2020-10-09T08:17:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.