Utilizing Semantic Textual Similarity for Clinical Survey Data Feature
Selection
- URL: http://arxiv.org/abs/2308.09892v1
- Date: Sat, 19 Aug 2023 03:10:51 GMT
- Title: Utilizing Semantic Textual Similarity for Clinical Survey Data Feature
Selection
- Authors: Benjamin C. Warner, Ziqi Xu, Simon Haroutounian, Thomas Kannampallil,
Chenyang Lu
- Abstract summary: Machine learning models that attempt to predict outcomes from survey data can overfit and result in poor generalizability.
One remedy to this issue is feature selection, which attempts to select an optimal subset of features to learn upon.
The relationships between feature names and target names can be evaluated using language models (LMs) to produce semantic textual similarity (STS) scores.
We examine the performance of using STS to select features both directly and within the minimal-redundancy-maximal-relevance (mRMR) algorithm.
- Score: 4.5574502769585745
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Survey data can contain a high number of features while having a
comparatively low quantity of examples. Machine learning models that attempt to
predict outcomes from survey data under these conditions can overfit and result
in poor generalizability. One remedy to this issue is feature selection, which
attempts to select an optimal subset of features to learn upon. A relatively
unexplored source of information in the feature selection process is the usage
of textual names of features, which may be semantically indicative of which
features are relevant to a target outcome. The relationships between feature
names and target names can be evaluated using language models (LMs) to produce
semantic textual similarity (STS) scores, which can then be used to select
features. We examine the performance of using STS to select features both
directly and within the minimal-redundancy-maximal-relevance (mRMR)
algorithm. The performance
of STS as a feature selection metric is evaluated against preliminary survey
data collected as a part of a clinical study on persistent post-surgical pain
(PPSP). The results suggest that features selected with STS can yield
higher-performing models than those produced by traditional feature selection
algorithms.
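To make the two selection modes concrete, below is a minimal sketch, assuming the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint (the specific model is an assumption; the paper does not prescribe one). Feature and target names are embedded once, and cosine similarity serves as the STS score, used either directly (top-k) or as the relevance and redundancy terms in a greedy mRMR loop.

```python
from sentence_transformers import SentenceTransformer, util

# Model choice is an assumption, not prescribed by the paper.
model = SentenceTransformer("all-MiniLM-L6-v2")

def sts_select(feature_names, target_name, k):
    """Rank features by cosine similarity between the embeddings of
    their textual names and the target's name; keep the top k."""
    feats = model.encode(feature_names, convert_to_tensor=True)
    target = model.encode(target_name, convert_to_tensor=True)
    sims = util.cos_sim(feats, target).squeeze(-1)        # STS scores
    top = sims.argsort(descending=True)[:k]
    return [feature_names[int(i)] for i in top]

def sts_mrmr(feature_names, target_name, k):
    """Greedy mRMR-style selection with STS standing in for relevance
    (feature name vs. target name) and redundancy (feature name vs.
    already-selected feature names)."""
    feats = model.encode(feature_names, convert_to_tensor=True)
    target = model.encode(target_name, convert_to_tensor=True)
    relevance = util.cos_sim(feats, target).squeeze(-1)   # (d,)
    redundancy = util.cos_sim(feats, feats)               # (d, d)
    selected = [int(relevance.argmax())]
    while len(selected) < k:
        best, best_score = None, float("-inf")
        for j in range(len(feature_names)):
            if j in selected:
                continue
            # mRMR criterion: relevance minus mean redundancy to selected
            score = float(relevance[j] - redundancy[j, selected].mean())
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return [feature_names[j] for j in selected]
```

For example, `sts_select(["sleep quality", "zip code", "pain intensity"], "persistent post-surgical pain", 2)` would favor the semantically related feature names without touching the data itself.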
Related papers
- LLM-Select: Feature Selection with Large Language Models [64.5099482021597]
Large language models (LLMs) are capable of selecting the most predictive features, with performance rivaling the standard tools of data science.
Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place.
arXiv Detail & Related papers (2024-07-02T22:23:40Z)
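As an illustration of the idea, here is a minimal sketch of prompt-based feature scoring. `ask_llm` is a hypothetical helper standing in for whatever LLM API is available, and the prompt wording is an assumption, not the paper's.

```python
def llm_feature_scores(feature_names, target):
    """Ask an LLM to rate each feature's importance for predicting the
    target; features can then be ranked by the returned scores."""
    scores = {}
    for name in feature_names:
        prompt = (
            f'On a scale from 0 to 1, how important is the feature "{name}" '
            f'for predicting "{target}"? Answer with a single number only.'
        )
        reply = ask_llm(prompt)  # hypothetical LLM call, not a real API
        try:
            scores[name] = float(reply.strip())
        except ValueError:
            scores[name] = 0.0   # unparseable reply: treat as uninformative
    return scores
```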
- Prompt Optimization with EASE? Efficient Ordering-aware Automated Selection of Exemplars [66.823588073584]
Large language models (LLMs) have shown impressive capabilities in real-world applications.
The quality of the in-context exemplars included in a prompt greatly impacts performance.
Existing methods fail to adequately account for the impact of exemplar ordering on performance.
arXiv Detail & Related papers (2024-05-25T08:23:05Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
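A much-simplified sketch of the gradient-similarity idea: rank training examples by how well their gradients align with a target-task gradient. The random projection stands in for the paper's low-rank construction, and the gradients are assumed to be precomputed arrays; LESS itself additionally uses optimizer-aware (Adam) gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_influential(train_grads, target_grad, frac=0.05, proj_dim=256):
    """Keep the top `frac` of training examples by cosine similarity
    between randomly projected per-example gradients (n, p) and a
    target-task gradient (p,)."""
    proj = rng.standard_normal((train_grads.shape[1], proj_dim)) / np.sqrt(proj_dim)
    G = train_grads @ proj                    # low-rank gradient features
    v = target_grad @ proj
    sims = (G @ v) / (np.linalg.norm(G, axis=1) * np.linalg.norm(v) + 1e-12)
    k = max(1, int(frac * len(G)))
    return np.argsort(-sims)[:k]              # indices of selected examples
```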
- A Contrast Based Feature Selection Algorithm for High-dimensional Data set in Machine Learning [9.596923373834093]
We propose a novel filter feature selection method, ContrastFS, which selects discriminative features based on the discrepancies features exhibit between different classes.
We validate the effectiveness and efficiency of our approach on several widely studied benchmark datasets; the results show that the new method performs favorably at negligible computational cost.
arXiv Detail & Related papers (2024-01-15T05:32:35Z)
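A sketch of one between-class contrast statistic consistent with this description; the paper's exact score may differ, so treat this as an illustration of the contrast idea rather than ContrastFS itself.

```python
import numpy as np

def contrast_scores(X, y, eps=1e-12):
    """Score each feature by the largest gap between its class-conditional
    means, normalized by the feature's overall spread; higher scores mark
    more discriminative features, and the top-scoring ones are kept."""
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])  # (C, d)
    return (means.max(axis=0) - means.min(axis=0)) / (X.std(axis=0) + eps)
```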
- A Performance-Driven Benchmark for Feature Selection in Tabular Deep Learning [131.2910403490434]
Data scientists typically collect as many features as possible into their datasets, and even engineer new features from existing ones.
Existing benchmarks for tabular feature selection consider classical downstream models, toy synthetic datasets, or do not evaluate feature selectors on the basis of downstream performance.
We construct a challenging feature selection benchmark evaluated on downstream neural networks including transformers.
We also propose an input-gradient-based analogue of Lasso for neural networks that outperforms classical feature selection methods on challenging problems.
arXiv Detail & Related papers (2023-11-10T05:26:10Z)
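One plausible reading of an input-gradient analogue of Lasso, sketched in PyTorch under the assumption that the penalty targets per-feature input sensitivities (the benchmark paper's exact formulation may differ): the task loss is augmented with an L1 term on mean absolute input gradients, and features are ranked by those sensitivities after training.

```python
import torch

def input_gradient_l1_loss(model, loss_fn, X, y, lam=1e-3):
    """Task loss plus an L1 penalty on per-feature mean absolute input
    gradients, i.e. a Lasso-like sparsity pressure on input sensitivities."""
    X = X.clone().requires_grad_(True)
    task_loss = loss_fn(model(X), y)
    # create_graph=True so the penalty itself can be backpropagated
    (grads,) = torch.autograd.grad(task_loss, X, create_graph=True)
    sensitivity = grads.abs().mean(dim=0)     # (d,) per-feature score
    return task_loss + lam * sensitivity.sum(), sensitivity
```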
- Causal Feature Selection via Transfer Entropy [59.999594949050596]
Causal discovery aims to identify causal relationships between features from observational data.
We introduce a new causal feature selection approach that relies on forward and backward feature selection procedures.
We provide theoretical guarantees on the regression and classification errors for both the exact and the finite-sample cases.
arXiv Detail & Related papers (2023-10-17T08:04:45Z)
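For intuition, a simple plug-in estimate of the transfer entropy TE(X→Y) = I(Y_t ; X_{t-1} | Y_{t-1}) on quantile-discretized series, which a forward selection loop could use as its relevance score. The paper's estimator and its theoretical guarantees go beyond this histogram version.

```python
import numpy as np

def transfer_entropy(x, y, bins=8):
    """Histogram estimate of TE(X -> Y) in nats from two 1-D series."""
    xd = np.digitize(x, np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1]))
    yd = np.digitize(y, np.quantile(y, np.linspace(0, 1, bins + 1)[1:-1]))
    yt, ylag, xlag = yd[1:], yd[:-1], xd[:-1]
    joint = np.zeros((bins, bins, bins))
    np.add.at(joint, (yt, ylag, xlag), 1.0)  # counts of (y_t, y_{t-1}, x_{t-1})
    p = joint / joint.sum()
    p_lags = p.sum(axis=0, keepdims=True)        # p(y_{t-1}, x_{t-1})
    p_yy = p.sum(axis=2, keepdims=True)          # p(y_t, y_{t-1})
    p_yl = p.sum(axis=(0, 2), keepdims=True)     # p(y_{t-1})
    mask = p > 0
    te = np.sum(p[mask] * np.log((p * p_yl)[mask] / (p_yy * p_lags)[mask]))
    return max(te, 0.0)   # plug-in estimates can dip slightly below zero
```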
- Parallel feature selection based on the trace ratio criterion [4.30274561163157]
This work presents a novel parallel feature selection approach for classification, namely Parallel Feature Selection using Trace criterion (PFST).
Our method uses the trace criterion, a measure of class separability used in Fisher's Discriminant Analysis, to evaluate feature usefulness.
The experiments show that our method can produce a small set of features in a fraction of the time taken by the other methods under comparison.
arXiv Detail & Related papers (2022-03-03T10:50:33Z)
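A serial sketch of greedy forward selection under the trace ratio criterion trace(S_b)/trace(S_w) from Fisher's Discriminant Analysis. The candidate evaluations inside the loop are independent, which is exactly what PFST parallelizes; the parallelism itself is omitted here.

```python
import numpy as np

def trace_ratio(X, y):
    """Between-class over within-class scatter, trace(S_b) / trace(S_w),
    for the feature columns currently in X."""
    mu = X.mean(axis=0)
    sb = sw = 0.0
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        sb += len(Xc) * np.sum((mu_c - mu) ** 2)   # between-class scatter
        sw += np.sum((Xc - mu_c) ** 2)             # within-class scatter
    return sb / (sw + 1e-12)

def forward_trace_ratio(X, y, k):
    """Greedily add the feature that most improves the trace ratio."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        scores = [trace_ratio(X[:, selected + [j]], y) for j in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```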
- Compactness Score: A Fast Filter Method for Unsupervised Feature Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select desired features.
Experiments show the proposed algorithm to be more accurate and efficient than existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z)
- Correlation Based Feature Subset Selection for Multivariate Time-Series Data [2.055949720959582]
Correlations in streams of time-series data mean that only a small subset of the features is required for a given data mining task.
We propose a technique that performs feature subset selection based on the correlation patterns of single-feature classifier outputs.
arXiv Detail & Related papers (2021-11-26T17:39:33Z)
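A sketch of the single-feature-classifier idea, assuming a binary target and scikit-learn: fit one classifier per feature, then greedily keep accurate features whose output scores are not strongly correlated with those of features already kept.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def output_correlation_selection(X, y, k, corr_thresh=0.9):
    """Select up to k features whose single-feature classifier outputs
    are accurate yet mutually weakly correlated."""
    n, d = X.shape
    probs, accs = [], []
    for j in range(d):
        clf = LogisticRegression().fit(X[:, [j]], y)
        probs.append(clf.predict_proba(X[:, [j]])[:, 1])  # binary assumed
        accs.append(clf.score(X[:, [j]], y))
    probs = np.array(probs)                   # (d, n) per-feature outputs
    kept = []
    for j in np.argsort(-np.array(accs)):     # most accurate first
        if all(abs(np.corrcoef(probs[j], probs[i])[0, 1]) < corr_thresh
               for i in kept):
            kept.append(int(j))
        if len(kept) == k:
            break
    return kept
```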
- Feature Selection Using Reinforcement Learning [0.0]
The space of variables or features that can be used to characterize a particular predictor of interest continues to grow exponentially.
Identifying the most characterizing features that minimize variance without jeopardizing the bias of our models is critical to successfully training a machine learning model.
arXiv Detail & Related papers (2021-01-23T09:24:37Z)