A Computational Exploration of Emerging Methods of Variable Importance Estimation
- URL: http://arxiv.org/abs/2208.03373v1
- Date: Fri, 5 Aug 2022 20:00:56 GMT
- Title: A Computational Exploration of Emerging Methods of Variable Importance Estimation
- Authors: Louis Mozart Kamdem and Ernest Fokoue
- Abstract summary: Estimating the importance of variables is an essential task in modern machine learning.
We propose a computational and theoretical exploration of the emerging methods of variable importance estimation.
The implementation has shown that PERF has the best performance in the case of highly correlated data.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Estimating the importance of variables is an essential task in
modern machine learning, as it helps to evaluate the contribution of each
feature to a given model. Several techniques for estimating variable
importance have been developed over the last decade. In this paper, we propose
a computational and theoretical exploration of the emerging methods of
variable importance estimation, namely: Least Absolute Shrinkage and Selection
Operator (LASSO), Support Vector Machine (SVM), the Predictive Error Function
(PERF), Random Forest (RF), and Extreme Gradient Boosting (XGBOOST), tested on
different kinds of real-life and simulated data. All of these methods handle
both regression and classification tasks seamlessly, but all of them fail when
the data contain missing values. Our experiments show that PERF performs best
on highly correlated data, closely followed by RF. PERF and XGBOOST are
"data-hungry" methods: they performed worst on small datasets but are the
fastest in execution time. SVM is the most appropriate when the dataset
contains many redundant features. An added advantage of PERF is its natural
cut-off at zero, which separates positive from negative scores: positive
scores indicate essential, significant features, while negative scores
indicate useless ones. RF and LASSO are versatile in that they can be used in
almost all situations, even though they do not always give the best results.
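As a rough illustration of how such importance scores are obtained in practice, here is a minimal sketch using scikit-learn on synthetic data (not the authors' implementation; PERF has no standard library implementation and is omitted, and XGBOOST exposes feature_importances_ analogously to RF):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.inspection import permutation_importance

# Synthetic regression data with a handful of informative features.
X, y = make_regression(n_samples=500, n_features=10, n_informative=4,
                       noise=0.5, random_state=0)

# LASSO: importance read off the absolute shrunken coefficients.
lasso = Lasso(alpha=0.1).fit(X, y)
lasso_scores = np.abs(lasso.coef_)

# Random Forest: impurity-based importances come for free.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
rf_scores = rf.feature_importances_

# SVM has no built-in importances; permutation importance is one workaround.
svm = SVR(kernel="rbf").fit(X, y)
svm_scores = permutation_importance(svm, X, y, n_repeats=10,
                                    random_state=0).importances_mean

for name, scores in [("LASSO", lasso_scores), ("RF", rf_scores),
                     ("SVM", svm_scores)]:
    print(name, np.round(scores, 3))
```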
Related papers
- A Bio-Medical Snake Optimizer System Driven by Logarithmic Surviving Global Search for Optimizing Feature Selection and its application for Disorder Recognition [1.3755153408022656]
Enhancing medical practice is paramount, given how important it is to protect human life.
Medical therapy can be accelerated by automating patient prediction using machine learning techniques.
Several preprocessing strategies, including feature selection, play a crucial role in this field.
arXiv Detail & Related papers (2024-02-22T09:08:18Z)
- Equation Discovery with Bayesian Spike-and-Slab Priors and Efficient Kernels [57.46832672991433]
We propose a novel equation discovery method based on kernel learning and Bayesian spike-and-slab priors (KBASS).
We use kernel regression to estimate the target function, which is flexible, expressive, and more robust to data sparsity and noise.
We develop an expectation-propagation expectation-maximization algorithm for efficient posterior inference and function estimation.
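For illustration, the kernel-regression ingredient alone can be sketched as follows with scikit-learn's KernelRidge on made-up data (the spike-and-slab prior and the EP-EM inference that define KBASS are not reproduced here):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Noisy, sparse samples of an unknown target function.
rng = np.random.default_rng(0)
x = rng.uniform(0, 2 * np.pi, size=30).reshape(-1, 1)
y = np.sin(x).ravel() + 0.1 * rng.standard_normal(30)

# Kernel (ridge) regression: a flexible, noise-robust function estimate.
model = KernelRidge(kernel="rbf", alpha=0.1, gamma=1.0).fit(x, y)

x_new = np.linspace(0, 2 * np.pi, 5).reshape(-1, 1)
print(model.predict(x_new))  # smooth estimate of the target function
```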
arXiv Detail & Related papers (2023-10-09T03:55:09Z)
- DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models [31.65198592956842]
We propose DataInf, an efficient influence approximation method that is practical for large-scale generative AI models.
Our theoretical analysis shows that DataInf is particularly well-suited for parameter-efficient fine-tuning techniques such as LoRA.
In applications to RoBERTa-large, Llama-2-13B-chat, and stable-diffusion-v1.5 models, DataInf effectively identifies the most influential fine-tuning examples better than other approximate influence scores.
arXiv Detail & Related papers (2023-10-02T04:59:19Z)
- FAStEN: An Efficient Adaptive Method for Feature Selection and Estimation in High-Dimensional Functional Regressions [7.674715791336311]
We propose a new, flexible and ultra-efficient approach to perform feature selection in a sparse function-on-function regression problem.
We show how to extend it to the scalar-on-function framework.
We present an application to brain fMRI data from the AOMIC PIOP1 study.
arXiv Detail & Related papers (2023-03-26T19:41:17Z)
- Model Optimization in Imbalanced Regression [2.580765958706854]
Imbalanced domain learning aims to produce accurate models in predicting instances that, though underrepresented, are of utmost importance for the domain.
One of the main reasons for this is the lack of loss functions capable of focusing on minimizing the errors of extreme (rare) values.
Recently, an evaluation metric was introduced: the Squared Error Relevance Area (SERA).
This metric posits a bigger emphasis on the errors committed at extreme values while also accounting for the performance in the overall target variable domain.
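A back-of-the-envelope sketch of the idea, assuming SERA integrates, over thresholds t in [0, 1], the squared errors of the points whose relevance is at least t (the relevance values below are made-up placeholders):

```python
import numpy as np

def sera(y_true, y_pred, relevance, n_steps=101):
    """Approximate SERA: integrate over thresholds t in [0, 1] the sum of
    squared errors of the points whose relevance is at least t."""
    t = np.linspace(0.0, 1.0, n_steps)
    sq_err = (y_true - y_pred) ** 2
    ser = np.array([sq_err[relevance >= ti].sum() for ti in t])
    return np.sum((ser[1:] + ser[:-1]) / 2 * np.diff(t))  # trapezoid rule

y_true = np.array([1.0, 2.0, 10.0, 12.0])
y_pred = np.array([1.5, 2.5, 6.0, 8.0])
relevance = np.array([0.1, 0.1, 0.9, 1.0])  # placeholder: extremes matter most
print(sera(y_true, y_pred, relevance))  # errors at extreme values dominate
```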
arXiv Detail & Related papers (2022-06-20T20:23:56Z)
- Primal Estimated Subgradient Solver for SVM for Imbalanced Classification [0.0]
We aim to demonstrate that our cost sensitive PEGASOS SVM achieves good performance on imbalanced data sets with a Majority to Minority Ratio ranging from 8.6:1 to 130:1.
We evaluate the performance by examining the learning curves.
We benchmark our PEGASOS Cost-Sensitive SVM's results against Ding's LINEAR SVM DECIDL method.
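For context, a cost-sensitive Pegasos-style update can be sketched as below: the textbook Pegasos subgradient step with a per-class cost on the hinge term, run on toy imbalanced data (not the paper's exact formulation):

```python
import numpy as np

def pegasos_cost_sensitive(X, y, costs, lam=0.01, n_iters=10_000, seed=0):
    """Pegasos: stochastic subgradient descent on the regularized hinge loss,
    with a per-class cost weighting the hinge term for imbalanced data."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, n_iters + 1):
        i = rng.integers(len(y))
        eta = 1.0 / (lam * t)                    # standard Pegasos step size
        margin_violated = y[i] * (X[i] @ w) < 1.0
        w *= (1.0 - eta * lam)                   # L2-regularizer gradient step
        if margin_violated:                      # hinge subgradient, weighted
            w += eta * costs[y[i]] * y[i] * X[i]
    return w

# Toy imbalanced data: label -1 is the majority, +1 the rare class.
rng = np.random.default_rng(1)
n_maj, n_min = 180, 20
X = np.vstack([rng.standard_normal((n_maj, 2)) - 1.0,
               rng.standard_normal((n_min, 2)) + 1.0])
y = np.array([-1] * n_maj + [1] * n_min)
print(pegasos_cost_sensitive(X, y, costs={-1: 1.0, 1: 9.0}))
```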
arXiv Detail & Related papers (2022-06-19T02:33:14Z)
- Compactness Score: A Fast Filter Method for Unsupervised Feature Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select desired features.
Our proposed algorithm seems more accurate and efficient than existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold.
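A minimal sketch of the recipe, assuming the threshold is chosen on labeled source data so that the fraction of examples below it matches the source error rate (all arrays are made-up placeholders):

```python
import numpy as np

def atc_predict_accuracy(src_conf, src_correct, tgt_conf):
    """Average Thresholded Confidence: pick threshold t on source confidences
    so that the fraction below t matches the source error rate, then predict
    target accuracy as the fraction of target confidences at or above t."""
    err = 1.0 - src_correct.mean()
    t = np.quantile(src_conf, err)   # fraction of src_conf below t ~= err
    return (tgt_conf >= t).mean()

rng = np.random.default_rng(0)
src_conf = rng.beta(5, 2, size=1000)               # placeholder confidences
src_correct = (rng.random(1000) < src_conf).astype(float)
tgt_conf = rng.beta(3, 2, size=1000)               # shifted target domain
print(atc_predict_accuracy(src_conf, src_correct, tgt_conf))
```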
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Hyperparameter-free Continuous Learning for Domain Classification in Natural Language Understanding [60.226644697970116]
Domain classification is a fundamental task in natural language understanding (NLU).
Most existing continual learning approaches suffer from low accuracy and performance fluctuation.
We propose a hyperparameter-free continual learning model for text data that can stably produce high performance under various environments.
arXiv Detail & Related papers (2022-01-05T02:46:16Z)
- VSAC: Efficient and Accurate Estimator for H and F [68.65610177368617]
VSAC is a RANSAC-type robust estimator with a number of novelties.
It is significantly faster than all its predecessors and runs on average in 1-2 ms on a CPU.
It is two orders of magnitude faster and yet as precise as MAGSAC++, the currently most accurate estimator of two-view geometry.
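For context, the vanilla RANSAC loop that VSAC builds on looks roughly like this generic line-fitting sketch (none of VSAC's actual novelties are shown):

```python
import numpy as np

def ransac_line(points, n_iters=500, thresh=0.05, seed=0):
    """Vanilla RANSAC for 2D line fitting: repeatedly fit a model to a
    minimal sample and keep the one with the largest inlier set."""
    rng = np.random.default_rng(seed)
    best_inliers, best_model = None, None
    for _ in range(n_iters):
        p1, p2 = points[rng.choice(len(points), size=2, replace=False)]
        d = p2 - p1
        norm = np.linalg.norm(d)
        if norm == 0:
            continue
        n = np.array([-d[1], d[0]]) / norm   # unit normal to the line
        dist = np.abs((points - p1) @ n)     # point-to-line distances
        inliers = dist < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers, best_model = inliers, (p1, n)
    return best_model, best_inliers

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 100)
pts = np.column_stack([x, 2 * x + 0.01 * rng.standard_normal(100)])
pts[::10] += rng.uniform(-1, 1, (10, 2))     # inject gross outliers
model, inliers = ransac_line(pts)
print(inliers.sum(), "inliers found")
```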
arXiv Detail & Related papers (2021-06-18T17:04:57Z)
- Bayesian Optimization with Missing Inputs [53.476096769837724]
We develop a new acquisition function based on the well-known Upper Confidence Bound (UCB) acquisition function.
We conduct comprehensive experiments on both synthetic and real-world applications to show the usefulness of our method.
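For reference, a plain GP-UCB acquisition can be sketched as follows with scikit-learn (the paper's handling of missing inputs is not reproduced; beta is an assumed exploration parameter):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Observed evaluations of an expensive black-box function (toy data).
X_obs = np.array([[0.1], [0.4], [0.9]])
y_obs = np.sin(3 * X_obs).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2)).fit(X_obs, y_obs)

# UCB acquisition: posterior mean plus an exploration bonus scaled by beta.
X_cand = np.linspace(0, 1, 200).reshape(-1, 1)
mu, sigma = gp.predict(X_cand, return_std=True)
beta = 2.0
ucb = mu + np.sqrt(beta) * sigma
print("next query point:", X_cand[np.argmax(ucb)])
```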
arXiv Detail & Related papers (2020-06-19T03:56:27Z)