A Computational Exploration of Emerging Methods of Variable Importance Estimation
- URL: http://arxiv.org/abs/2208.03373v1
- Date: Fri, 5 Aug 2022 20:00:56 GMT
- Title: A Computational Exploration of Emerging Methods of Variable Importance Estimation
- Authors: Louis Mozart Kamdem and Ernest Fokoue
- Abstract summary: Estimating the importance of variables is an essential task in modern machine learning.
We propose a computational and theoretical exploration of the emerging methods of variable importance estimation.
The implementation has shown that PERF has the best performance in the case of highly correlated data.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Estimating the importance of variables is an essential task in
modern machine learning, as it helps to evaluate the contribution of each
feature to a given model. Several techniques for estimating variable
importance have been developed over the last decade. In this paper, we propose
a computational and theoretical exploration of the emerging methods of
variable importance estimation, namely: Least Absolute Shrinkage and Selection
Operator (LASSO), Support Vector Machine (SVM), the Predictive Error Function
(PERF), Random Forest (RF), and Extreme Gradient Boosting (XGBOOST), tested on
different kinds of real-life and simulated data. All of these methods handle
both regression and classification tasks seamlessly, but all of them fail when
the data contain missing values. Our experiments show that PERF performs best
on highly correlated data, closely followed by RF. PERF and XGBOOST are
"data-hungry" methods: they performed worst on small datasets but are the
fastest in execution time. SVM is the most appropriate when the dataset
contains many redundant features. An added advantage of PERF is its natural
cut-off at zero, which separates positive from negative scores: positive
scores indicate essential, significant features, while negative scores
indicate useless ones. RF and LASSO are versatile in that they can be used in
almost all situations, even though they do not always give the best results.
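As a rough illustration of how such importance scores are obtained in practice, here is a minimal sketch using scikit-learn on synthetic data (not the authors' implementation; PERF has no standard library implementation and is omitted, and XGBOOST exposes feature_importances_ analogously to RF):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.inspection import permutation_importance

# Synthetic regression data with a handful of informative features.
X, y = make_regression(n_samples=500, n_features=10, n_informative=4,
                       noise=0.5, random_state=0)

# LASSO: importance read off the absolute shrunken coefficients.
lasso = Lasso(alpha=0.1).fit(X, y)
lasso_scores = np.abs(lasso.coef_)

# Random Forest: impurity-based importances come for free.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
rf_scores = rf.feature_importances_

# SVM has no built-in importances; permutation importance is one workaround.
svm = SVR(kernel="rbf").fit(X, y)
svm_scores = permutation_importance(svm, X, y, n_repeats=10,
                                    random_state=0).importances_mean

for name, scores in [("LASSO", lasso_scores), ("RF", rf_scores),
                     ("SVM", svm_scores)]:
    print(name, np.round(scores, 3))
```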
Related papers
- A Bio-Medical Snake Optimizer System Driven by Logarithmic Surviving Global Search for Optimizing Feature Selection and its application for Disorder Recognition [1.3755153408022656]
Enhancing medical practice is paramount, given how important it is to protect human life.
Medical therapy can be accelerated by automating patient prediction using machine learning techniques.
Several preprocessing strategies, including feature selection, play a crucial role in this field.
arXiv Detail & Related papers (2024-02-22T09:08:18Z)
- Equation Discovery with Bayesian Spike-and-Slab Priors and Efficient Kernels [57.46832672991433]
We propose a novel equation discovery method based on kernel learning and Bayesian spike-and-slab priors (KBASS).
We use kernel regression to estimate the target function, which is flexible, expressive, and more robust to data sparsity and noise.
We develop an expectation-propagation expectation-maximization algorithm for efficient posterior inference and function estimation.
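For illustration, the kernel-regression ingredient alone can be sketched as follows with scikit-learn's KernelRidge on made-up data (the spike-and-slab prior and the EP-EM inference that define KBASS are not reproduced here):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Noisy, sparse samples of an unknown target function.
rng = np.random.default_rng(0)
x = rng.uniform(0, 2 * np.pi, size=30).reshape(-1, 1)
y = np.sin(x).ravel() + 0.1 * rng.standard_normal(30)

# Kernel (ridge) regression: a flexible, noise-robust function estimate.
model = KernelRidge(kernel="rbf", alpha=0.1, gamma=1.0).fit(x, y)

x_new = np.linspace(0, 2 * np.pi, 5).reshape(-1, 1)
print(model.predict(x_new))  # smooth estimate of the target function
```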
arXiv Detail & Related papers (2023-10-09T03:55:09Z)
- DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models [31.65198592956842]
We propose DataInf, an efficient influence approximation method that is practical for large-scale generative AI models.
Our theoretical analysis shows that DataInf is particularly well-suited for parameter-efficient fine-tuning techniques such as LoRA.
In applications to RoBERTa-large, Llama-2-13B-chat, and stable-diffusion-v1.5 models, DataInf effectively identifies the most influential fine-tuning examples better than other approximate influence scores.
arXiv Detail & Related papers (2023-10-02T04:59:19Z)
- FAStEN: An Efficient Adaptive Method for Feature Selection and Estimation in High-Dimensional Functional Regressions [7.674715791336311]
We propose a new, flexible and ultra-efficient approach to perform feature selection in a sparse function-on-function regression problem.
We show how to extend it to the scalar-on-function framework.
We present an application to brain fMRI data from the AOMIC PIOP1 study.
arXiv Detail & Related papers (2023-03-26T19:41:17Z)
- Model Optimization in Imbalanced Regression [2.580765958706854]
Imbalanced domain learning aims to produce accurate models in predicting instances that, though underrepresented, are of utmost importance for the domain.
One of the main reasons for this is the lack of loss functions capable of focusing on minimizing the errors of extreme (rare) values.
Recently, an evaluation metric was introduced: the Squared Error Relevance Area (SERA).
This metric posits a bigger emphasis on the errors committed at extreme values while also accounting for the performance in the overall target variable domain.
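A back-of-the-envelope sketch of the idea, assuming SERA integrates, over thresholds t in [0, 1], the squared errors of the points whose relevance is at least t (the relevance values below are made-up placeholders):

```python
import numpy as np

def sera(y_true, y_pred, relevance, n_steps=101):
    """Approximate SERA: integrate over thresholds t in [0, 1] the sum of
    squared errors of the points whose relevance is at least t."""
    t = np.linspace(0.0, 1.0, n_steps)
    sq_err = (y_true - y_pred) ** 2
    ser = np.array([sq_err[relevance >= ti].sum() for ti in t])
    return np.sum((ser[1:] + ser[:-1]) / 2 * np.diff(t))  # trapezoid rule

y_true = np.array([1.0, 2.0, 10.0, 12.0])
y_pred = np.array([1.5, 2.5, 6.0, 8.0])
relevance = np.array([0.1, 0.1, 0.9, 1.0])  # placeholder: extremes matter most
print(sera(y_true, y_pred, relevance))  # errors at extreme values dominate
```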
arXiv Detail & Related papers (2022-06-20T20:23:56Z)
- Primal Estimated Subgradient Solver for SVM for Imbalanced Classification [0.0]
We aim to demonstrate that our cost sensitive PEGASOS SVM achieves good performance on imbalanced data sets with a Majority to Minority Ratio ranging from 8.6:1 to 130:1.
We evaluate the performance by examining the learning curves.
We benchmark our PEGASOS Cost-Sensitive SVM's results against Ding's LINEAR SVM DECIDL method.
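For context, a cost-sensitive Pegasos-style update can be sketched as below: the textbook Pegasos subgradient step with a per-class cost on the hinge term, run on toy imbalanced data (not the paper's exact formulation):

```python
import numpy as np

def pegasos_cost_sensitive(X, y, costs, lam=0.01, n_iters=10_000, seed=0):
    """Pegasos: stochastic subgradient descent on the regularized hinge loss,
    with a per-class cost weighting the hinge term for imbalanced data."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, n_iters + 1):
        i = rng.integers(len(y))
        eta = 1.0 / (lam * t)                    # standard Pegasos step size
        margin_violated = y[i] * (X[i] @ w) < 1.0
        w *= (1.0 - eta * lam)                   # L2-regularizer gradient step
        if margin_violated:                      # hinge subgradient, weighted
            w += eta * costs[y[i]] * y[i] * X[i]
    return w

# Toy imbalanced data: label -1 is the majority, +1 the rare class.
rng = np.random.default_rng(1)
n_maj, n_min = 180, 20
X = np.vstack([rng.standard_normal((n_maj, 2)) - 1.0,
               rng.standard_normal((n_min, 2)) + 1.0])
y = np.array([-1] * n_maj + [1] * n_min)
print(pegasos_cost_sensitive(X, y, costs={-1: 1.0, 1: 9.0}))
```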
arXiv Detail & Related papers (2022-06-19T02:33:14Z)
- Compactness Score: A Fast Filter Method for Unsupervised Feature Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select desired features.
Our proposed algorithm seems more accurate and efficient than existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold.
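A minimal sketch of the recipe, assuming the threshold is chosen on labeled source data so that the fraction of examples below it matches the source error rate (all arrays are made-up placeholders):

```python
import numpy as np

def atc_predict_accuracy(src_conf, src_correct, tgt_conf):
    """Average Thresholded Confidence: pick threshold t on source confidences
    so that the fraction below t matches the source error rate, then predict
    target accuracy as the fraction of target confidences at or above t."""
    err = 1.0 - src_correct.mean()
    t = np.quantile(src_conf, err)   # fraction of src_conf below t ~= err
    return (tgt_conf >= t).mean()

rng = np.random.default_rng(0)
src_conf = rng.beta(5, 2, size=1000)               # placeholder confidences
src_correct = (rng.random(1000) < src_conf).astype(float)
tgt_conf = rng.beta(3, 2, size=1000)               # shifted target domain
print(atc_predict_accuracy(src_conf, src_correct, tgt_conf))
```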
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Hyperparameter-free Continuous Learning for Domain Classification in Natural Language Understanding [60.226644697970116]
Domain classification is a fundamental task in natural language understanding (NLU).
Most existing continual learning approaches suffer from low accuracy and performance fluctuation.
We propose a hyperparameter-free continual learning model for text data that can stably produce high performance under various environments.
arXiv Detail & Related papers (2022-01-05T02:46:16Z)
- VSAC: Efficient and Accurate Estimator for H and F [68.65610177368617]
VSAC is a RANSAC-type robust estimator with a number of novelties.
It is significantly faster than all its predecessors and runs on average in 1-2 ms on a CPU.
It is two orders of magnitude faster and yet as precise as MAGSAC++, the currently most accurate estimator of two-view geometry.
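For context, the vanilla RANSAC loop that VSAC builds on looks roughly like this generic line-fitting sketch (none of VSAC's actual novelties are shown):

```python
import numpy as np

def ransac_line(points, n_iters=500, thresh=0.05, seed=0):
    """Vanilla RANSAC for 2D line fitting: repeatedly fit a model to a
    minimal sample and keep the one with the largest inlier set."""
    rng = np.random.default_rng(seed)
    best_inliers, best_model = None, None
    for _ in range(n_iters):
        p1, p2 = points[rng.choice(len(points), size=2, replace=False)]
        d = p2 - p1
        norm = np.linalg.norm(d)
        if norm == 0:
            continue
        n = np.array([-d[1], d[0]]) / norm   # unit normal to the line
        dist = np.abs((points - p1) @ n)     # point-to-line distances
        inliers = dist < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers, best_model = inliers, (p1, n)
    return best_model, best_inliers

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 100)
pts = np.column_stack([x, 2 * x + 0.01 * rng.standard_normal(100)])
pts[::10] += rng.uniform(-1, 1, (10, 2))     # inject gross outliers
model, inliers = ransac_line(pts)
print(inliers.sum(), "inliers found")
```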
arXiv Detail & Related papers (2021-06-18T17:04:57Z)
- Bayesian Optimization with Missing Inputs [53.476096769837724]
We develop a new acquisition function based on the well-known Upper Confidence Bound (UCB) acquisition function.
We conduct comprehensive experiments on both synthetic and real-world applications to show the usefulness of our method.
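For reference, a plain GP-UCB acquisition can be sketched as follows with scikit-learn (the paper's handling of missing inputs is not reproduced; beta is an assumed exploration parameter):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Observed evaluations of an expensive black-box function (toy data).
X_obs = np.array([[0.1], [0.4], [0.9]])
y_obs = np.sin(3 * X_obs).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2)).fit(X_obs, y_obs)

# UCB acquisition: posterior mean plus an exploration bonus scaled by beta.
X_cand = np.linspace(0, 1, 200).reshape(-1, 1)
mu, sigma = gp.predict(X_cand, return_std=True)
beta = 2.0
ucb = mu + np.sqrt(beta) * sigma
print("next query point:", X_cand[np.argmax(ucb)])
```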
arXiv Detail & Related papers (2020-06-19T03:56:27Z)