Related papers: Effect of hyperparameters on variable selection in random forests

URL: http://arxiv.org/abs/2309.06943v2
Date: Sat, 25 Jan 2025 11:32:29 GMT
Title: Effect of hyperparameters on variable selection in random forests
Authors: Cesaire J. K. Fouodo, Lea L. Kronziel, Inke R. König, Silke Szymczak
Abstract summary: We evaluate the effects on the Vita and Boruta variable selection procedures based on two simulation studies utilizing theoretical distributions and empirical gene expression data. For weakly correlated predictor variables, the default value of the number of splitting variables is optimal, but smaller values of the sample fraction result in larger sensitivity.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Random forests (RFs) are well suited for prediction modeling and variable selection in high-dimensional omics studies. The effect of hyperparameters of the RF algorithm on prediction performance and variable importance estimation has previously been investigated. However, how hyperparameters impact RF-based variable selection remains unclear. We evaluate the effects on the Vita and the Boruta variable selection procedures based on two simulation studies utilizing theoretical distributions and empirical gene expression data. We assess the ability of the procedures to select important variables (sensitivity) while controlling the false discovery rate (FDR). Our results show that the proportion of splitting candidate variables and the sample fraction for the training dataset influence the selection procedures more than the drawing strategy of the training datasets and the minimal terminal node size. A suitable setting of the RF hyperparameters depends on the correlation structure in the data. For weakly correlated predictor variables, the default value of the number of splitting variables is optimal, but smaller values of the sample fraction result in larger sensitivity. In contrast, the difference in sensitivity of the optimal compared to the default value of sample fraction is negligible for strongly correlated predictor variables, whereas smaller values than the default are better in the other settings. In conclusion, the default values of the hyperparameters will not always be suitable for identifying important variables. Thus, adequate values differ depending on whether the aim of the study is optimizing prediction performance or variable selection.
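The two evaluation criteria named in the abstract, sensitivity and FDR, can be made concrete with a small sketch. This is not the authors' code: the function name selection_metrics and the variable names g1, g2, ... are hypothetical, and the function computes the false discovery proportion of a single run, which a simulation study would typically average over replicates to estimate the FDR.

```python
def selection_metrics(selected, truly_important):
    """Return (sensitivity, false discovery proportion) for one selection run.

    Both arguments are collections of variable names. The names used in the
    example below are hypothetical and not taken from the paper.
    """
    selected = set(selected)
    truly_important = set(truly_important)
    true_positives = len(selected & truly_important)

    # Sensitivity: share of the truly important variables that were selected.
    if truly_important:
        sensitivity = true_positives / len(truly_important)
    else:
        sensitivity = float("nan")

    # False discovery proportion: share of the selected variables that are
    # not truly important. Set to 0 by convention when nothing is selected.
    if selected:
        fdp = (len(selected) - true_positives) / len(selected)
    else:
        fdp = 0.0

    return sensitivity, fdp


# Hypothetical run: four important variables, four selected, one of them wrong.
truth = {"g1", "g2", "g3", "g4"}
picked = {"g1", "g2", "g3", "g9"}
sens, fdp = selection_metrics(picked, truth)
print(sens, fdp)
```

A procedure that selects more variables can raise sensitivity while also raising the false discovery proportion, which is why the two quantities are reported together.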

Related papers

Regression-Based Estimation of Causal Effects in the Presence of Selection Bias and Confounding [52.1068936424622]
We consider the problem of estimating the expected causal effect $E[Y|do(X)]$ for a target variable $Y$ when treatment $X$ is set by intervention. In settings without selection bias or confounding, $E[Y|do(X)] = E[Y|X]$, which can be estimated using standard regression methods. We propose a framework that incorporates both selection bias and confounding.
arXiv Detail & Related papers (2025-03-26T13:43:37Z)
Knoop: Practical Enhancement of Knockoff with Over-Parameterization for Variable Selection [27.563529091471935]
This work introduces a novel approach, Knockoff with over-parameterization (Knoop), to enhance variable selection. Knoop generates multiple knockoff variables for each original variable and integrates them with the original variables into a Ridgeless regression model. Experiments demonstrate superior performance compared to existing methods in both simulation and real-world datasets.
arXiv Detail & Related papers (2025-01-28T09:27:04Z)
Local Learning for Covariate Selection in Nonparametric Causal Effect Estimation with Latent Variables [15.105594376616253]
Estimating causal effects from nonexperimental data is a fundamental problem in many fields of science. We propose a novel local learning approach for covariate selection in nonparametric causal effect estimation. We validate our algorithm through extensive experiments on both synthetic and real-world data.
arXiv Detail & Related papers (2024-11-25T12:08:54Z)
Semiparametric conformal prediction [79.6147286161434]
Risk-sensitive applications require well-calibrated prediction sets over multiple, potentially correlated target variables. We treat the scores as random vectors and aim to construct the prediction set accounting for their joint correlation structure. We report desired coverage and competitive efficiency on a range of real-world regression problems.
arXiv Detail & Related papers (2024-11-04T14:29:02Z)
Model-independent variable selection via the rule-based variable priority [1.2771542695459488]
We introduce a new model-independent approach, Variable Priority (VarPro). VarPro works by utilizing rules without the need to generate artificial data or evaluate prediction error. We show that VarPro has a consistent filtering property for noise variables.
arXiv Detail & Related papers (2024-09-13T17:32:05Z)
Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values. We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO). Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z)
Winning Prize Comes from Losing Tickets: Improve Invariant Learning by Exploring Variant Parameters for Out-of-Distribution Generalization [76.27711056914168]
Out-of-Distribution (OOD) Generalization aims to learn robust models that generalize well to various environments without fitting to distribution-specific features. Recent studies based on Lottery Ticket Hypothesis (LTH) address this problem by minimizing the learning target to find some of the parameters that are critical to the task. We propose Exploring Variant parameters for Invariant Learning (EVIL) which also leverages the distribution knowledge to find the parameters that are sensitive to distribution shift.
arXiv Detail & Related papers (2023-10-25T06:10:57Z)
Quantifying predictive uncertainty of aphasia severity in stroke patients with sparse heteroscedastic Bayesian high-dimensional regression [47.1405366895538]
Sparse linear regression methods for high-dimensional data commonly assume that residuals have constant variance, which can be violated in practice. This paper proposes estimating high-dimensional heteroscedastic linear regression models using a heteroscedastic partitioned empirical Bayes Expectation Conditional Maximization algorithm.
arXiv Detail & Related papers (2023-09-15T22:06:29Z)
Opening the random forest black box by the analysis of the mutual impact of features [0.0]
We propose two novel approaches that focus on the mutual impact of features in random forests. MFI and MIR show promise for shedding light on the complex relationships between features and outcome.
arXiv Detail & Related papers (2023-04-05T15:03:46Z)
Adaptive Selection of the Optimal Strategy to Improve Precision and Power in Randomized Trials [2.048226951354646]
We show how to select the adjustment approach -- which variables and in which form -- to maximize precision. Our approach maintains Type-I error control (under the null) and offers substantial gains in precision. When applied to real data, we also see meaningful efficiency improvements overall and within subgroups.
arXiv Detail & Related papers (2022-10-31T16:25:38Z)
Sparse high-dimensional linear regression with a partitioned empirical Bayes ECM algorithm [62.997667081978825]
We propose a computationally efficient and powerful Bayesian approach for sparse high-dimensional linear regression. Minimal prior assumptions on the parameters are used through the use of plug-in empirical Bayes estimates. The proposed approach is implemented in the R package probe.
arXiv Detail & Related papers (2022-09-16T19:15:50Z)
Multivariate Probabilistic Regression with Natural Gradient Boosting [63.58097881421937]
We propose a Natural Gradient Boosting (NGBoost) approach based on nonparametrically modeling the conditional parameters of the multivariate predictive distribution. Our method is robust, works out-of-the-box without extensive tuning, is modular with respect to the assumed target distribution, and performs competitively in comparison to existing approaches.
arXiv Detail & Related papers (2021-06-07T17:44:49Z)
Variable selection with missing data in both covariates and outcomes: Imputation and machine learning [1.0333430439241666]
The missing data issue is ubiquitous in health studies. Machine learning methods weaken parametric assumptions. XGBoost and BART have the overall best performance across various settings.
arXiv Detail & Related papers (2021-04-06T20:18:29Z)
Sampling-free Variational Inference for Neural Networks with Multiplicative Activation Noise [51.080620762639434]
We propose a more efficient parameterization of the posterior approximation for sampling-free variational inference. Our approach yields competitive results for standard regression problems and scales well to large-scale image classification tasks.
arXiv Detail & Related papers (2021-03-15T16:16:18Z)
Increasing the efficiency of randomized trial estimates via linear adjustment for a prognostic score [59.75318183140857]
Estimating causal effects from randomized experiments is central to clinical research. Most methods for historical borrowing achieve reductions in variance by sacrificing strict type-I error rate control.
arXiv Detail & Related papers (2020-12-17T21:10:10Z)
Variational Variance: Simple, Reliable, Calibrated Heteroscedastic Noise Variance Parameterization [3.553493344868413]
We propose critiques to test predictive mean and variance calibration and the predictive distribution's ability to generate sensible data. We find that our solution, to treat heteroscedastic variance variationally, sufficiently regularizes variance to pass these posterior predictive checks (PPCs).
arXiv Detail & Related papers (2020-06-08T19:58:35Z)
Hyperparameter Selection for Subsampling Bootstraps [0.0]
A subsampling method like BLB serves as a powerful tool for assessing the quality of estimators for massive data. The performance of the subsampling methods is highly influenced by the selection of tuning parameters. We develop a hyperparameter selection methodology, which can be used to select tuning parameters for subsampling methods. Both simulation studies and real data analysis demonstrate the superior advantage of our method.
arXiv Detail & Related papers (2020-06-02T17:10:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.