Effect of hyperparameters on variable selection in random forests
- URL: http://arxiv.org/abs/2309.06943v2
- Date: Sat, 25 Jan 2025 11:32:29 GMT
- Title: Effect of hyperparameters on variable selection in random forests
- Authors: Cesaire J. K. Fouodo, Lea L. Kronziel, Inke R. König, Silke Szymczak,
- Abstract summary: We evaluate the effects on the Vita and Boruta variable selection procedures based on two simulation studies utilizing theoretical distributions and empirical gene expression data.
For weakly correlated predictor variables, the default value of the number of splitting variables is optimal, but smaller values of the sample fraction result in larger sensitivity.
- Score: 0.0
- License:
- Abstract: Random forests (RFs) are well suited for prediction modeling and variable selection in high-dimensional omics studies. The effect of hyperparameters of the RF algorithm on prediction performance and variable importance estimation have previously been investigated. However, how hyperparameters impact RF-based variable selection remains unclear. We evaluate the effects on the Vita and the Boruta variable selection procedures based on two simulation studies utilizing theoretical distributions and empirical gene expression data. We assess the ability of the procedures to select important variables (sensitivity) while controlling the false discovery rate (FDR). Our results show that the proportion of splitting candidate variables and the sample fraction for the training dataset influence the selection procedures more than the drawing strategy of the training datasets and the minimal terminal node size. A suitable setting of the RF hyperparameters depends on the correlation structure in the data. For weakly correlated predictor variables, the default value of the number of splitting variables is optimal, but smaller values of the sample fraction result in larger sensitivity. In contrast, the difference in sensitivity of the optimal compared to the default value of sample fraction is negligible for strongly correlated predictor variables, whereas smaller values than the default are better in the other settings. In conclusion, the default values of the hyperparameters will not always be suitable for identifying important variables. Thus, adequate values differ depending on whether the aim of the study is optimizing prediction performance or variable selection.
Related papers
- Knoop: Practical Enhancement of Knockoff with Over-Parameterization for Variable Selection [27.563529091471935]
This work introduces a novel approach namely Knockoff with over- parameterization (Knoop) to enhance variable selection.
Knoop generates multiple knockoff variables for each original variable and integrates them with the original variables into a Ridgeless regression model.
Experiments demonstrate superior performance compared to existing methods in both simulation and real-world datasets.
arXiv Detail & Related papers (2025-01-28T09:27:04Z) - Local Learning for Covariate Selection in Nonparametric Causal Effect Estimation with Latent Variables [15.105594376616253]
Estimating causal effects from nonexperimental data is a fundamental problem in many fields of science.
We propose a novel local learning approach for covariate selection in nonparametric causal effect estimation.
We validate our algorithm through extensive experiments on both synthetic and real-world data.
arXiv Detail & Related papers (2024-11-25T12:08:54Z) - Semiparametric conformal prediction [79.6147286161434]
Risk-sensitive applications require well-calibrated prediction sets over multiple, potentially correlated target variables.
We treat the scores as random vectors and aim to construct the prediction set accounting for their joint correlation structure.
We report desired coverage and competitive efficiency on a range of real-world regression problems.
arXiv Detail & Related papers (2024-11-04T14:29:02Z) - Model-independent variable selection via the rule-based variable priority [1.2771542695459488]
We introduce a new model-independent approach, Variable Priority (VarPro)
VarPro works by utilizing rules without the need to generate artificial data or evaluate prediction error.
We show that VarPro has a consistent filtering property for noise variables.
arXiv Detail & Related papers (2024-09-13T17:32:05Z) - Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values.
We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO)
Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z) - Winning Prize Comes from Losing Tickets: Improve Invariant Learning by
Exploring Variant Parameters for Out-of-Distribution Generalization [76.27711056914168]
Out-of-Distribution (OOD) Generalization aims to learn robust models that generalize well to various environments without fitting to distribution-specific features.
Recent studies based on Lottery Ticket Hypothesis (LTH) address this problem by minimizing the learning target to find some of the parameters that are critical to the task.
We propose Exploring Variant parameters for Invariant Learning (EVIL) which also leverages the distribution knowledge to find the parameters that are sensitive to distribution shift.
arXiv Detail & Related papers (2023-10-25T06:10:57Z) - Opening the random forest black box by the analysis of the mutual impact
of features [0.0]
We propose two novel approaches that focus on the mutual impact of features in random forests.
MFI and MIR are very promising to shed light on the complex relationships between features and outcome.
arXiv Detail & Related papers (2023-04-05T15:03:46Z) - Sparse high-dimensional linear regression with a partitioned empirical
Bayes ECM algorithm [62.997667081978825]
We propose a computationally efficient and powerful Bayesian approach for sparse high-dimensional linear regression.
Minimal prior assumptions on the parameters are used through the use of plug-in empirical Bayes estimates.
The proposed approach is implemented in the R package probe.
arXiv Detail & Related papers (2022-09-16T19:15:50Z) - Sampling-free Variational Inference for Neural Networks with
Multiplicative Activation Noise [51.080620762639434]
We propose a more efficient parameterization of the posterior approximation for sampling-free variational inference.
Our approach yields competitive results for standard regression problems and scales well to large-scale image classification tasks.
arXiv Detail & Related papers (2021-03-15T16:16:18Z) - Variable Skipping for Autoregressive Range Density Estimation [84.60428050170687]
We show a technique, variable skipping, for accelerating range density estimation over deep autoregressive models.
We show that variable skipping provides 10-100$times$ efficiency improvements when targeting challenging high-quantile error metrics.
arXiv Detail & Related papers (2020-07-10T19:01:40Z) - Variational Variance: Simple, Reliable, Calibrated Heteroscedastic Noise
Variance Parameterization [3.553493344868413]
We propose critiques to test predictive mean and variance calibration and the predictive distribution's ability to generate sensible data.
We find that our solution, to treat heteroscedastic variance variationally, sufficiently regularizes variance to pass these PPCs.
arXiv Detail & Related papers (2020-06-08T19:58:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.