Opening the random forest black box by the analysis of the mutual impact of features
- URL: http://arxiv.org/abs/2304.02490v1
- Date: Wed, 5 Apr 2023 15:03:46 GMT
- Title: Opening the random forest black box by the analysis of the mutual impact of features
- Authors: Lucas F. Voges, Lukas C. Jarren, Stephan Seifert
- Abstract summary: We propose two novel approaches that focus on the mutual impact of features in random forests.
MFI and MIR show great promise for shedding light on the complex relationships between features and outcome.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Random forest is a popular machine learning approach for the analysis of
high-dimensional data because it is flexible and provides variable importance
measures for the selection of relevant features. However, the complex
relationships between features are usually not considered in the selection and
are thus also neglected in the characterization of the analysed samples. Here
we propose two novel approaches that focus on the mutual impact of features in
random forests. Mutual forest impact (MFI) is a relation parameter that
evaluates the mutual association of features with the outcome and, hence, goes
beyond the analysis of correlation coefficients. Mutual impurity reduction
(MIR) is an importance measure that combines this relation parameter with the
importance of the individual features. MIR and MFI are implemented together
with testing procedures that generate p-values for the selection of related and
important features. Applications to various simulated data sets and the
comparison to other methods for feature selection and relation analysis show
that MFI and MIR show great promise for shedding light on the complex
relationships between features and outcome. In addition, they are not affected
by common biases, e.g. the preference for features with many possible splits or
high minor allele frequencies.
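For intuition only, below is a minimal sketch of a permutation-based proxy for the pairwise impact of two features on a random forest. It is a hypothetical illustration, not the MFI/MIR estimators defined in the paper, which build on impurity reduction and come with dedicated testing procedures that generate p-values.

```python
# Hypothetical proxy for pairwise feature impact in a random forest; NOT the
# paper's MFI/MIR. It compares the accuracy drop from jointly permuting a
# feature pair against the drops from permuting each feature alone.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
base = rf.score(X_te, y_te)

def perm_drop(cols):
    """Accuracy drop after permuting the given columns with one shared row
    order (a joint permutation preserves the columns' mutual dependence)."""
    Xp = X_te.copy()
    Xp[:, cols] = X_te[rng.permutation(len(X_te))][:, cols]
    return base - rf.score(Xp, y_te)

i, j = 0, 1
# Positive values suggest the pair carries predictive information beyond the
# sum of the individual features; a testing procedure (omitted here) would be
# needed to turn this into a p-value, as the paper does for MFI and MIR.
print(perm_drop([i, j]) - perm_drop([i]) - perm_drop([j]))
```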
Related papers
- Challenges in Variable Importance Ranking Under Correlation [6.718144470265263]
We present a comprehensive simulation study investigating the impact of feature correlation on the assessment of variable importance.
While knockoff variables are uncorrelated with their corresponding predictor variables at low predictor correlations, we prove that the correlation increases linearly beyond a certain correlation threshold between the predictor variables.
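To make the correlation behaviour concrete, here is a sketch of the standard equicorrelated model-X Gaussian knockoff construction; the simulation settings are illustrative assumptions, not taken from the paper.

```python
# Equicorrelated model-X Gaussian knockoffs for zero-mean predictors with
# correlation matrix Sigma (standard construction; illustrative parameters).
import numpy as np

def gaussian_knockoffs(X, Sigma, rng):
    p = Sigma.shape[0]
    # s = min(1, 2*lambda_min) keeps the joint covariance of (X, knockoffs) PSD.
    s = min(1.0, 2.0 * np.linalg.eigvalsh(Sigma).min())
    S = s * np.eye(p)
    Sigma_inv = np.linalg.inv(Sigma)
    mu = X - X @ Sigma_inv @ S                     # conditional mean
    V = 2.0 * S - S @ Sigma_inv @ S                # conditional covariance
    L = np.linalg.cholesky(V + 1e-10 * np.eye(p))  # jitter: V can be singular
    return mu + rng.standard_normal(X.shape) @ L.T

rng = np.random.default_rng(0)
p, rho, n = 5, 0.6, 50_000
Sigma = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Xk = gaussian_knockoffs(X, Sigma, rng)
# In this equicorrelated design corr(X_j, Xk_j) = max(0, 2*rho - 1): zero up
# to rho = 0.5, then growing linearly -- here 2*0.6 - 1 = 0.2.
print(np.corrcoef(X[:, 0], Xk[:, 0])[0, 1])
print(np.corrcoef(X[:, 0], Xk[:, 1])[0, 1], "vs rho =", rho)
```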
arXiv Detail & Related papers (2024-02-05T19:02:13Z)
- Causal Feature Selection via Transfer Entropy [59.999594949050596]
Causal discovery aims to identify causal relationships between features with observational data.
We introduce a new causal feature selection approach that relies on the forward and backward feature selection procedures.
We provide theoretical guarantees on the regression and classification errors for both the exact and the finite-sample cases.
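For a flavour of the quantity involved, below is a minimal plug-in estimator of transfer entropy for discrete series with history length one; this is an assumed toy setting, and the paper's estimators and selection procedures are more general.

```python
# Plug-in transfer entropy TE(X -> Y) for discrete series, history length 1:
# TE = sum p(y_next, y, x) * log2[ p(y_next | y, x) / p(y_next | y) ].
import numpy as np
from collections import Counter

def transfer_entropy(x, y):
    triples = list(zip(y[1:], y[:-1], x[:-1]))   # (y_{t+1}, y_t, x_t)
    n = len(triples)
    c_nyx = Counter(triples)
    c_yx = Counter((yt, xt) for _, yt, xt in triples)
    c_ny = Counter((yn, yt) for yn, yt, _ in triples)
    c_y = Counter(yt for _, yt, _ in triples)
    te = 0.0
    for (yn, yt, xt), c in c_nyx.items():
        p_joint_cond = c / c_yx[(yt, xt)]        # p(y_{t+1} | y_t, x_t)
        p_marg_cond = c_ny[(yn, yt)] / c_y[yt]   # p(y_{t+1} | y_t)
        te += (c / n) * np.log2(p_joint_cond / p_marg_cond)
    return te

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=20_000)
y = np.empty_like(x)
y[0] = 0
flips = rng.random(size=x.size) < 0.1
y[1:] = np.where(flips[1:], 1 - x[:-1], x[:-1])  # y follows x with 10% flips
print("TE(x->y):", transfer_entropy(x, y))       # clearly positive
print("TE(y->x):", transfer_entropy(y, x))       # near zero
```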
arXiv Detail & Related papers (2023-10-17T08:04:45Z)
- On the Properties and Estimation of Pointwise Mutual Information Profiles [49.877314063833296]
The pointwise mutual information profile, or simply profile, is the distribution of pointwise mutual information for a given pair of random variables.
We introduce a novel family of distributions, Bend and Mix Models, for which the profile can be accurately estimated using Monte Carlo methods.
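As a toy illustration of a profile, the sketch below Monte Carlo-estimates the PMI distribution for a bivariate Gaussian pair; the Gaussian pair is an assumption made here for illustration, while the paper's Bend and Mix Models form a much richer family.

```python
# Monte Carlo estimate of the PMI profile for a bivariate Gaussian pair.
import numpy as np
from scipy.stats import multivariate_normal, norm

rho, n = 0.7, 100_000
rng = np.random.default_rng(0)
cov = np.array([[1.0, rho], [rho, 1.0]])
xy = rng.multivariate_normal([0.0, 0.0], cov, size=n)

# PMI(x, y) = log p(x, y) - log p(x) - log p(y), evaluated at joint samples;
# the resulting empirical distribution is the (estimated) profile.
pmi = (multivariate_normal([0.0, 0.0], cov).logpdf(xy)
       - norm.logpdf(xy[:, 0]) - norm.logpdf(xy[:, 1]))

# Sanity check: the mean of the profile is the mutual information, which for
# this pair is analytically -0.5 * log(1 - rho^2).
print(pmi.mean(), -0.5 * np.log(1 - rho ** 2))
```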
arXiv Detail & Related papers (2023-10-16T10:02:24Z)
- A Notion of Feature Importance by Decorrelation and Detection of Trends by Random Forest Regression [1.675857332621569]
We introduce a novel notion of feature importance based on the well-studied Gram-Schmidt decorrelation method.
We propose two estimators for identifying trends in the data using random forest regression.
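The sketch below conveys the decorrelation idea with a naive, order-dependent Gram-Schmidt scoring of features against an outcome; it is an illustrative proxy, not the paper's importance notion or its trend estimators.

```python
# Order-dependent feature scores after Gram-Schmidt decorrelation: each column
# is orthogonalized against the earlier ones, and its score is the squared
# correlation of the residual direction with the outcome.
import numpy as np

def gram_schmidt_scores(X, y):
    n, p = X.shape
    yc = y - y.mean()
    Q = np.empty((n, p))
    scores = np.empty(p)
    for j in range(p):
        v = X[:, j] - X[:, j].mean()
        for k in range(j):                  # modified Gram-Schmidt step
            v -= (Q[:, k] @ v) * Q[:, k]
        Q[:, j] = v / np.linalg.norm(v)
        scores[j] = (Q[:, j] @ yc) ** 2 / (yc @ yc)   # squared correlation
    return scores

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]     # feature 1 nearly copies feature 0
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=500)
# After decorrelation, the redundant feature 1 scores near zero.
print(gram_schmidt_scores(X, y).round(3))
```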
arXiv Detail & Related papers (2023-03-02T11:01:49Z)
- Data-Driven Influence Functions for Optimization-Based Causal Inference [105.5385525290466]
We study a constructive algorithm that approximates Gateaux derivatives for statistical functionals by finite differencing.
We study the case where probability distributions are not known a priori but need to be estimated from data.
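For concreteness, here is a hedged sketch of the finite-differencing idea on a stand-in functional (the mean of a weighted sample); the functionals and estimators analysed in the paper are more general.

```python
# Finite-difference Gateaux derivative: tilt the empirical distribution P_n
# toward a point mass at x0 and difference the functional.
import numpy as np

def functional(weights, points):
    """Example statistical functional T(P): the mean under P, with P
    represented by weighted sample points."""
    return np.average(points, weights=weights)

def gateaux_fd(points, x0, eps=1e-4):
    """Approximates (T((1-eps) P_n + eps delta_{x0}) - T(P_n)) / eps."""
    n = len(points)
    pts = np.append(points, x0)
    w_base = np.append(np.full(n, 1.0 / n), 0.0)
    w_tilt = (1.0 - eps) * w_base
    w_tilt[-1] += eps                       # mix in mass eps at x0
    return (functional(w_tilt, pts) - functional(w_base, pts)) / eps

rng = np.random.default_rng(0)
data = rng.normal(size=1000)
# For the mean, the influence function at x0 is x0 - mean(data); finite
# differencing recovers it.
print(gateaux_fd(data, 2.0), 2.0 - data.mean())
```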
arXiv Detail & Related papers (2022-08-29T16:16:22Z)
- Decorrelate Irrelevant, Purify Relevant: Overcome Textual Spurious Correlations from a Feature Perspective [47.10907370311025]
Natural language understanding (NLU) models tend to rely on spurious correlations (i.e., dataset bias), achieving high performance on in-distribution datasets but poor performance on out-of-distribution ones.
Most existing debiasing methods identify and down-weight the samples with biased features.
Down-weighting these samples, however, keeps the model from learning from their non-biased parts.
We propose to eliminate spurious correlations in a fine-grained manner from a feature space perspective.
arXiv Detail & Related papers (2022-02-16T13:23:14Z)
- Evaluating Sensitivity to the Stick-Breaking Prior in Bayesian Nonparametrics [85.31247588089686]
We show that variational Bayesian methods can yield sensitivities with respect to parametric and nonparametric aspects of Bayesian models.
We provide both theoretical and empirical support for our variational approach to Bayesian sensitivity analysis.
arXiv Detail & Related papers (2021-07-08T03:40:18Z)
- Factorization Machines with Regularization for Sparse Feature Interactions [13.593781209611112]
Factorization machines (FMs) are machine learning predictive models based on second-order feature interactions.
We present a new regularization scheme for feature interaction selection in FMs.
For feature interaction selection, our proposed regularizer makes the feature interaction matrix sparse without the restrictions on sparsity patterns that existing methods impose.
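For reference, the model such regularizers act on is the second-order FM sketched below; the names are illustrative and the paper's regularization scheme itself is not reproduced here.

```python
# Second-order factorization machine:
#   y(x) = w0 + <w, x> + sum_{i<j} <V_i, V_j> x_i x_j,
# computed via the O(nk) identity
#   0.5 * sum_f [ (sum_i V_if x_i)^2 - sum_i V_if^2 x_i^2 ].
import numpy as np

def fm_predict(x, w0, w, V):
    pairwise = 0.5 * np.sum((V.T @ x) ** 2 - (V ** 2).T @ (x ** 2))
    return w0 + w @ x + pairwise

rng = np.random.default_rng(0)
n_features, k = 6, 3
x = rng.normal(size=n_features)
w0, w = 0.1, rng.normal(size=n_features)
V = rng.normal(size=(n_features, k))    # row V_i embeds feature i
# The induced interaction weight matrix is W = V @ V.T (off-diagonal part);
# sparsity-inducing penalties on V or W drive interaction selection.
print(fm_predict(x, w0, w, V))
```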
arXiv Detail & Related papers (2020-10-19T05:00:40Z)
- Out-of-distribution Generalization via Partial Feature Decorrelation [72.96261704851683]
We present a novel Partial Feature Decorrelation Learning (PFDL) algorithm, which jointly optimizes a feature decomposition network and the target image classification model.
The experiments on real-world datasets demonstrate that our method can improve the backbone model's accuracy on OOD image classification datasets.
arXiv Detail & Related papers (2020-07-30T05:48:48Z)
- TCMI: a non-parametric mutual-dependence estimator for multivariate continuous distributions [0.0]
Total cumulative mutual information (TCMI) is a measure of the relevance of mutual dependences.
TCMI is a non-parametric, robust, and deterministic measure that facilitates comparisons and rankings between feature sets.
arXiv Detail & Related papers (2020-01-30T08:42:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences.