Trees, forests, and impurity-based variable importance
- URL: http://arxiv.org/abs/2001.04295v3
- Date: Fri, 24 Dec 2021 08:05:29 GMT
- Title: Trees, forests, and impurity-based variable importance
- Authors: Erwan Scornet (CMAP)
- Abstract summary: We analyze one of the two well-known random forest variable importances, the Mean Decrease Impurity (MDI).
We prove that if input variables are independent and interactions are absent, MDI provides a variance decomposition of the output.
Our analysis shows that there may be benefits to using a forest compared to a single tree.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tree ensemble methods such as random forests [Breiman, 2001] are very popular
for handling high-dimensional tabular data sets, notably because of their good
predictive accuracy. However, when machine learning is used for decision-making
problems, settling for the best predictive procedure may not be reasonable,
since enlightened decisions require an in-depth comprehension of the algorithm's
prediction process. Unfortunately, random forests are not intrinsically
interpretable, since their predictions result from averaging several hundred
decision trees. A classic approach to gaining knowledge about this so-called
black-box algorithm is to compute variable importances, which are employed to
assess the predictive impact of each input variable. Variable importances are
then used to rank or select variables and thus play an important role in data
analysis. Nevertheless, there is no justification for using random forest variable
importances in this way: we do not even know what these quantities estimate. In
this paper, we analyze one of the two well-known random forest variable
importances, the Mean Decrease Impurity (MDI). We prove that if input variables
are independent and interactions are absent, MDI provides a variance
decomposition of the output in which the contribution of each variable is clearly
identified. We also study models exhibiting dependence between input variables
or interactions, for which variable importance is intrinsically ill-defined.
Our analysis shows that there may be benefits to using a forest compared
to a single tree.
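The central claim invites a schematic reading. In the paper's setting of independent inputs and no interactions (e.g., an additive model), the total MDI attached to a variable recovers that variable's share of the output variance. The sketch below is a hedged paraphrase of this statement, not the paper's exact theorem or assumptions:

```latex
% Schematic variance decomposition (paraphrase, not the exact result):
% for Y = \sum_{j=1}^{p} m_j(X_j) + \varepsilon with independent X_j
% and noise variance \sigma^2,
\mathrm{MDI}(X_j) \approx \mathrm{Var}\!\big(m_j(X_j)\big),
\qquad
\sum_{j=1}^{p} \mathrm{MDI}(X_j) \approx \mathrm{Var}(Y) - \sigma^2 .
```

This reading is easy to probe empirically with scikit-learn, whose `feature_importances_` attribute implements (normalized) MDI. A minimal sketch, assuming a hypothetical additive model with uniform inputs chosen purely for illustration:

```python
# Hedged sketch: compare normalized MDI against the analytic variance
# shares in an additive model with independent inputs.
# The model y = 2*x0 + x1 + noise is an illustrative assumption,
# not an example taken from the paper.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 10_000
X = rng.uniform(size=(n, 3))          # x0, x1, x2 ~ Uniform(0, 1); x2 carries no signal
y = 2.0 * X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=n)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print("MDI:", forest.feature_importances_)   # normalized Mean Decrease Impurity

# Analytic shares: Var(2*x0) = 4/12 and Var(x1) = 1/12 for Uniform(0, 1),
# so the normalized contributions are roughly [0.8, 0.2, 0.0].
shares = np.array([4 / 12, 1 / 12, 0.0])
print("variance shares:", shares / shares.sum())
```

If the decomposition holds in this regime, the two printed vectors should come out close; in the dependent or interacting regimes studied in the paper, MDI admits no such clean interpretation.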
Related papers
- Why do Random Forests Work? Understanding Tree Ensembles as
Self-Regularizing Adaptive Smoothers [68.76846801719095]
We argue that the high-level dichotomy between bias reduction and variance reduction that prevails in statistics is insufficient for understanding tree ensembles.
We show that forests can improve upon trees by three distinct mechanisms that are usually implicitly entangled.
arXiv Detail & Related papers (2024-02-02T15:36:43Z) - MMD-based Variable Importance for Distributional Random Forest [5.0459880125089]
We introduce a variable importance algorithm for Distributional Random Forests (DRFs).
We show that the introduced importance measure is consistent, exhibits high empirical performance on both real and simulated data, and outperforms competitors.
arXiv Detail & Related papers (2023-10-18T17:12:29Z) - Interpretability at Scale: Identifying Causal Mechanisms in Alpaca [62.65877150123775]
We use Boundless DAS to efficiently search for interpretable causal structure in large language models while they follow instructions.
Our findings mark a first step toward faithfully understanding the inner workings of our ever-growing and most widely deployed language models.
arXiv Detail & Related papers (2023-05-15T17:15:40Z) - Posterior Collapse and Latent Variable Non-identifiability [54.842098835445]
We propose a class of latent-identifiable variational autoencoders, deep generative models which enforce identifiability without sacrificing flexibility.
Across synthetic and real datasets, latent-identifiable variational autoencoders outperform existing methods in mitigating posterior collapse and providing meaningful representations of the data.
arXiv Detail & Related papers (2023-01-02T06:16:56Z) - MURAL: An Unsupervised Random Forest-Based Embedding for Electronic
Health Record Data [59.26381272149325]
We present an unsupervised random forest for representing data with disparate variable types.
MURAL forests consist of a set of decision trees where node-splitting variables are chosen at random.
We show that using our approach, we can visualize and classify data more accurately than competing approaches.
arXiv Detail & Related papers (2021-11-19T22:02:21Z) - Trading Complexity for Sparsity in Random Forest Explanations [20.87501058448681]
We introduce majoritary reasons which are prime implicants of a strict majority of decision trees.
Experiments conducted on various datasets reveal the existence of a trade-off between runtime complexity and sparsity.
arXiv Detail & Related papers (2021-08-11T15:19:46Z) - Counterfactual Invariance to Spurious Correlations: Why and How to Pass
Stress Tests [87.60900567941428]
A 'spurious correlation' is the dependence of a model on some aspect of the input data that an analyst thinks shouldn't matter.
In machine learning, these have a know-it-when-you-see-it character.
We study stress testing using the tools of causal inference.
arXiv Detail & Related papers (2021-05-31T14:39:38Z) - Achieving Reliable Causal Inference with Data-Mined Variables: A Random
Forest Approach to the Measurement Error Problem [1.5749416770494704]
A common empirical strategy involves the application of predictive modeling techniques to 'mine' variables of interest from available data.
Recent work highlights that, because the predictions from machine learning models are inevitably imperfect, econometric analyses based on the predicted variables are likely to suffer from bias due to measurement error.
We propose a novel approach to mitigate these biases, leveraging the ensemble learning technique known as the random forest.
arXiv Detail & Related papers (2020-12-19T21:48:23Z) - Stable Prediction via Leveraging Seed Variable [73.9770220107874]
Previous machine learning methods might exploit subtle spurious correlations in training data induced by non-causal variables for prediction.
We propose a conditional independence test based algorithm to separate causal variables, using a seed variable as a prior, and adopt them for stable prediction.
Our algorithm outperforms state-of-the-art methods for stable prediction.
arXiv Detail & Related papers (2020-06-09T06:56:31Z) - Fr\'echet random forests for metric space valued regression with non
euclidean predictors [0.0]
We introduce Fr'echet trees and Fr'echet random forests, which allow to handle data for which input and output variables take values in general metric spaces.
A consistency theorem for Fr'echet regressogram predictor using data-driven partitions is given and applied to Fr'echet purely uniformly random trees.
arXiv Detail & Related papers (2019-06-04T22:07:24Z)