MDA for random forests: inconsistency, and a practical solution via the
Sobol-MDA
- URL: http://arxiv.org/abs/2102.13347v1
- Date: Fri, 26 Feb 2021 07:53:39 GMT
- Title: MDA for random forests: inconsistency, and a practical solution via the
Sobol-MDA
- Authors: Clément Bénard (LPSM), Sébastien da Veiga, Erwan Scornet (CMAP)
- Abstract summary: Mean Decrease Accuracy (MDA) is widely accepted as the most efficient variable importance measure for random forests.
We mathematically formalize the various implemented MDA algorithms, and then establish their limits when the sample size increases.
We prove the consistency of the Sobol-MDA and show its good empirical performance through experiments on both simulated and real data.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Variable importance measures are the main tools to analyze the black-box
mechanism of random forests. Although the Mean Decrease Accuracy (MDA) is
widely accepted as the most efficient variable importance measure for random
forests, little is known about its theoretical properties. In fact, the exact
MDA definition varies across the main random forest software. In this article,
our objective is to rigorously analyze the behavior of the main MDA
implementations. Consequently, we mathematically formalize the various
implemented MDA algorithms, and then establish their limits when the sample
size increases. In particular, we break these limits down into three components:
the first two are related to Sobol indices, well-defined measures of a
variable's contribution to the output variance that are widely used in the
sensitivity analysis field, whereas the third term grows with the
dependence among input variables. Thus, we theoretically demonstrate that the
MDA does not target the right quantity when inputs are dependent, a fact that
has already been noticed experimentally. To address this issue, we define a new
importance measure for random forests, the Sobol-MDA, which fixes the flaws of
the original MDA. We prove the consistency of the Sobol-MDA and show its good
empirical performance through experiments on both simulated and real data. An
open source implementation in R and C++ is available online.
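The classical MDA that the paper analyzes is the "permute-and-predict" recipe: permute one input column, re-evaluate the forest, and measure the accuracy drop. A minimal sketch of that recipe using scikit-learn's `permutation_importance` (an illustration of the standard MDA, not the authors' Sobol-MDA implementation; the data-generating model and sample sizes here are arbitrary assumptions):

```python
# Sketch of the classical permute-and-predict MDA on a random forest.
# Illustration only: this is the standard MDA the paper critiques, not
# the Sobol-MDA; the simulated model below is an arbitrary assumption.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 2000
# X0 drives the output; X1 is pure noise. Inputs are independent here,
# the regime where the paper shows the MDA limit reduces to Sobol terms.
X = rng.normal(size=(n, 2))
y = X[:, 0] + 0.1 * rng.normal(size=n)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
mda = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
print(mda.importances_mean)  # importance of X0 dominates the noise input X1
```

Under dependent inputs, the paper's result is precisely that this permutation recipe picks up a third, spurious term, which is what the Sobol-MDA removes.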
Related papers
- MMD-based Variable Importance for Distributional Random Forest [5.0459880125089]
We introduce a variable importance algorithm for Distributional Random Forests (DRFs).
We show that the introduced importance measure is consistent, exhibits high empirical performance on both real and simulated data, and outperforms competitors.
arXiv Detail & Related papers (2023-10-18T17:12:29Z) - DCID: Deep Canonical Information Decomposition [84.59396326810085]
We consider the problem of identifying the signal shared between two one-dimensional target variables.
We propose ICM, an evaluation metric which can be used in the presence of ground-truth labels.
We also propose Deep Canonical Information Decomposition (DCID) - a simple, yet effective approach for learning the shared variables.
arXiv Detail & Related papers (2023-06-27T16:59:06Z) - Algorithm-Dependent Bounds for Representation Learning of Multi-Source
Domain Adaptation [7.6249291891777915]
We use information-theoretic tools to derive a novel analysis of Multi-source Domain Adaptation (MDA) from the representation learning perspective.
We propose a novel deep MDA algorithm, implicitly addressing the target shift through joint alignment.
The proposed algorithm has comparable performance to the state-of-the-art on target-shifted MDA benchmark with improved memory efficiency.
arXiv Detail & Related papers (2023-04-04T18:32:20Z) - Estimation-of-Distribution Algorithms for Multi-Valued Decision
Variables [10.165640083594573]
We extend the known quantitative analysis of genetic drift to estimation-of-distribution algorithms for multi-valued variables.
Our work shows that our good understanding of binary EDAs naturally extends to the multi-valued setting.
arXiv Detail & Related papers (2023-02-28T08:52:40Z) - On the Variance of the Fisher Information for Deep Learning [79.71410479830222]
The Fisher information matrix (FIM) has been applied to the realm of deep learning.
The exact FIM is either unavailable in closed form or too expensive to compute.
We investigate two such estimators based on two equivalent representations of the FIM.
arXiv Detail & Related papers (2021-07-09T04:46:50Z) - Rethink Maximum Mean Discrepancy for Domain Adaptation [77.2560592127872]
This paper theoretically proves two essential facts: 1) minimizing the Maximum Mean Discrepancy is equivalent to maximizing the source and target intra-class distances while jointly minimizing their variance with some implicit weights, so that feature discriminability degrades.
Experiments on several benchmark datasets not only validate the theoretical results but also show that our approach substantially outperforms comparable state-of-the-art methods.
arXiv Detail & Related papers (2020-07-01T18:25:10Z) - Rethinking Distributional Matching Based Domain Adaptation [111.15106414932413]
Domain adaptation (DA) is a technique that transfers predictive models trained on a labeled source domain to an unlabeled target domain.
Most popular DA algorithms are based on distributional matching (DM).
In this paper, we first systematically analyze the limitations of DM based methods, and then build new benchmarks with more realistic domain shifts.
arXiv Detail & Related papers (2020-06-23T21:55:14Z) - Neural Methods for Point-wise Dependency Estimation [129.93860669802046]
We focus on estimating point-wise dependency (PD), which quantitatively measures how likely two outcomes co-occur.
We demonstrate the effectiveness of our approaches in 1) MI estimation, 2) self-supervised representation learning, and 3) cross-modal retrieval task.
arXiv Detail & Related papers (2020-06-09T23:26:15Z) - Multi-source Domain Adaptation in the Deep Learning Era: A Systematic
Survey [53.656086832255944]
Multi-source domain adaptation (MDA) is a powerful extension in which the labeled data may be collected from multiple sources.
MDA has attracted increasing attention in both academia and industry.
arXiv Detail & Related papers (2020-02-26T08:07:58Z) - Improving Reliability of Latent Dirichlet Allocation by Assessing Its
Stability Using Clustering Techniques on Replicated Runs [0.3499870393443268]
We study the stability of LDA by comparing assignments from replicated runs.
We propose to quantify the similarity of two generated topics by a modified Jaccard coefficient.
We show that the measure S-CLOP is useful for assessing the stability of LDA models.
arXiv Detail & Related papers (2020-02-14T07:10:18Z) - Trees, forests, and impurity-based variable importance [0.0]
We analyze one of the two well-known random forest variable importances, the Mean Decrease Impurity (MDI).
We prove that if input variables are independent and in absence of interactions, MDI provides a variance decomposition of the output.
Our analysis shows that there may be benefits to using a forest rather than a single tree.
arXiv Detail & Related papers (2020-01-13T14:38:53Z)
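Both the Sobol-MDA above and the MDI result here target variance-based quantities: the first-order Sobol index S_j = Var(E[Y|X_j]) / Var(Y). A minimal pick-freeze estimator of that index, on a toy additive model chosen here purely for illustration (the model and sample size are assumptions, not from either paper):

```python
# Sketch: pick-freeze estimation of the first-order Sobol index
# S_j = Var(E[Y|X_j]) / Var(Y), the quantity the Sobol-MDA targets.
import numpy as np

def f(x):
    # Additive toy model: Y = X0 + X1 with independent standard normal
    # inputs, so each input explains half the variance (S0 = S1 = 0.5).
    return x[:, 0] + x[:, 1]

rng = np.random.default_rng(42)
n = 200_000
A = rng.normal(size=(n, 2))   # first independent input sample
B = rng.normal(size=(n, 2))   # second independent input sample

def first_order_sobol(j):
    AB = B.copy()
    AB[:, j] = A[:, j]        # "freeze" column j at the values from A
    yA, yAB = f(A), f(AB)
    # cov(f(A), f(AB_j)) estimates Var(E[Y|X_j])
    return np.cov(yA, yAB)[0, 1] / np.var(yA, ddof=1)

print(round(first_order_sobol(0), 2))  # close to the true value 0.5
```

The estimator works for any black-box f, which is why Sobol indices are a natural well-defined target for forest importance measures even under dependent inputs.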
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.