Local MDI+: Local Feature Importances for Tree-Based Models
- URL: http://arxiv.org/abs/2506.08928v1
- Date: Tue, 10 Jun 2025 15:51:27 GMT
- Title: Local MDI+: Local Feature Importances for Tree-Based Models
- Authors: Zhongyuan Liang, Zachary T. Rewolinski, Abhineet Agarwal, Tiffany M. Tang, Bin Yu,
- Abstract summary: Local MDI+ (LMDI+) is a novel extension of the MDI+ framework to the sample specific setting.<n>It produces similar instance-level feature importance rankings across multiple random forest fits.<n>It also enables local interpretability use cases, including the identification of closer counterfactuals.
- Score: 8.532396185972392
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tree-based ensembles such as random forests remain the go-to for tabular data over deep learning models due to their prediction performance and computational efficiency. These advantages have led to their widespread deployment in high-stakes domains, where interpretability is essential for ensuring trustworthy predictions. This has motivated the development of popular local (i.e. sample-specific) feature importance (LFI) methods such as LIME and TreeSHAP. However, these approaches rely on approximations that ignore the model's internal structure and instead depend on potentially unstable perturbations. These issues are addressed in the global setting by MDI+, a feature importance method which exploits an equivalence between decision trees and linear models on a transformed node basis. However, the global MDI+ scores are not able to explain predictions when faced with heterogeneous individual characteristics. To address this gap, we propose Local MDI+ (LMDI+), a novel extension of the MDI+ framework to the sample specific setting. LMDI+ outperforms existing baselines LIME and TreeSHAP in identifying instance-specific signal features, averaging a 10% improvement in downstream task performance across twelve real-world benchmark datasets. It further demonstrates greater stability by consistently producing similar instance-level feature importance rankings across multiple random forest fits. Finally, LMDI+ enables local interpretability use cases, including the identification of closer counterfactuals and the discovery of homogeneous subgroups.
Related papers
- Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning [77.120955854093]
We show that data diversity can be a strong predictor of generalization in language models.<n>We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients.<n>We present Prismatic Synthesis, a framework for generating diverse synthetic data.
arXiv Detail & Related papers (2025-05-26T16:05:10Z) - SSPNet: Scale and Spatial Priors Guided Generalizable and Interpretable
Pedestrian Attribute Recognition [23.55622798950833]
A novel Scale and Spatial Priors Guided Network (SSPNet) is proposed for Pedestrian Attribute Recognition (PAR) models.
SSPNet learns to provide reasonable scale prior information for different attribute groups, allowing the model to focus on different levels of feature maps.
A novel IoU based attribute localization metric is proposed for Weakly-supervised Pedestrian Attribute localization (WPAL) based on the improved Grad-CAM for attribute response mask.
arXiv Detail & Related papers (2023-12-11T00:41:40Z) - Integrating Random Forests and Generalized Linear Models for Improved Accuracy and Interpretability [9.128252505139471]
We use a framework called RF+ to combine strengths of RFs and generalized linear models.<n>We show that RF+ improves prediction accuracy over RFs and that MDI+ outperforms popular feature importance measures in identifying signal features.
arXiv Detail & Related papers (2023-07-04T21:36:46Z) - Synthetic-to-Real Domain Generalized Semantic Segmentation for 3D Indoor
Point Clouds [69.64240235315864]
This paper introduces the synthetic-to-real domain generalization setting to this task.
The domain gap between synthetic and real-world point cloud data mainly lies in the different layouts and point patterns.
Experiments on the synthetic-to-real benchmark demonstrate that both CINMix and multi-prototypes can narrow the distribution gap.
arXiv Detail & Related papers (2022-12-09T05:07:43Z) - Divide and Contrast: Source-free Domain Adaptation via Adaptive
Contrastive Learning [122.62311703151215]
Divide and Contrast (DaC) aims to connect the good ends of both worlds while bypassing their limitations.
DaC divides the target data into source-like and target-specific samples, where either group of samples is treated with tailored goals.
We further align the source-like domain with the target-specific samples using a memory bank-based Maximum Mean Discrepancy (MMD) loss to reduce the distribution mismatch.
arXiv Detail & Related papers (2022-11-12T09:21:49Z) - Federated and Generalized Person Re-identification through Domain and
Feature Hallucinating [88.77196261300699]
We study the problem of federated domain generalization (FedDG) for person re-identification (re-ID)
We propose a novel method, called "Domain and Feature Hallucinating (DFH)", to produce diverse features for learning generalized local and global models.
Our method achieves the state-of-the-art performance for FedDG on four large-scale re-ID benchmarks.
arXiv Detail & Related papers (2022-03-05T09:15:13Z) - From global to local MDI variable importances for random forests and
when they are Shapley values [9.99125500568217]
We first show that the global Mean Decrease of Impurity (MDI) variable importance scores correspond to Shapley values under some conditions.
We derive a local MDI importance measure of variable relevance, which has a very natural connection with the global MDI measure and can be related to a new notion of local feature relevance.
arXiv Detail & Related papers (2021-11-03T13:38:41Z) - Data-driven advice for interpreting local and global model predictions
in bioinformatics problems [17.685881417954782]
Conditional feature contributions (CFCs) provide textitlocal, case-by-case explanations of a prediction.
We compare the explanations computed by both methods on a set of 164 publicly available classification problems.
For random forests, we find extremely high similarities and correlations of both local and global SHAP values and CFC scores.
arXiv Detail & Related papers (2021-08-13T12:41:39Z) - Joint Distribution across Representation Space for Out-of-Distribution
Detection [16.96466730536722]
We present a novel outlook on in-distribution data in a generative manner, which takes their latent features generated from each hidden layer as a joint distribution across representation spaces.
We first construct the Gaussian Mixture Model (GMM) based on in-distribution latent features for each hidden layer, and then connect GMMs via the transition probabilities of the inference traces.
arXiv Detail & Related papers (2021-03-23T06:39:29Z) - WILDS: A Benchmark of in-the-Wild Distribution Shifts [157.53410583509924]
Distribution shifts can substantially degrade the accuracy of machine learning systems deployed in the wild.
We present WILDS, a curated collection of 8 benchmark datasets that reflect a diverse range of distribution shifts.
We show that standard training results in substantially lower out-of-distribution than in-distribution performance.
arXiv Detail & Related papers (2020-12-14T11:14:56Z) - Diversity inducing Information Bottleneck in Model Ensembles [73.80615604822435]
In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction.
We explicitly optimize a diversity inducing adversarial loss for learning latent variables and thereby obtain diversity in the output predictions necessary for modeling multi-modal data.
Compared to the most competitive baselines, we show significant improvements in classification accuracy, under a shift in the data distribution.
arXiv Detail & Related papers (2020-03-10T03:10:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.