MMD-based Variable Importance for Distributional Random Forest
- URL: http://arxiv.org/abs/2310.12115v2
- Date: Wed, 14 Feb 2024 13:56:50 GMT
- Title: MMD-based Variable Importance for Distributional Random Forest
- Authors: Clément Bénard, Jeffrey Näf and Julie Josse
- Abstract summary: We introduce a variable importance algorithm for Distributional Random Forests (DRFs).
We show that the introduced importance measure is consistent, exhibits high empirical performance on both real and simulated data, and outperforms competitors.
- Score: 5.0459880125089
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distributional Random Forest (DRF) is a flexible forest-based method to
estimate the full conditional distribution of a multivariate output of interest
given input variables. In this article, we introduce a variable importance
algorithm for DRFs, based on the well-established drop and relearn principle
and MMD distance. While traditional importance measures only detect variables
with an influence on the output mean, our algorithm detects variables impacting
the output distribution more generally. We show that the introduced importance
measure is consistent, exhibits high empirical performance on both real and
simulated data, and outperforms competitors. In particular, our algorithm is
highly efficient to select variables through recursive feature elimination, and
can therefore provide small sets of variables to build accurate estimates of
conditional output distributions.
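As a rough illustration of the drop-and-relearn idea with an MMD distance, here is a minimal, hedged sketch in Python. It is not the paper's algorithm: a plain scikit-learn RandomForestRegressor stands in for DRF as the conditional-distribution estimator, via the adaptive nearest-neighbor weights its leaves induce, and the kernel bandwidth, forest hyperparameters, function names, and toy data are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics.pairwise import rbf_kernel


def forest_weights(forest, X_train, X_query):
    """Adaptive nearest-neighbor weights induced by a fitted forest:
    w[i, j] is the weight training point j receives for query i,
    averaged over trees (1/leaf size if they share a leaf, else 0)."""
    leaves_train = forest.apply(X_train)   # (n_train, n_trees)
    leaves_query = forest.apply(X_query)   # (n_query, n_trees)
    w = np.zeros((X_query.shape[0], X_train.shape[0]))
    for t in range(leaves_query.shape[1]):
        same_leaf = leaves_query[:, [t]] == leaves_train[:, t]
        w += same_leaf / same_leaf.sum(axis=1, keepdims=True)
    return w / leaves_query.shape[1]


def mmd_importance(X, Y, X_test, gamma=1.0, **rf_kwargs):
    """Drop-and-relearn importance: average squared MMD between the
    conditional-distribution estimates of the full model and of a
    model refit without variable j."""
    K = rbf_kernel(Y, Y, gamma=gamma)      # Gram matrix of training responses
    full = RandomForestRegressor(**rf_kwargs).fit(X, Y)
    w_full = forest_weights(full, X, X_test)
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        X_drop = np.delete(X, j, axis=1)
        refit = RandomForestRegressor(**rf_kwargs).fit(X_drop, Y)
        w_drop = forest_weights(refit, X_drop, np.delete(X_test, j, axis=1))
        # Both estimates are weighted empirical distributions over the same
        # training responses, so MMD^2(x) = (w - v)^T K (w - v).
        d = w_full - w_drop
        importance[j] = np.mean(np.sum((d @ K) * d, axis=1))
    return importance


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
# Bivariate output: X0 shifts the mean of Y[:, 0], X1 scales the spread
# of Y[:, 1]; X2 and X3 are pure noise variables.
Y = np.column_stack([2 * X[:, 0] + rng.normal(size=500),
                     np.exp(X[:, 1]) * rng.normal(size=500)])
print(mmd_importance(X[:400], Y[:400], X[400:],
                     n_estimators=100, min_samples_leaf=10, random_state=0))
```

In the toy data, X1 affects only the spread of the output, so a mean-based importance measure would tend to miss it while the MMD-based score should not.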
Related papers
- Efficient Distribution Matching of Representations via Noise-Injected Deep InfoMax [73.03684002513218]
We enhance Deep InfoMax (DIM) to enable automatic matching of learned representations to a selected prior distribution.
We show that such modification allows for learning uniformly and normally distributed representations.
The results indicate a moderate trade-off between downstream-task performance and the quality of distribution matching (DM).
arXiv Detail & Related papers (2024-10-09T15:40:04Z)
- Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation [62.2436697657307]
Prediction-powered inference (PPI) improves statistical estimates from limited human-labeled data by exploiting model predictions on additional unlabeled data.
We propose a method called Stratified Prediction-Powered Inference (StratPPI).
We show that the basic PPI estimates can be considerably improved by employing simple data stratification strategies (a minimal sketch follows this entry).
arXiv Detail & Related papers (2024-06-06T17:37:39Z)
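The entry above describes the estimator only in words; the following is a minimal sketch of the basic prediction-powered mean estimator and a simple stratified variant. The function names, the known-stratum-weights setup, and the toy data are illustrative assumptions rather than the paper's API; the actual StratPPI additionally tunes the per-stratum weighting.

```python
import numpy as np


def ppi_mean(y_lab, preds_lab, preds_unlab):
    """Basic PPI mean estimate: model mean on unlabeled data plus a
    bias correction estimated from labeled data."""
    return preds_unlab.mean() + (y_lab - preds_lab).mean()


def stratified_ppi_mean(y_lab, preds_lab, preds_unlab,
                        strata_lab, strata_unlab, weights):
    """Apply the PPI estimator within each stratum, then recombine
    with known stratum weights (a simplified form of StratPPI)."""
    return sum(p_k * ppi_mean(y_lab[strata_lab == k],
                              preds_lab[strata_lab == k],
                              preds_unlab[strata_unlab == k])
               for k, p_k in weights.items())


rng = np.random.default_rng(0)
strata_lab = rng.integers(0, 2, size=100)
strata_unlab = rng.integers(0, 2, size=10_000)
y_lab = strata_lab + rng.normal(size=100)            # true mean is 0.5
preds_lab = y_lab + 0.3                              # a biased predictor
preds_unlab = strata_unlab + 0.3 + rng.normal(size=10_000)
print(stratified_ppi_mean(y_lab, preds_lab, preds_unlab,
                          strata_lab, strata_unlab, {0: 0.5, 1: 0.5}))
```

The bias correction cancels the predictor's constant offset, so the printed estimate should land near the true mean 0.5.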
- Embedding Trajectory for Out-of-Distribution Detection in Mathematical Reasoning [50.84938730450622]
We propose a trajectory-based method, the TV score, which uses trajectory volatility for OOD detection in mathematical reasoning.
Our method outperforms traditional algorithms on generative language models (GLMs) in mathematical reasoning scenarios.
Our method can be extended to more applications with high-density features in output spaces, such as multiple-choice questions.
arXiv Detail & Related papers (2024-05-22T22:22:25Z)
- Multiple Hypothesis Dropout: Estimating the Parameters of Multi-Modal Output Distributions [22.431244647796582]
This paper presents a Mixture of Multiple-Output functions (MoM) approach using a novel variant of dropout, Multiple Hypothesis Dropout (a generic winner-takes-all sketch follows this entry).
Experiments on supervised learning problems illustrate that our approach outperforms existing solutions for reconstructing multimodal output distributions.
Additional studies on unsupervised learning problems show that estimating the parameters of latent posterior distributions within a discrete autoencoder significantly improves codebook efficiency, sample quality, precision and recall.
arXiv Detail & Related papers (2023-12-18T22:20:11Z)
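The entry above names the mechanism without detail. Below is a short sketch of the classic multiple-hypothesis, winner-takes-all training idea that mixture-of-multiple-output approaches build on; it is not the paper's Multiple Hypothesis Dropout (which derives the hypotheses from a dropout variant rather than from separate heads), and all names and data here are illustrative.

```python
import torch
import torch.nn as nn

K = 5  # number of hypotheses (output heads), chosen arbitrarily

# Shared trunk with K output heads; each head is one hypothesis.
trunk = nn.Sequential(nn.Linear(1, 64), nn.ReLU())
heads = nn.ModuleList([nn.Linear(64, 1) for _ in range(K)])
opt = torch.optim.Adam([*trunk.parameters(), *heads.parameters()], lr=1e-3)

# Toy bimodal data: y = +x or -x with equal probability.
torch.manual_seed(0)
x = torch.rand(512, 1) * 2 - 1
signs = torch.randint(0, 2, (512, 1)).float() * 2 - 1
y = x * signs

for step in range(2000):
    h = trunk(x)
    preds = torch.stack([head(h) for head in heads], dim=1)   # (N, K, 1)
    errors = (preds - y.unsqueeze(1)).pow(2).squeeze(-1)      # (N, K)
    # Winner-takes-all: only the best head per sample receives gradient,
    # so different heads specialize on different output modes.
    loss = errors.min(dim=1).values.mean()
    opt.zero_grad(); loss.backward(); opt.step()

# After training, the heads should land near the two modes +x and -x
# rather than collapsing to their uninformative average 0.
```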
- DIVERSIFY: A General Framework for Time Series Out-of-distribution Detection and Generalization [58.704753031608625]
Time series is one of the most challenging modalities in machine learning research.
OOD detection and generalization on time series tend to suffer because of their non-stationarity.
We propose DIVERSIFY, a framework for OOD detection and generalization on dynamic distributions of time series.
arXiv Detail & Related papers (2023-08-04T12:27:11Z)
- Label Shift Quantification with Robustness Guarantees via Distribution Feature Matching [3.2013172123155615]
We first present a unifying framework, distribution feature matching (DFM), that recovers, as particular instances, various estimators introduced in the previous literature.
We then extend this analysis to study the robustness of DFM procedures in the misspecified setting, under departures from the exact label shift hypothesis.
These theoretical findings are confirmed by a detailed numerical study on simulated and real-world datasets (a minimal mean-matching sketch follows this entry).
arXiv Detail & Related papers (2023-06-07T12:17:34Z)
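As a toy illustration of the idea summarized above, the sketch below matches raw feature means, i.e. DFM with the identity feature map; the paper's framework covers richer feature maps such as kernel mean embeddings. The nonnegative least-squares solve with renormalization is a simple stand-in for a proper simplex-constrained solver, and all names and data are illustrative.

```python
import numpy as np
from scipy.optimize import nnls


def dfm_label_shift(X_src, y_src, X_tgt):
    """Estimate target class proportions under label shift by matching
    feature means: solve min_a || mean(X_tgt) - M a || with a >= 0,
    where column c of M is the source feature mean of class c."""
    classes = np.unique(y_src)
    M = np.stack([X_src[y_src == c].mean(axis=0) for c in classes], axis=1)
    alpha, _ = nnls(M, X_tgt.mean(axis=0))
    return classes, alpha / alpha.sum()   # renormalize onto the simplex


# Toy check: two Gaussian classes, target re-weighted to 80/20.
rng = np.random.default_rng(0)
X_src = np.vstack([rng.normal(0, 1, size=(1000, 5)),
                   rng.normal(2, 1, size=(1000, 5))])
y_src = np.repeat([0, 1], 1000)
X_tgt = np.vstack([rng.normal(0, 1, size=(800, 5)),
                   rng.normal(2, 1, size=(200, 5))])
print(dfm_label_shift(X_src, y_src, X_tgt))  # approx [0.8, 0.2]
```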
- Efficient CDF Approximations for Normalizing Flows [64.60846767084877]
We build upon the diffeomorphic properties of normalizing flows to estimate the cumulative distribution function (CDF) over a closed region.
Our experiments on popular flow architectures and UCI datasets show a marked improvement in sample efficiency as compared to traditional estimators.
arXiv Detail & Related papers (2022-02-23T06:11:49Z)
- Trustworthy Multimodal Regression with Mixture of Normal-inverse Gamma Distributions [91.63716984911278]
We introduce a novel Mixture of Normal-Inverse Gamma distributions (MoNIG) algorithm that efficiently estimates uncertainty, adaptively integrates different modalities, and produces trustworthy regression results (a hedged sketch of NIG fusion follows this entry).
Experimental results on both synthetic and real-world data demonstrate the effectiveness and trustworthiness of our method on various multimodal regression tasks.
arXiv Detail & Related papers (2021-11-11T14:28:12Z)
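A hedged sketch of what fusing Normal-Inverse-Gamma parameter sets can look like, together with the standard NIG uncertainty decomposition. The fusion constants below follow an NIG-summation-style rule reconstructed from memory and may differ from the exact operator MoNIG defines; treat them as assumptions.

```python
def nig_summation(p, q):
    """Fuse two NIG(gamma, nu, alpha, beta) parameter sets into one.
    Assumption: an NIG-summation-style rule; the exact operator used
    by MoNIG may differ."""
    g1, n1, a1, b1 = p
    g2, n2, a2, b2 = q
    g = (n1 * g1 + n2 * g2) / (n1 + n2)   # precision-weighted mean
    return (g,
            n1 + n2,
            a1 + a2 + 0.5,
            b1 + b2 + 0.5 * n1 * (g1 - g) ** 2 + 0.5 * n2 * (g2 - g) ** 2)


def nig_uncertainty(params):
    """Standard NIG predictive quantities: mean, aleatoric E[sigma^2],
    and epistemic Var[mu]."""
    g, n, a, b = params
    return g, b / (a - 1), b / (n * (a - 1))


# Two "modalities" predicting the same target with different confidence;
# the fused mean is pulled toward the more confident one (larger nu).
fused = nig_summation((1.0, 10.0, 3.0, 2.0), (1.4, 2.0, 3.0, 2.0))
print(nig_uncertainty(fused))
```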
- Probabilistic Kolmogorov-Arnold Network [1.4732811715354455]
The present paper proposes a method for estimating probability distributions of the outputs in the case of aleatoric uncertainty.
The suggested approach covers input-dependent probability distributions of the outputs, as well as the variation of the distribution type with the inputs.
Although the method is applicable to any regression model, the present paper combines it with KANs, since the specific structure of KANs leads to computationally efficient model construction.
arXiv Detail & Related papers (2021-04-04T23:49:15Z)
- Distributional Random Forests: Heterogeneity Adjustment and Multivariate Distributional Regression [0.8574682463936005]
We propose a novel forest construction for multivariate responses based on their joint conditional distribution.
The code is available as the drf package for both Python and R.
arXiv Detail & Related papers (2020-05-29T09:05:00Z)
- Trees, forests, and impurity-based variable importance [0.0]
We analyze one of the two well-known random forest variable importances, the Mean Decrease Impurity (MDI).
We prove that if input variables are independent and interactions are absent, MDI provides a variance decomposition of the output.
Our analysis shows that there may be benefits to using a forest compared to a single tree (a small numerical illustration follows this entry).
arXiv Detail & Related papers (2020-01-13T14:38:53Z)
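A small numerical illustration of that claim: scikit-learn's feature_importances_ implements (normalized) MDI, and with independent inputs and an additive model the importances should roughly match each variable's share of the explained output variance. The toy data and hyperparameters are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Independent inputs, additive model, no interactions:
# Y = 2*X0 + X1 + noise, with Xi ~ Uniform(0, 1).
# Variance contributions: 4/12 for X0, 1/12 for X1, 0 for X2,
# so normalized MDI should approach [0.8, 0.2, 0.0].
rng = np.random.default_rng(0)
X = rng.uniform(size=(5000, 3))
y = 2 * X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=5000)

rf = RandomForestRegressor(n_estimators=200, min_samples_leaf=5,
                           random_state=0).fit(X, y)
print(rf.feature_importances_)  # MDI, normalized to sum to 1
```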