Conditional Feature Importance for Mixed Data
- URL: http://arxiv.org/abs/2210.03047v3
- Date: Tue, 2 May 2023 08:41:03 GMT
- Title: Conditional Feature Importance for Mixed Data
- Authors: Kristin Blesch, David S. Watson, Marvin N. Wright
- Abstract summary: We develop a conditional predictive impact (CPI) framework with knockoff sampling.
We show that our proposed workflow controls type I error, achieves high power and is in line with results given by other conditional FI measures.
Our findings highlight the necessity of developing statistically adequate, specialized methods for mixed data.
- Score: 1.6114012813668934
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the popularity of feature importance (FI) measures in interpretable
machine learning, the statistical adequacy of these methods is rarely
discussed. From a statistical perspective, a major distinction is between
analyzing a variable's importance before and after adjusting for covariates -
i.e., between $\textit{marginal}$ and $\textit{conditional}$ measures. Our work
draws attention to this rarely acknowledged, yet crucial distinction and
showcases its implications. Further, we reveal that for testing conditional FI,
only a few methods are available and practitioners have hitherto been severely
restricted in method application due to mismatched data requirements. Most
real-world data exhibits complex feature dependencies and incorporates both
continuous and categorical data (mixed data). Both properties are oftentimes
neglected by conditional FI measures. To fill this gap, we propose to combine
the conditional predictive impact (CPI) framework with sequential knockoff
sampling. The CPI enables conditional FI measurement that controls for any
feature dependencies by sampling valid knockoffs - hence, generating synthetic
data with similar statistical properties - for the data to be analyzed.
Sequential knockoffs were deliberately designed to handle mixed data and thus
allow us to extend the CPI approach to such datasets. We demonstrate through
numerous simulations and a real-world example that our proposed workflow
controls type I error, achieves high power and is in line with results given by
other conditional FI measures, whereas marginal FI metrics result in misleading
interpretations. Our findings highlight the necessity of developing
statistically adequate, specialized methods for mixed data.
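The CPI idea sketched in the abstract, measuring how much predictive loss rises when a feature is swapped for a knockoff that preserves its dependence on the other features, can be illustrated with a minimal numpy toy. The crude residual-permutation knockoff below is only a stand-in for the sequential knockoffs the paper actually uses, and all names and data are illustrative:

```python
import numpy as np

def cpi(coefs, X, y, j, rng):
    """Conditional predictive impact of feature j (rough sketch):
    compare the model's loss on the original data with its loss after
    replacing column j by a 'knockoff' that mimics the feature's
    conditional distribution given the remaining columns."""
    others = [k for k in range(X.shape[1]) if k != j]
    # Crude knockoff: regress X[:, j] on the other columns and add
    # permuted residuals -- a stand-in for proper sequential knockoffs.
    A = np.column_stack([X[:, others], np.ones(len(X))])
    beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ beta
    X_ko = X.copy()
    X_ko[:, j] = A @ beta + rng.permutation(resid)

    def mse(Z):
        pred = np.column_stack([Z, np.ones(len(Z))]) @ coefs
        return float(np.mean((y - pred) ** 2))

    return mse(X_ko) - mse(X)  # > 0 suggests conditional importance

rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)  # strongly correlated with x1
y = 2.0 * x1 + 0.5 * rng.normal(size=n)   # only x1 carries signal
X = np.column_stack([x1, x2])

# Fit an ordinary least-squares model with an intercept.
coefs, *_ = np.linalg.lstsq(np.column_stack([X, np.ones(n)]), y, rcond=None)

scores = [cpi(coefs, X, y, j, rng) for j in range(2)]
```

Because the knockoff of x2 preserves everything the model needs, its score stays near zero, while swapping out x1 destroys genuine signal and inflates the loss; a marginal permutation test would instead flag both correlated features.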
Related papers
- Geometry-Aware Instrumental Variable Regression [56.16884466478886]
We propose a transport-based IV estimator that takes into account the geometry of the data manifold through data-derivative information.
We provide a simple plug-and-play implementation of our method that performs on par with related estimators in standard settings.
arXiv Detail & Related papers (2024-05-19T17:49:33Z)
- Benchmark Transparency: Measuring the Impact of Data on Evaluation [6.307485015636125]
We propose an automated framework that measures the data point distribution across 6 different dimensions.
We use disproportional stratified sampling to measure how much the data distribution affects absolute (Acc/F1) and relative (Rank) model performance.
We find that the impact of the data is statistically significant and is often larger than the impact of changing the metric.
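The disproportional stratified sampling step this entry mentions, re-drawing an evaluation set so stratum proportions differ from the natural data distribution, can be sketched as follows; the function, strata labels, and weights are my own toy illustration, not the paper's protocol:

```python
import numpy as np

def disproportional_stratified_sample(strata, weights, size, rng):
    """Sketch: draw an evaluation subsample whose stratum proportions follow
    `weights` instead of the data's natural distribution, so metrics such as
    Acc/F1 can be recomputed under a shifted distribution."""
    strata = np.asarray(strata)
    idx = []
    for label, w in weights.items():
        pool = np.flatnonzero(strata == label)
        k = int(round(w * size))  # allocate by weight, not by stratum size
        idx.append(rng.choice(pool, size=k, replace=True))
    return np.concatenate(idx)

rng = np.random.default_rng(0)
strata = np.array(["easy"] * 90 + ["hard"] * 10)  # natural split: 90/10
sample = disproportional_stratified_sample(
    strata, {"easy": 0.5, "hard": 0.5}, 100, rng
)
counts = {s: int(np.sum(strata[sample] == s)) for s in ("easy", "hard")}
# counts == {"easy": 50, "hard": 50}: the subsample is rebalanced 50/50
```

Evaluating a model on such rebalanced subsamples shows how sensitive its scores and rankings are to the evaluation data's composition.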
arXiv Detail & Related papers (2024-03-31T17:33:43Z)
- DAGnosis: Localized Identification of Data Inconsistencies using Structures [73.39285449012255]
Identification and appropriate handling of inconsistencies in data at deployment time is crucial to reliably use machine learning models.
We use directed acyclic graphs (DAGs) to encode the training set's features probability distribution and independencies as a structure.
Our method, called DAGnosis, leverages these structural interactions to bring valuable and insightful data-centric conclusions.
arXiv Detail & Related papers (2024-02-26T11:29:16Z)
- On the Performance of Empirical Risk Minimization with Smoothed Data [59.3428024282545]
We show that Empirical Risk Minimization (ERM) is able to achieve sublinear error whenever a class is learnable with iid data.
arXiv Detail & Related papers (2024-02-22T21:55:41Z)
- Federated Causal Discovery from Heterogeneous Data [70.31070224690399]
We propose a novel FCD method attempting to accommodate arbitrary causal models and heterogeneous data.
Our approach constructs summary statistics as a proxy for the raw data to protect data privacy.
We conduct extensive experiments on synthetic and real datasets to show the efficacy of our method.
arXiv Detail & Related papers (2024-02-20T18:53:53Z)
- Perturbation-based Effect Measures for Compositional Data [3.9543275888781224]
Existing effect measures for compositional features are inadequate for many modern applications.
We propose a framework based on hypothetical data perturbations that addresses both issues.
We show how average perturbation effects can be estimated efficiently by deriving a perturbation-dependent reparametrization.
arXiv Detail & Related papers (2023-11-30T12:27:15Z)
- Differentially Private Linear Regression with Linked Data [3.9325957466009203]
Differential privacy, a mathematical notion from computer science, is a rising tool offering robust privacy guarantees.
Recent work focuses on developing differentially private versions of individual statistical and machine learning tasks.
We present two differentially private algorithms for linear regression with linked data.
arXiv Detail & Related papers (2023-08-01T21:00:19Z)
- Simultaneous Improvement of ML Model Fairness and Performance by Identifying Bias in Data [1.76179873429447]
We propose a data preprocessing technique that can detect instances ascribing a specific kind of bias that should be removed from the dataset before training.
In particular, we claim that in problem settings where instances exist with similar features but different labels caused by variation in protected attributes, an inherent bias gets induced in the dataset.
arXiv Detail & Related papers (2022-10-24T13:04:07Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold.
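The ATC rule just described can be sketched in a few lines of numpy; the function name and the toy numbers below are mine, and the rule shown (threshold chosen so the above-threshold fraction on source matches source accuracy) is a simplified reading of the method:

```python
import numpy as np

def atc_predict_accuracy(src_conf, src_correct, tgt_conf):
    """ATC sketch: choose threshold t so that the fraction of labeled
    *source* examples with confidence above t equals source accuracy,
    then predict *target* accuracy as the fraction of unlabeled target
    examples whose confidence exceeds t."""
    src_acc = np.mean(src_correct)
    t = np.quantile(src_conf, 1.0 - src_acc)  # (1 - acc)-quantile
    return float(np.mean(tgt_conf > t))

# Toy numbers: source accuracy is 0.5, so t is the median confidence (0.75).
est = atc_predict_accuracy(
    np.array([0.9, 0.8, 0.7, 0.6]),  # source confidences
    np.array([1, 1, 0, 0]),          # source correctness (1 = correct)
    np.array([0.9, 0.5]),            # unlabeled target confidences
)
# est == 0.5: one of the two target examples clears the threshold
```

No target labels are needed, which is what makes the estimator usable under the source/target mismatch the entry describes.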
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Disentanglement and Generalization Under Correlation Shifts [22.499106910581958]
Correlations between factors of variation are prevalent in real-world data.
Machine learning algorithms may benefit from exploiting such correlations, as they can increase predictive performance on noisy data.
We aim to learn representations which capture different factors of variation in latent subspaces.
arXiv Detail & Related papers (2021-12-29T18:55:17Z)
- Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.