Non-parametric Conditional Independence Testing for Mixed
Continuous-Categorical Variables: A Novel Method and Numerical Evaluation
- URL: http://arxiv.org/abs/2310.11132v2
- Date: Sun, 5 Nov 2023 10:11:28 GMT
- Title: Non-parametric Conditional Independence Testing for Mixed
Continuous-Categorical Variables: A Novel Method and Numerical Evaluation
- Authors: Oana-Iuliana Popescu, Andreas Gerhardus, Jakob Runge
- Abstract summary: Conditional independence testing (CIT) is a common task in machine learning.
Many real-world applications involve mixed-type datasets that include numerical and categorical variables.
We propose a variation of the one-hot-encoding k-NN approach that does not treat categorical variables as numeric.
- Score: 14.993705256147189
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conditional independence testing (CIT) is a common task in machine learning,
e.g., for variable selection, and a main component of constraint-based causal
discovery. While most current CIT approaches assume that all variables are
numerical or all variables are categorical, many real-world applications
involve mixed-type datasets that include numerical and categorical variables.
Non-parametric CIT can be conducted using conditional mutual information (CMI)
estimators combined with a local permutation scheme. Recently, two novel CMI
estimators for mixed-type datasets based on k-nearest-neighbors (k-NN) have
been proposed. As with any k-NN method, these estimators rely on the definition
of a distance metric. One approach computes distances by a one-hot encoding of
the categorical variables, essentially treating categorical variables as
discrete-numerical, while the other expresses CMI by entropy terms where the
categorical variables appear as conditions only. In this work, we study these
estimators and propose a variation of the former approach that does not treat
categorical variables as numeric. Our numerical experiments show that our
variant detects dependencies more robustly across different data distributions
and preprocessing types.
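The local permutation idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `local_permutation`, the choice of an infinite distance between differing categories, and drawing replacements with replacement (rather than a strict permutation) are all assumptions made here for illustration. The sketch replaces each value of X with the value of one of its k nearest neighbors in the conditioning set Z, where neighbors are only allowed within the same category.

```python
import numpy as np

def local_permutation(x, z_cont, z_cat, k=5, rng=None):
    """Hypothetical helper: resample x among k nearest neighbors in Z.

    x      : (n,) array of values to shuffle locally
    z_cont : (n, d_cont) continuous conditioning variables
    z_cat  : (n, d_cat) categorical conditioning variables
    Assumption: samples with differing categories are infinitely far apart.
    Simplification: neighbors are drawn with replacement.
    """
    rng = np.random.default_rng(rng)
    n = len(x)
    # Pairwise max-norm distance on the continuous part of Z.
    dist = np.max(np.abs(z_cont[:, None, :] - z_cont[None, :, :]), axis=2)
    # Any categorical mismatch makes two samples non-neighbors.
    mismatch = np.any(z_cat[:, None, :] != z_cat[None, :, :], axis=2)
    dist = np.where(mismatch, np.inf, dist)
    # Indices of the k closest samples (the sample itself has distance 0).
    nearest = np.argsort(dist, axis=1)[:, :k]
    x_perm = np.empty_like(x)
    for i in range(n):
        # Keep only neighbors that share the category of sample i.
        candidates = nearest[i][np.isfinite(dist[i, nearest[i]])]
        x_perm[i] = x[rng.choice(candidates)]
    return x_perm

# Toy data: X depends on Z only, so shuffling X locally in Z
# should roughly preserve the X-Z relationship.
rng = np.random.default_rng(0)
n = 200
z_cat = rng.integers(0, 2, size=(n, 1))
z_cont = rng.normal(size=(n, 1))
x = 10.0 * z_cat[:, 0] + z_cont[:, 0]
x_perm = local_permutation(x, z_cont, z_cat, k=5, rng=1)
```

In a full test, a CMI estimate on the original data would be compared against estimates on many such locally shuffled copies to obtain a p-value; the estimator itself is left out of this sketch.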
Related papers
- Semiparametric conformal prediction [79.6147286161434]
Risk-sensitive applications require well-calibrated prediction sets over multiple, potentially correlated target variables.
We treat the scores as random vectors and aim to construct the prediction set accounting for their joint correlation structure.
We report desired coverage and competitive efficiency on a range of real-world regression problems.
arXiv Detail & Related papers (2024-11-04T14:29:02Z)
- Meta-Learners for Partially-Identified Treatment Effects Across Multiple Environments [67.80453452949303]
Estimating the conditional average treatment effect (CATE) from observational data is relevant for many applications such as personalized medicine.
Here, we focus on the widespread setting where the observational data come from multiple environments.
We propose different model-agnostic learners (so-called meta-learners) to estimate the bounds that can be used in combination with arbitrary machine learning models.
arXiv Detail & Related papers (2024-06-04T16:31:43Z)
- CAVIAR: Categorical-Variable Embeddings for Accurate and Robust Inference [0.2209921757303168]
Social science research often hinges on the relationship between categorical variables and outcomes.
We introduce CAVIAR, a novel method for embedding categorical variables that assume values in a high-dimensional ambient space but are sampled from an underlying manifold.
arXiv Detail & Related papers (2024-04-07T14:47:07Z)
- Gower's similarity coefficients with automatic weight selection [0.0]
The most popular dissimilarity for mixed-type variables is derived as one minus Gower's similarity coefficient.
The discussion on the weighting schemes is sometimes misleading since it often ignores that the unweighted "standard" setting hides an unbalanced contribution of the single variables to the overall dissimilarity.
We address this drawback following the recent idea of introducing a weighting scheme that minimizes the differences in the correlation between each contributing dissimilarity and the resulting weighted Gower's dissimilarity.
arXiv Detail & Related papers (2024-01-30T14:21:56Z)
- DCID: Deep Canonical Information Decomposition [84.59396326810085]
We consider the problem of identifying the signal shared between two one-dimensional target variables.
We propose ICM, an evaluation metric which can be used in the presence of ground-truth labels.
We also propose Deep Canonical Information Decomposition (DCID) - a simple, yet effective approach for learning the shared variables.
arXiv Detail & Related papers (2023-06-27T16:59:06Z)
- Predicting Out-of-Domain Generalization with Neighborhood Invariance [59.05399533508682]
We propose a measure of a classifier's output invariance in a local transformation neighborhood.
Our measure is simple to calculate, does not depend on the test point's true label, and can be applied even in out-of-domain (OOD) settings.
In experiments on benchmarks in image classification, sentiment analysis, and natural language inference, we demonstrate a strong and robust correlation between our measure and actual OOD generalization.
arXiv Detail & Related papers (2022-07-05T14:55:16Z)
- Linear Discriminant Analysis with High-dimensional Mixed Variables [10.774094462083843]
This paper develops a novel approach for classifying high-dimensional observations with mixed variables.
We overcome the challenge of having to split data into exponentially many cells.
Results on the estimation accuracy and the misclassification rates are established.
arXiv Detail & Related papers (2021-12-14T03:57:56Z)
- MURAL: An Unsupervised Random Forest-Based Embedding for Electronic Health Record Data [59.26381272149325]
We present an unsupervised random forest for representing data with disparate variable types.
MURAL forests consist of a set of decision trees where node-splitting variables are chosen at random.
We show that using our approach, we can visualize and classify data more accurately than competing approaches.
arXiv Detail & Related papers (2021-11-19T22:02:21Z)
- CARMS: Categorical-Antithetic-REINFORCE Multi-Sample Gradient Estimator [60.799183326613395]
We propose an unbiased estimator for categorical random variables based on multiple mutually negatively correlated (jointly antithetic) samples.
CARMS combines REINFORCE with copula based sampling to avoid duplicate samples and reduce its variance, while keeping the estimator unbiased using importance sampling.
We evaluate CARMS on several benchmark datasets on a generative modeling task, as well as a structured output prediction task, and find it to outperform competing methods including a strong self-control baseline.
arXiv Detail & Related papers (2021-10-26T20:14:30Z)
- An Embedded Model Estimator for Non-Stationary Random Functions using Multiple Secondary Variables [0.0]
This paper introduces the method and shows that it has consistency results that are similar in nature to those applying to geostatistical modelling and to Quantile Random Forests.
The algorithm works by estimating a conditional distribution for the target variable at each target location.
arXiv Detail & Related papers (2020-11-09T00:14:24Z)
- $\ell_0$-based Sparse Canonical Correlation Analysis [7.073210405344709]
Canonical Correlation Analysis (CCA) models are powerful for studying the associations between two sets of variables.
Despite their success, CCA models may break if the number of variables in either of the modalities exceeds the number of samples.
Here, we propose $\ell_0$-CCA, a method for learning correlated representations based on sparse subsets of two observed modalities.
arXiv Detail & Related papers (2020-10-12T11:44:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.