CAVIAR: Categorical-Variable Embeddings for Accurate and Robust Inference
- URL: http://arxiv.org/abs/2404.04979v2
- Date: Thu, 11 Apr 2024 16:11:33 GMT
- Title: CAVIAR: Categorical-Variable Embeddings for Accurate and Robust Inference
- Authors: Anirban Mukherjee, Hannah Hanwen Chang
- Abstract summary: Social science research often hinges on the relationship between categorical variables and outcomes.
We introduce CAVIAR, a novel method for embedding categorical variables that assume values in a high-dimensional ambient space but are sampled from an underlying manifold.
- Score: 0.2209921757303168
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Social science research often hinges on the relationship between categorical variables and outcomes. We introduce CAVIAR, a novel method for embedding categorical variables that assume values in a high-dimensional ambient space but are sampled from an underlying manifold. Our theoretical and numerical analyses outline challenges posed by such categorical variables in causal inference. Specifically, dynamically varying and sparse levels can lead to violations of the Donsker conditions and a failure of the estimation functionals to converge to a tight Gaussian process. Traditional approaches, including the exclusion of rare categorical levels and principled variable selection models like LASSO, fall short. CAVIAR embeds the data into a lower-dimensional global coordinate system. The mapping can be derived from both structured and unstructured data, and ensures stable and robust estimates through dimensionality reduction. In a dataset of direct-to-consumer apparel sales, we illustrate how high-dimensional categorical variables, such as zip codes, can be succinctly represented, facilitating inference and analysis.
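As a rough illustration of the idea (not the paper's actual mapping, which the abstract does not specify), high-dimensional per-level features can be reduced to a low-dimensional global coordinate system and those coordinates used as dense regressors in place of sparse dummies. The sketch below uses synthetic data and a plain SVD as a generic stand-in for the embedding step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 500 observations over 60 sparse categorical levels
# (stand-ins for zip codes); each level carries high-dimensional side
# features (e.g. demographics) that lie near a low-dimensional manifold.
n_obs, n_levels, ambient_dim, k = 500, 60, 40, 3
latent = rng.normal(size=(n_levels, k))                   # manifold coordinates
side_features = latent @ rng.normal(size=(k, ambient_dim))
levels = rng.integers(0, n_levels, size=n_obs)

# Reduce the per-level features to k global coordinates via SVD
# (a generic stand-in for CAVIAR's mapping).
centered = side_features - side_features.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
embedding = U[:, :k] * S[:k]                              # shape (n_levels, k)

# Each observation now gets k dense regressors instead of n_levels dummies.
X = embedding[levels]
print(X.shape)  # (500, 3)
```

With 60 levels collapsed to 3 coordinates, downstream estimators avoid the sparse, dynamically varying dummy columns the abstract identifies as the source of instability.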
Related papers
- Reducing the dimensionality and granularity in hierarchical categorical variables [2.089191490381739]
We propose a methodology to obtain a reduced representation of a hierarchical categorical variable.
We show how entity embedding can be applied in a hierarchical setting.
We apply our methodology on a real dataset and find that the reduced hierarchy is an improvement over the original hierarchical structure.
arXiv Detail & Related papers (2024-03-06T11:09:36Z)
- Variable Importance in High-Dimensional Settings Requires Grouping [19.095605415846187]
Conditional Permutation Importance (CPI) bypasses PI's limitations in such cases.
Grouping variables statistically via clustering or some prior knowledge gains some power back.
We show that the approach extended with stacking controls the type-I error even with highly-correlated groups.
arXiv Detail & Related papers (2023-12-18T00:21:47Z)
- Non-parametric Conditional Independence Testing for Mixed Continuous-Categorical Variables: A Novel Method and Numerical Evaluation [14.993705256147189]
Conditional independence testing (CIT) is a common task in machine learning.
Many real-world applications involve mixed-type datasets that include numerical and categorical variables.
We propose a variation of the former approach that does not treat categorical variables as numeric.
arXiv Detail & Related papers (2023-10-17T10:29:23Z)
- Addressing Dynamic and Sparse Qualitative Data: A Hilbert Space Embedding of Categorical Variables [0.26107298043931204]
We propose a novel framework for incorporating qualitative data into quantitative models for causal estimation.
We use functional analysis to create a more nuanced and flexible framework.
We validate our model through comprehensive simulation evidence and demonstrate its relevance in a real-world study.
arXiv Detail & Related papers (2023-08-22T20:40:31Z)
- Variational Classification [51.2541371924591]
We derive a variational objective to train the model, analogous to the evidence lower bound (ELBO) used to train variational auto-encoders.
Treating inputs to the softmax layer as samples of a latent variable, our abstracted perspective reveals a potential inconsistency.
We induce a chosen latent distribution, instead of the implicit assumption found in a standard softmax layer.
arXiv Detail & Related papers (2023-05-17T17:47:19Z)
- Predicting Out-of-Domain Generalization with Neighborhood Invariance [59.05399533508682]
We propose a measure of a classifier's output invariance in a local transformation neighborhood.
Our measure is simple to calculate, does not depend on the test point's true label, and can be applied even in out-of-domain (OOD) settings.
In experiments on benchmarks in image classification, sentiment analysis, and natural language inference, we demonstrate a strong and robust correlation between our measure and actual OOD generalization.
arXiv Detail & Related papers (2022-07-05T14:55:16Z)
- ER: Equivariance Regularizer for Knowledge Graph Completion [107.51609402963072]
We propose a new regularizer, namely, the Equivariance Regularizer (ER).
ER can enhance the generalization ability of the model by employing the semantic equivariance between the head and tail entities.
The experimental results indicate a clear and substantial improvement over the state-of-the-art relation prediction methods.
arXiv Detail & Related papers (2022-06-24T08:18:05Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show that combining recent results on equivariant representation learning on structured spaces with a simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
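A minimal sketch of one such strategy, random oversampling, on synthetic data (the function and setup are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy imbalanced dataset: 95 majority-class and 5 minority-class examples.
X = rng.normal(size=(100, 2))
y = np.array([0] * 95 + [1] * 5)

def random_oversample(X, y, rng):
    """Duplicate minority-class rows (sampled with replacement) until
    every class has as many examples as the largest one."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for c in classes:
        c_idx = np.flatnonzero(y == c)
        if len(c_idx) < target:
            extra = rng.choice(c_idx, size=target - len(c_idx), replace=True)
            c_idx = np.concatenate([c_idx, extra])
        parts.append(c_idx)
    idx = np.concatenate(parts)
    return X[idx], y[idx]

X_bal, y_bal = random_oversample(X, y, rng)
print(np.bincount(y_bal))  # [95 95]
```

Undersampling works in the mirror-image way, discarding majority-class rows down to the minority count; which strategy is preferable is exactly the dataset-dependent question this paper studies.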
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Linear Discriminant Analysis with High-dimensional Mixed Variables [10.774094462083843]
This paper develops a novel approach for classifying high-dimensional observations with mixed variables.
We overcome the challenge of having to split data into exponentially many cells.
Results on the estimation accuracy and the misclassification rates are established.
arXiv Detail & Related papers (2021-12-14T03:57:56Z)
- Post-mortem on a deep learning contest: a Simpson's paradox and the complementary roles of scale metrics versus shape metrics [61.49826776409194]
We analyze a corpus of models made publicly-available for a contest to predict the generalization accuracy of neural network (NN) models.
We identify what amounts to a Simpson's paradox: "scale" metrics perform well overall but poorly on sub-partitions of the data.
We present two novel shape metrics, one data-independent, and the other data-dependent, which can predict trends in the test accuracy of a series of NNs.
arXiv Detail & Related papers (2021-06-01T19:19:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.