Addressing Dynamic and Sparse Qualitative Data: A Hilbert Space
Embedding of Categorical Variables
- URL: http://arxiv.org/abs/2308.11781v1
- Date: Tue, 22 Aug 2023 20:40:31 GMT
- Title: Addressing Dynamic and Sparse Qualitative Data: A Hilbert Space
Embedding of Categorical Variables
- Authors: Anirban Mukherjee and Hannah H. Chang
- Abstract summary: We propose a novel framework for incorporating qualitative data into quantitative models for causal estimation.
We use functional analysis to create a more nuanced and flexible framework.
We validate our model through comprehensive simulation evidence and demonstrate its relevance in a real-world study.
- Score: 0.26107298043931204
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a novel framework for incorporating qualitative data into
quantitative models for causal estimation. Previous methods use categorical
variables derived from qualitative data to build quantitative models. However,
this approach can lead to data-sparse categories and yield inconsistent
(asymptotically biased) and imprecise (finite sample biased) estimates if the
qualitative information is dynamic and intricate. We use functional analysis to
create a more nuanced and flexible framework. We embed the observed categories
into a latent Baire space and introduce a continuous linear map -- a Hilbert
space embedding -- from the Baire space of categories to a Reproducing Kernel
Hilbert Space (RKHS) of representation functions. Through the Riesz
representation theorem, we establish that the canonical treatment of
categorical variables in causal models can be transformed into an identified
structure in the RKHS. Transfer learning acts as a catalyst to streamline
estimation -- embeddings from traditional models are paired with the kernel
trick to form the Hilbert space embedding. We validate our model through
comprehensive simulation evidence and demonstrate its relevance in a real-world
study that contrasts theoretical predictions from economics and psychology in
an e-commerce marketplace. The results confirm the superior performance of our
model, particularly in scenarios where qualitative information is nuanced and
complex.
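The abstract's idea of pairing category embeddings with the kernel trick can be illustrated with a small sketch. It is not the authors' estimator: it swaps in plain kernel ridge regression with a Gaussian kernel, and the names `rbf_kernel` and `fit_category_effects`, the toy embeddings, and the values of `gamma` and `lam` are all assumptions made for illustration. It shows how a category with no observations can borrow strength from a nearby category in embedding space, which one-hot dummy variables cannot do.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_category_effects(embeddings, cat_idx, y, lam=1.0, gamma=1.0):
    """Kernel ridge regression of outcomes on category embeddings.

    Returns one fitted effect per category, including categories that
    never appear in cat_idx.
    """
    X = embeddings[cat_idx]                       # embedding of each observation's category
    K = rbf_kernel(X, X, gamma)                   # Gram matrix over observations
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    return rbf_kernel(embeddings, X, gamma) @ alpha

# Hypothetical embeddings, e.g. taken from a pretrained model (assumption).
# Categories 0 and 1 are close together, as are categories 2 and 3.
embeddings = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 3.0]])

# Category 1 has a single observation and category 3 has none.
cat_idx = np.array([0, 0, 0, 0, 1, 2, 2, 2, 2])
y = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0])

effects = fit_category_effects(embeddings, cat_idx, y)
print(effects)
```

In this toy setup the fitted effect for the unobserved category 3 lands close to that of its neighbour, category 2, whereas a dummy-variable model would leave it unidentified. The ridge penalty `lam` shrinks all effects toward zero, so the fitted values understate the raw category means.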
Related papers
- GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z)
- CAVIAR: Categorical-Variable Embeddings for Accurate and Robust Inference [0.2209921757303168]
Social science research often hinges on the relationship between categorical variables and outcomes.
We introduce CAVIAR, a novel method for embedding categorical variables that assume values in a high-dimensional ambient space but are sampled from an underlying manifold.
arXiv Detail & Related papers (2024-04-07T14:47:07Z)
- Directed Cyclic Graph for Causal Discovery from Multivariate Functional Data [15.26007975367927]
We introduce a functional linear structural equation model for causal structure learning.
To enhance interpretability, our model involves a low-dimensional causal embedded space.
We prove that the proposed model is causally identifiable under standard assumptions.
arXiv Detail & Related papers (2023-10-31T15:19:24Z)
- Discovering Interpretable Physical Models using Symbolic Regression and Discrete Exterior Calculus [55.2480439325792]
We propose a framework that combines Symbolic Regression (SR) and Discrete Exterior Calculus (DEC) for the automated discovery of physical models.
DEC provides building blocks for the discrete analogue of field theories, which are beyond the state-of-the-art applications of SR to physical problems.
We prove the effectiveness of our methodology by re-discovering three models of Continuum Physics from synthetic experimental data.
arXiv Detail & Related papers (2023-10-10T13:23:05Z)
- Kalman Filter for Online Classification of Non-Stationary Data [101.26838049872651]
In Online Continual Learning (OCL) a learning system receives a stream of data and sequentially performs prediction and training steps.
We introduce a probabilistic Bayesian online learning model by using a neural representation and a state space model over the linear predictor weights.
In experiments in multi-class classification we demonstrate the predictive ability of the model and its flexibility to capture non-stationarity.
arXiv Detail & Related papers (2023-06-14T11:41:42Z)
- Representer Point Selection for Explaining Regularized High-dimensional Models [105.75758452952357]
We introduce a class of sample-based explanations we term high-dimensional representers.
Our workhorse is a novel representer theorem for general regularized high-dimensional models.
We study the empirical performance of our proposed methods on three real-world binary classification datasets and two recommender system datasets.
arXiv Detail & Related papers (2023-05-31T16:23:58Z)
- On the Influence of Enforcing Model Identifiability on Learning dynamics of Gaussian Mixture Models [14.759688428864159]
We propose a technique for extracting submodels from singular models.
Our method enforces model identifiability during training.
We show how the method can be applied to more complex models like deep neural networks.
arXiv Detail & Related papers (2022-06-17T07:50:22Z)
- Linear Discriminant Analysis with High-dimensional Mixed Variables [10.774094462083843]
This paper develops a novel approach for classifying high-dimensional observations with mixed variables.
We overcome the challenge of having to split data into exponentially many cells.
Results on the estimation accuracy and the misclassification rates are established.
arXiv Detail & Related papers (2021-12-14T03:57:56Z)
- Modeling Massive Spatial Datasets Using a Conjugate Bayesian Linear Regression Framework [0.0]
A variety of scalable spatial process models have been proposed that can be easily embedded within a hierarchical modeling framework.
This article discusses how point-referenced spatial process models can be cast as a conjugate Bayesian linear regression that can rapidly deliver inference on spatial processes.
arXiv Detail & Related papers (2021-09-09T17:46:00Z)
- Improving the Reconstruction of Disentangled Representation Learners via Multi-Stage Modeling [54.94763543386523]
Current autoencoder-based disentangled representation learning methods achieve disentanglement by penalizing the (aggregate) posterior to encourage statistical independence of the latent factors.
We present a novel multi-stage modeling approach where the disentangled factors are first learned using a penalty-based disentangled representation learning method.
Then, the low-quality reconstruction is improved with another deep generative model that is trained to model the missing correlated latent variables.
arXiv Detail & Related papers (2020-10-25T18:51:15Z)
- Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Empirical optimization is central to modern machine learning, but its role in the success of machine learning is still unclear.
We show that heavy-tailed behavior commonly arises in the parameters from discrete multiplicative noise due to variance.
A detailed analysis is conducted in which we describe how key factors, including step size and data, all exhibit similar results on state-of-the-art neural network models.
arXiv Detail & Related papers (2020-06-11T09:58:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.