Linear Discriminant Analysis with High-dimensional Mixed Variables
- URL: http://arxiv.org/abs/2112.07145v3
- Date: Tue, 2 Jan 2024 09:27:21 GMT
- Title: Linear Discriminant Analysis with High-dimensional Mixed Variables
- Authors: Binyan Jiang, Chenlei Leng, Cheng Wang, Zhongqing Yang, Xinyang Yu
- Abstract summary: This paper develops a novel approach for classifying high-dimensional observations with mixed variables.
We overcome the challenge of having to split data into exponentially many cells.
Results on the estimation accuracy and the misclassification rates are established.
- Score: 10.774094462083843
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Datasets containing both categorical and continuous variables are frequently
encountered in many areas, and with the rapid development of modern measurement
technologies, the dimensions of these variables can be very high. Despite the
recent progress made in modelling high-dimensional data for continuous
variables, there is a scarcity of methods that can deal with a mixed set of
variables. To fill this gap, this paper develops a novel approach for
classifying high-dimensional observations with mixed variables. Our framework
builds on a location model, in which the distributions of the continuous
variables conditional on categorical ones are assumed Gaussian. We overcome the
challenge of having to split data into exponentially many cells, or
combinations of the categorical variables, by kernel smoothing, and provide new
perspectives for its bandwidth choice to ensure an analogue of Bochner's Lemma,
which is different to the usual bias-variance tradeoff. We show that the two
sets of parameters in our model can be separately estimated and provide
penalized likelihood for their estimation. Results on the estimation accuracy
and the misclassification rates are established, and the competitive
performance of the proposed classifier is illustrated by extensive simulation
and real data studies.
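To make the kernel-smoothing idea concrete, here is a minimal sketch under stated assumptions; it is not the estimator from the paper. It assumes binary categorical variables and a hypothetical weight `lam**d`, where `d` is the Hamming distance between an observation's categorical vector and the target cell, and `lam` in [0, 1] plays the role of a bandwidth. The function names and the simulated data are illustrative only.

```python
import numpy as np

def hamming_kernel_weights(U, u0, lam):
    """Weight each row of U by lam**d, d = Hamming distance to the cell u0.

    Assumed kernel for illustration; the paper's kernel may differ.
    """
    d = np.sum(U != u0, axis=1)
    return np.power(float(lam), d)

def smoothed_cell_mean(X, U, u0, lam):
    """Kernel-weighted mean of the continuous variables X for cell u0.

    lam = 0 uses only observations falling exactly in the cell (and is
    undefined if the cell is empty); lam = 1 pools all observations equally.
    """
    w = hamming_kernel_weights(U, u0, lam)
    return (w[:, None] * X).sum(axis=0) / w.sum()

# Simulated data (hypothetical): q binary variables give 2**q cells,
# so most cells hold few or no observations when q is large.
rng = np.random.default_rng(0)
n, q, p = 200, 8, 3
U = rng.integers(0, 2, size=(n, q))          # categorical part
X = rng.normal(size=(n, p)) + U[:, :1]       # continuous part, mean shifts with U[:, 0]

u0 = np.ones(q, dtype=int)                   # target cell
mu_hat = smoothed_cell_mean(X, U, u0, lam=0.3)
```

Borrowing strength from neighbouring cells in this way avoids estimating a separate mean from each of the 2**q cells, which is the splitting problem the abstract refers to. How the bandwidth should be chosen is the subject of the paper and is not reproduced here.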
Related papers
- CAVIAR: Categorical-Variable Embeddings for Accurate and Robust Inference [0.2209921757303168]
Social science research often hinges on the relationship between categorical variables and outcomes.
We introduce CAVIAR, a novel method for embedding categorical variables that assume values in a high-dimensional ambient space but are sampled from an underlying manifold.
arXiv Detail & Related papers (2024-04-07T14:47:07Z)
- Joint Distributional Learning via Cramer-Wold Distance [0.7614628596146602]
We introduce the Cramer-Wold distance regularization, which can be computed in closed form, to facilitate joint distributional learning for high-dimensional datasets.
We also introduce a two-step learning method to enable flexible prior modeling and improve the alignment between the aggregated posterior and the prior distribution.
arXiv Detail & Related papers (2023-10-25T05:24:23Z)
- Conformal inference for regression on Riemannian Manifolds [49.7719149179179]
We investigate prediction sets for regression scenarios when the response variable, denoted by $Y$, resides in a manifold, and the covariable, denoted by $X$, lies in Euclidean space.
We prove the almost sure convergence of the empirical version of these regions on the manifold to their population counterparts.
arXiv Detail & Related papers (2023-10-12T10:56:25Z)
- Heterogeneous Multi-Task Gaussian Cox Processes [61.67344039414193]
We present a novel extension of multi-task Gaussian Cox processes for modeling heterogeneous correlated tasks jointly.
A MOGP prior over the parameters of the dedicated likelihoods for classification, regression and point process tasks can facilitate sharing of information between heterogeneous tasks.
We derive a mean-field approximation to realize closed-form iterative updates for estimating model parameters.
arXiv Detail & Related papers (2023-08-29T15:01:01Z)
- Towards Better Certified Segmentation via Diffusion Models [62.21617614504225]
Segmentation models can be vulnerable to adversarial perturbations, which hinders their use in critical-decision systems like healthcare or autonomous driving.
Recently, randomized smoothing has been proposed to certify segmentation predictions by adding Gaussian noise to the input to obtain theoretical guarantees.
In this paper, we address the problem of certifying segmentation prediction using a combination of randomized smoothing and diffusion models.
arXiv Detail & Related papers (2023-06-16T16:30:39Z)
- High-Dimensional Undirected Graphical Models for Arbitrary Mixed Data [2.2871867623460207]
In many applications data span variables of different types, whose principled joint analysis is nontrivial.
Recent advances have shown how the binary-continuous case can be tackled, but the general mixed variable type regime remains challenging.
We propose flexible and scalable methodology for data with variables of entirely general mixed type.
arXiv Detail & Related papers (2022-11-21T18:21:31Z)
- A Graphical Model for Fusing Diverse Microbiome Data [2.385985842958366]
We introduce a flexible multinomial-Gaussian generative model for jointly modeling such count data.
We present a computationally scalable variational Expectation-Maximization (EM) algorithm for inferring the latent variables and the parameters of the model.
arXiv Detail & Related papers (2022-08-21T17:54:39Z)
- On the Strong Correlation Between Model Invariance and Generalization [54.812786542023325]
Generalization captures a model's ability to classify unseen data.
Invariance measures consistency of model predictions on transformations of the data.
From a dataset-centric view, we find a certain model's accuracy and invariance linearly correlated on different test sets.
arXiv Detail & Related papers (2022-07-14T17:08:25Z)
- Learning Exponential Family Graphical Models with Latent Variables using Regularized Conditional Likelihood [10.21814909876358]
We present a new convex relaxation framework based on regularized conditional likelihood for latent-variable graphical modeling.
We demonstrate the utility and flexibility of our framework via a series of numerical experiments on synthetic as well as real data.
arXiv Detail & Related papers (2020-10-19T11:16:26Z)
- Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays [62.997667081978825]
Generalized Linear Latent Variable models (GLLVMs) generalize such factor models to non-Gaussian responses.
Current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets.
We propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood.
arXiv Detail & Related papers (2020-10-06T04:28:19Z)
- Asymptotic Analysis of an Ensemble of Randomly Projected Linear Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.