A Probabilistic Model for Data Redundancy in the Feature Domain
- URL: http://arxiv.org/abs/2309.13657v1
- Date: Sun, 24 Sep 2023 14:51:53 GMT
- Title: A Probabilistic Model for Data Redundancy in the Feature Domain
- Authors: Ghurumuruhan Ganesan
- Abstract summary: We use a probabilistic model to estimate the number of uncorrelated features in a large dataset.
Our model allows for both pairwise feature correlation (collinearity) and interdependency of multiple features (multicollinearity).
We prove an auxiliary result regarding mutually good constrained sets that is of independent interest.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we use a probabilistic model to estimate the number of
uncorrelated features in a large dataset. Our model allows for both pairwise
feature correlation (collinearity) and interdependency of multiple features
(multicollinearity), and we use the probabilistic method to obtain upper and
lower bounds of the same order for the size of a feature set that exhibits low
collinearity and low multicollinearity. We also prove an auxiliary result
regarding mutually good constrained sets that is of independent interest.
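The distinction the abstract draws between collinearity and multicollinearity can be illustrated with a small numerical sketch. This is not the paper's probabilistic model or proof technique; it is an assumed toy example in which synthetic features `a`, `b`, `c` are independent and a hypothetical fourth feature `d` is nearly their scaled sum. The helper names `max_abs_pairwise_corr` and `r_squared` are introduced here for illustration only.

```python
import numpy as np

def max_abs_pairwise_corr(X):
    """Largest absolute Pearson correlation between two distinct columns of X."""
    C = np.corrcoef(X, rowvar=False)
    np.fill_diagonal(C, 0.0)  # ignore each column's correlation with itself
    return float(np.max(np.abs(C)))

def r_squared(X, j):
    """R^2 obtained by regressing column j on all other columns plus an intercept."""
    y = X[:, j]
    A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(1.0 - resid.var() / y.var())

# Assumed synthetic data: three independent features and one that depends
# on all three jointly (plus a little noise).
rng = np.random.default_rng(0)
n = 2000
a, b, c = rng.normal(size=(3, n))
d = (a + b + c) / np.sqrt(3) + 0.05 * rng.normal(size=n)
X = np.column_stack([a, b, c, d])

pairwise = max_abs_pairwise_corr(X)  # moderate: no single pair is strongly correlated
joint = r_squared(X, 3)              # close to 1: d is almost determined by a, b, c together
print(f"max |pairwise correlation| = {pairwise:.2f}")
print(f"R^2 of d on (a, b, c)      = {joint:.4f}")
```

In this toy setting no pair of features is strongly correlated, yet one feature is almost fully explained by the others taken together, which is why a feature set can pass a pairwise (collinearity) check and still fail a multicollinearity check.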
Related papers
- Scalable Regularised Joint Mixture Models [2.0686407686198263]
In many applications, data can be heterogeneous in the sense of spanning latent groups with different underlying distributions.
We propose an approach for heterogeneous data that allows joint learning of (i) explicit multivariate feature distributions, (ii) high-dimensional regression models and (iii) latent group labels.
The approach is demonstrably effective in high dimensions, combining data reduction for computational efficiency with a re-weighting scheme that retains key signals even when the number of features is large.
arXiv Detail & Related papers (2022-05-03T13:38:58Z)
- Learning from few examples with nonlinear feature maps [68.8204255655161]
We explore the phenomenon and reveal key relationships between the dimensionality of an AI model's feature space, the non-degeneracy of data distributions, and the model's generalisation capabilities.
The main thrust of our present analysis is on the influence of nonlinear feature transformations mapping original data into higher- and possibly infinite-dimensional spaces on the resulting model's generalisation capabilities.
arXiv Detail & Related papers (2022-03-31T10:36:50Z) - PSD Representations for Effective Probability Models [117.35298398434628]
We show that a recently proposed class of positive semi-definite (PSD) models for non-negative functions is particularly suited to this end.
We characterize both approximation and generalization capabilities of PSD models, showing that they enjoy strong theoretical guarantees.
Our results open the way to applications of PSD models to density estimation, decision theory and inference.
arXiv Detail & Related papers (2021-06-30T15:13:39Z) - Top-$k$ Regularization for Supervised Feature Selection [11.927046591097623]
We introduce a novel, simple yet effective regularization approach, named top-$k$ regularization, to supervised feature selection.
We show that the top-$k$ regularization is effective and stable for supervised feature selection.
arXiv Detail & Related papers (2021-06-04T01:12:47Z) - Information-theoretic Feature Selection via Tensor Decomposition and
Submodularity [38.05393186002834]
We introduce a low-rank tensor model of the joint PMF of all variables and indirect targeting as a way of mitigating complexity and maximizing the classification performance for a given number of features.
By indirectly aiming to predict the latent variable of the naive Bayes model instead of the original target variable, it is possible to formulate the feature selection problem as maximization of a monotone submodular function subject to a cardinality constraint.
arXiv Detail & Related papers (2020-10-30T10:36:46Z) - Probabilistic Circuits for Variational Inference in Discrete Graphical
Models [101.28528515775842]
Inference in discrete graphical models with variational methods is difficult.
Many sampling-based methods have been proposed for estimating the Evidence Lower Bound (ELBO).
We propose a new approach that leverages the tractability of probabilistic circuit models, such as Sum Product Networks (SPN).
We show that selective-SPNs are suitable as an expressive variational distribution, and prove that when the log-density of the target model is a polynomial, the corresponding ELBO can be computed analytically.
arXiv Detail & Related papers (2020-10-22T05:04:38Z) - Out-of-distribution Generalization via Partial Feature Decorrelation [72.96261704851683]
We present a novel Partial Feature Decorrelation Learning (PFDL) algorithm, which jointly optimizes a feature decomposition network and the target image classification model.
The experiments on real-world datasets demonstrate that our method can improve the backbone model's accuracy on OOD image classification datasets.
arXiv Detail & Related papers (2020-07-30T05:48:48Z) - Accounting for Unobserved Confounding in Domain Generalization [107.0464488046289]
This paper investigates the problem of learning robust, generalizable prediction models from a combination of datasets.
Part of the challenge of learning robust models lies in the influence of unobserved confounders.
We demonstrate the empirical performance of our approach on healthcare data from different modalities.
arXiv Detail & Related papers (2020-07-21T08:18:06Z) - Bayesian Sparse Factor Analysis with Kernelized Observations [67.60224656603823]
Multi-view problems can be faced with latent variable models.
High-dimensionality and non-linear issues are traditionally handled by kernel methods.
We propose merging both approaches into a single model.
arXiv Detail & Related papers (2020-06-01T14:25:38Z) - Learning Ising models from one or multiple samples [26.00403702328348]
We provide guarantees for one-sample estimation, quantifying the estimation error in terms of the metric entropy of a family of interaction matrices.
Our technical approach benefits from sparsifying a model's interaction network, conditioning on subsets of variables that make the dependencies in the resulting conditional distribution sufficiently weak.
arXiv Detail & Related papers (2020-04-20T15:17:05Z)