A Statistical View of Column Subset Selection
- URL: http://arxiv.org/abs/2307.12892v2
- Date: Mon, 21 Oct 2024 02:07:11 GMT
- Title: A Statistical View of Column Subset Selection
- Authors: Anav Sood, Trevor Hastie,
- Abstract summary: We consider the problem of selecting a small subset of representative variables from a large dataset.
We show how to efficiently (1) perform CSS using only summary statistics from the original dataset; (2) perform CSS in the presence of missing and/or censored data; and (3) select the subset size for CSS in a hypothesis testing framework.
- Score: 47.65143789184956
- License:
- Abstract: We consider the problem of selecting a small subset of representative variables from a large dataset. In the computer science literature, this dimensionality reduction problem is typically formalized as Column Subset Selection (CSS). Meanwhile, the typical statistical formalization is to find an information-maximizing set of Principal Variables. This paper shows that these two approaches are equivalent, and moreover, both can be viewed as maximum likelihood estimation within a certain semi-parametric model. Within this model, we establish suitable conditions under which the CSS estimate is consistent in high dimensions, specifically in the proportional asymptotic regime where the number of variables over the sample size converges to a constant. Using these connections, we show how to efficiently (1) perform CSS using only summary statistics from the original dataset; (2) perform CSS in the presence of missing and/or censored data; and (3) select the subset size for CSS in a hypothesis testing framework.
Related papers
- Latent Semantic Consensus For Deterministic Geometric Model Fitting [109.44565542031384]
We propose an effective method called Latent Semantic Consensus (LSC)
LSC formulates the model fitting problem into two latent semantic spaces based on data points and model hypotheses.
LSC is able to provide consistent and reliable solutions within only a few milliseconds for general multi-structural model fitting.
arXiv Detail & Related papers (2024-03-11T05:35:38Z) - Computational-Statistical Gaps in Gaussian Single-Index Models [77.1473134227844]
Single-Index Models are high-dimensional regression problems with planted structure.
We show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) framework, necessarily require $Omega(dkstar/2)$ samples.
arXiv Detail & Related papers (2024-03-08T18:50:19Z) - Tutorial: a priori estimation of sample size, effect size, and
statistical power for cluster analysis, latent class analysis, and
multivariate mixture models [0.0]
This tutorial provides a roadmap to determining sample size and effect size for analyses that identify subgroups.
I introduce a procedure that allows researchers to formalise their expectations about effect sizes in their domain of choice.
Next, I outline how to establish the minimum sample size in subgroup analyses.
arXiv Detail & Related papers (2023-09-02T08:48:00Z) - Robust Statistical Comparison of Random Variables with Locally Varying
Scale of Measurement [0.562479170374811]
Spaces with locally varying scale of measurement, like multidimensional structures with differently scaled dimensions, are pretty common in statistics and machine learning.
We address this problem by considering an order based on (sets of) expectations of random variables mapping into such non-standard spaces.
This order contains dominance and expectation order as extreme cases when no, or respectively perfect, cardinal structure is given.
arXiv Detail & Related papers (2023-06-22T11:02:18Z) - Two-Stage Robust and Sparse Distributed Statistical Inference for
Large-Scale Data [18.34490939288318]
We address the problem of conducting statistical inference in settings involving large-scale data that may be high-dimensional and contaminated by outliers.
We propose a two-stage distributed and robust statistical inference procedures coping with high-dimensional models by promoting sparsity.
arXiv Detail & Related papers (2022-08-17T11:17:47Z) - Test Set Sizing Via Random Matrix Theory [91.3755431537592]
This paper uses techniques from Random Matrix Theory to find the ideal training-testing data split for a simple linear regression.
It defines "ideal" as satisfying the integrity metric, i.e. the empirical model error is the actual measurement noise.
This paper is the first to solve for the training and test size for any model in a way that is truly optimal.
arXiv Detail & Related papers (2021-12-11T13:18:33Z) - Sampling from Arbitrary Functions via PSD Models [55.41644538483948]
We take a two-step approach by first modeling the probability distribution and then sampling from that model.
We show that these models can approximate a large class of densities concisely using few evaluations, and present a simple algorithm to effectively sample from these models.
arXiv Detail & Related papers (2021-10-20T12:25:22Z) - Estimating Graph Dimension with Cross-validated Eigenvalues [5.0013150536632995]
In applied statistics, estimating the number of latent dimensions or the number of clusters is a fundamental and recurring problem.
We provide a cross-validated eigenvalues approach to this problem.
We prove that our procedure consistently estimates $k$ in scenarios where all $k$ dimensions can be estimated.
arXiv Detail & Related papers (2021-08-06T23:52:30Z) - Manifold Hypothesis in Data Analysis: Double Geometrically-Probabilistic
Approach to Manifold Dimension Estimation [92.81218653234669]
We present new approach to manifold hypothesis checking and underlying manifold dimension estimation.
Our geometrical method is a modification for sparse data of a well-known box-counting algorithm for Minkowski dimension calculation.
Experiments on real datasets show that the suggested approach based on two methods combination is powerful and effective.
arXiv Detail & Related papers (2021-07-08T15:35:54Z) - Fisher's combined probability test for high-dimensional covariance
matrices [0.0]
We propose a scale-invariant power enhancement test based on Fisher's method to combine the p-values of quadratic form statistics and maximum form statistics.
We prove that the proposed combination method retains the correct size and boosts the power against more general alternatives.
arXiv Detail & Related papers (2020-05-31T03:32:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.