A Statistical View of Column Subset Selection
- URL: http://arxiv.org/abs/2307.12892v1
- Date: Mon, 24 Jul 2023 15:42:33 GMT
- Title: A Statistical View of Column Subset Selection
- Authors: Anav Sood and Trevor Hastie
- Abstract summary: We consider the problem of selecting a small subset of representative variables from a large dataset.
We show how to efficiently (1) perform CSS using only summary statistics from the original dataset; (2) perform CSS in the presence of missing and/or censored data; and (3) select the subset size for CSS in a hypothesis testing framework.
- Score: 91.3755431537592
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of selecting a small subset of representative
variables from a large dataset. In the computer science literature, this
dimensionality reduction problem is typically formalized as Column Subset
Selection (CSS). Meanwhile, the typical statistical formalization is to find an
information-maximizing set of Principal Variables. This paper shows that these
two approaches are equivalent, and moreover, both can be viewed as maximum
likelihood estimation within a certain semi-parametric model. Using these
connections, we show how to efficiently (1) perform CSS using only summary
statistics from the original dataset; (2) perform CSS in the presence of
missing and/or censored data; and (3) select the subset size for CSS in a
hypothesis testing framework.
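Point (1) of the abstract works because the CSS reconstruction objective depends on the data only through its covariance. Below is a minimal greedy sketch of that idea (our own illustration, not the authors' implementation): it selects columns using only the Gram/covariance matrix, never the raw data.

```python
import numpy as np

def greedy_css(Sigma, k):
    """Greedy Column Subset Selection using only the Gram/covariance
    matrix Sigma = X.T @ X (a summary statistic), never the raw data X.

    Each step picks the column that explains the most remaining
    variance, then deflates Sigma by projecting that column out.
    """
    selected = []
    R = Sigma.astype(float).copy()  # residual covariance
    for _ in range(k):
        d = np.diag(R).copy()
        d[d < 1e-12] = np.inf  # guard: fully-explained columns score 0
        scores = (R ** 2).sum(axis=0) / d  # variance explained by each column
        j = int(np.argmax(scores))
        selected.append(j)
        # rank-one deflation: project column j out of the residual
        R = R - np.outer(R[:, j], R[j, :]) / R[j, j]
    return selected
```

For example, if two columns are identical, the deflation step zeroes out the duplicate's residual variance after the first is chosen, so the duplicate is never selected twice.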
Related papers
- Computational-Statistical Gaps in Gaussian Single-Index Models [77.1473134227844]
Single-Index Models are high-dimensional regression problems with planted structure.
We show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) frameworks, necessarily require $\Omega(d^{k^\star/2})$ samples.
arXiv Detail & Related papers (2024-03-08T18:50:19Z) - Revisiting the Dataset Bias Problem from a Statistical Perspective [72.94990819287551]
We study the "dataset bias" problem from a statistical standpoint.
We identify the main cause of the problem as the strong correlation between a class attribute $u$ and a non-class attribute $b$.
We propose to mitigate dataset bias by either weighting the objective of each sample $n$ by $\frac{1}{p(u_n|b_n)}$ or sampling that sample with probability proportional to $\frac{1}{p(u_n|b_n)}$.
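Estimated from empirical counts, the proposed per-sample weight is the inverse of the conditional frequency of the class attribute given the bias attribute. A hypothetical sketch of that estimate (names are ours, not the paper's code):

```python
from collections import Counter

def debias_weights(u, b):
    """Weight each sample n by 1 / p_hat(u_n | b_n), with the
    conditional estimated by empirical counts:
    p_hat(u | b) = count(u, b) / count(b)."""
    joint = Counter(zip(u, b))   # counts of (class, bias) pairs
    marg_b = Counter(b)          # marginal counts of the bias attribute
    return [marg_b[bn] / joint[(un, bn)] for un, bn in zip(u, b)]
```

Under-represented (class, bias) combinations receive larger weights, so the reweighted objective is less dominated by the spurious correlation.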
arXiv Detail & Related papers (2024-02-05T22:58:06Z) - Variance Alignment Score: A Simple But Tough-to-Beat Data Selection
Method for Multimodal Contrastive Learning [17.40655778450583]
We propose a principled metric named Variance Alignment Score (VAS), which has the form $\langle \Sigma_{\text{test}}, \Sigma_i \rangle$.
We show that applying VAS and CLIP scores together can outperform baselines by a margin of $1.3\%$ on 38 evaluation sets for the noisy dataset DataComp and $2.5\%$ on VTAB for the high-quality dataset CC12M.
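Reading $\Sigma_i$ as the rank-one covariance $x x^\top$ of a single sample's embedding $x$, the inner product reduces to the quadratic form $x^\top \Sigma_{\text{test}} x$. A sketch under that assumption (our naming, not the paper's code):

```python
import numpy as np

def variance_alignment_score(x, Sigma_test):
    """VAS for one sample: <Sigma_test, x x^T>_F = x^T Sigma_test x,
    taking Sigma_i to be the rank-one covariance of embedding x."""
    x = np.asarray(x, dtype=float)
    return float(x @ Sigma_test @ x)
```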
arXiv Detail & Related papers (2024-02-03T06:29:04Z) - Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs).
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias".
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
arXiv Detail & Related papers (2023-09-07T17:44:56Z) - Tutorial: a priori estimation of sample size, effect size, and
statistical power for cluster analysis, latent class analysis, and
multivariate mixture models [0.0]
This tutorial provides a roadmap to determining sample size and effect size for analyses that identify subgroups.
I introduce a procedure that allows researchers to formalise their expectations about effect sizes in their domain of choice.
Next, I outline how to establish the minimum sample size in subgroup analyses.
arXiv Detail & Related papers (2023-09-02T08:48:00Z) - Subsampling Suffices for Adaptive Data Analysis [8.231050911072755]
Most classical techniques assume that the dataset is independent of the analyst's query and break down in the common setting where a dataset is reused for multiple, adaptively chosen queries.
We identify a remarkably simple set of assumptions under which the queries will continue to be representative even when chosen adaptively.
The simplicity of this subsampling-based framework allows it to model a variety of real-world scenarios not covered by prior work.
arXiv Detail & Related papers (2023-02-17T02:47:54Z) - Test Set Sizing Via Random Matrix Theory [91.3755431537592]
This paper uses techniques from Random Matrix Theory to find the ideal training-testing data split for a simple linear regression.
It defines "ideal" as satisfying the integrity metric, i.e., the empirical model error equals the actual measurement noise.
This paper is the first to solve for the training and test size for any model in a way that is truly optimal.
arXiv Detail & Related papers (2021-12-11T13:18:33Z) - Auto-weighted Multi-view Feature Selection with Graph Optimization [90.26124046530319]
We propose a novel unsupervised multi-view feature selection model based on graph learning.
The contributions are threefold: (1) during the feature selection procedure, the consensus similarity graph shared by different views is learned.
Experiments on various datasets demonstrate the superiority of the proposed method compared with the state-of-the-art methods.
arXiv Detail & Related papers (2021-04-11T03:25:25Z) - Selection of Summary Statistics for Network Model Choice with
Approximate Bayesian Computation [1.8884278918443564]
We study the utility of cost-based filter selection methods to account for different summary costs during the selection process.
Our findings show that computationally inexpensive summary statistics can be efficiently selected with minimal impact on classification accuracy.
arXiv Detail & Related papers (2021-01-19T18:21:06Z) - Self-Representation Based Unsupervised Exemplar Selection in a Union of
Subspaces [27.22427926657327]
We present a new exemplar selection model that searches for a subset that best reconstructs all data points as measured by the $\ell_1$ norm of the representation coefficients.
When the dataset is drawn from a union of independent subspaces, our method is able to select sufficiently many representatives from each subspace.
We also develop an exemplar based subspace clustering method that is robust to imbalanced data and efficient for large scale data.
arXiv Detail & Related papers (2020-06-07T19:43:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.