A Statistical View of Column Subset Selection
- URL: http://arxiv.org/abs/2307.12892v1
- Date: Mon, 24 Jul 2023 15:42:33 GMT
- Title: A Statistical View of Column Subset Selection
- Authors: Anav Sood and Trevor Hastie
- Abstract summary: We consider the problem of selecting a small subset of representative variables from a large dataset.
We show how to efficiently (1) perform CSS using only summary statistics from the original dataset; (2) perform CSS in the presence of missing and/or censored data; and (3) select the subset size for CSS in a hypothesis testing framework.
- Score: 91.3755431537592
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of selecting a small subset of representative
variables from a large dataset. In the computer science literature, this
dimensionality reduction problem is typically formalized as Column Subset
Selection (CSS). Meanwhile, the typical statistical formalization is to find an
information-maximizing set of Principal Variables. This paper shows that these
two approaches are equivalent, and moreover, both can be viewed as maximum
likelihood estimation within a certain semi-parametric model. Using these
connections, we show how to efficiently (1) perform CSS using only summary
statistics from the original dataset; (2) perform CSS in the presence of
missing and/or censored data; and (3) select the subset size for CSS in a
hypothesis testing framework.
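Point (1) of the abstract works because the CSS reconstruction objective depends on the data only through its covariance. Below is a minimal greedy sketch of that idea (our own illustration, not the authors' implementation): it selects columns using only the Gram/covariance matrix, never the raw data.

```python
import numpy as np

def greedy_css(Sigma, k):
    """Greedy Column Subset Selection using only the Gram/covariance
    matrix Sigma = X.T @ X (a summary statistic), never the raw data X.

    Each step picks the column that explains the most remaining
    variance, then deflates Sigma by projecting that column out.
    """
    selected = []
    R = Sigma.astype(float).copy()  # residual covariance
    for _ in range(k):
        d = np.diag(R).copy()
        d[d < 1e-12] = np.inf  # guard: fully-explained columns score 0
        scores = (R ** 2).sum(axis=0) / d  # variance explained by each column
        j = int(np.argmax(scores))
        selected.append(j)
        # rank-one deflation: project column j out of the residual
        R = R - np.outer(R[:, j], R[j, :]) / R[j, j]
    return selected
```

For example, if two columns are identical, the deflation step zeroes out the duplicate's residual variance after the first is chosen, so the duplicate is never selected twice.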
Related papers
- Computational-Statistical Gaps in Gaussian Single-Index Models [77.1473134227844]
Single-Index Models are high-dimensional regression problems with planted structure.
We show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) frameworks, necessarily require $\Omega(d^{k^\star/2})$ samples.
arXiv Detail & Related papers (2024-03-08T18:50:19Z) - Revisiting the Dataset Bias Problem from a Statistical Perspective [72.94990819287551]
We study the "dataset bias" problem from a statistical standpoint.
We identify the main cause of the problem as the strong correlation between a class attribute $u$ and a non-class attribute $b$.
We propose to mitigate dataset bias by either weighting the objective of each sample $n$ by $\frac{1}{p(u_n|b_n)}$ or sampling that sample with probability proportional to $\frac{1}{p(u_n|b_n)}$.
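Estimated from empirical counts, the proposed per-sample weight is the inverse of the conditional frequency of the class attribute given the bias attribute. A hypothetical sketch of that estimate (names are ours, not the paper's code):

```python
from collections import Counter

def debias_weights(u, b):
    """Weight each sample n by 1 / p_hat(u_n | b_n), with the
    conditional estimated by empirical counts:
    p_hat(u | b) = count(u, b) / count(b)."""
    joint = Counter(zip(u, b))   # counts of (class, bias) pairs
    marg_b = Counter(b)          # marginal counts of the bias attribute
    return [marg_b[bn] / joint[(un, bn)] for un, bn in zip(u, b)]
```

Under-represented (class, bias) combinations receive larger weights, so the reweighted objective is less dominated by the spurious correlation.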
arXiv Detail & Related papers (2024-02-05T22:58:06Z) - Variance Alignment Score: A Simple But Tough-to-Beat Data Selection
Method for Multimodal Contrastive Learning [17.40655778450583]
We propose a principled metric named Variance Alignment Score (VAS), which has the form $\langle \Sigma_{\text{test}}, \Sigma_i \rangle$.
We show that applying VAS and CLIP scores together can outperform baselines by a margin of $1.3\%$ on 38 evaluation sets for the noisy dataset DataComp and $2.5\%$ on VTAB for the high-quality dataset CC12M.
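Reading $\Sigma_i$ as the rank-one covariance $x x^\top$ of a single sample's embedding $x$, the inner product reduces to the quadratic form $x^\top \Sigma_{\text{test}} x$. A sketch under that assumption (our naming, not the paper's code):

```python
import numpy as np

def variance_alignment_score(x, Sigma_test):
    """VAS for one sample: <Sigma_test, x x^T>_F = x^T Sigma_test x,
    taking Sigma_i to be the rank-one covariance of embedding x."""
    x = np.asarray(x, dtype=float)
    return float(x @ Sigma_test @ x)
```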
arXiv Detail & Related papers (2024-02-03T06:29:04Z) - Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs).
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias".
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
arXiv Detail & Related papers (2023-09-07T17:44:56Z) - Tutorial: a priori estimation of sample size, effect size, and
statistical power for cluster analysis, latent class analysis, and
multivariate mixture models [0.0]
This tutorial provides a roadmap to determining sample size and effect size for analyses that identify subgroups.
I introduce a procedure that allows researchers to formalise their expectations about effect sizes in their domain of choice.
Next, I outline how to establish the minimum sample size in subgroup analyses.
arXiv Detail & Related papers (2023-09-02T08:48:00Z) - Subsampling Suffices for Adaptive Data Analysis [8.231050911072755]
Most classical techniques assume that the dataset is independent of the analyst's query and break down in the common setting where a dataset is reused for multiple, adaptively chosen queries.
We identify a remarkably simple set of assumptions under which the queries will continue to be representative even when chosen adaptively.
The simplicity of this subsampling-based framework allows it to model a variety of real-world scenarios not covered by prior work.
arXiv Detail & Related papers (2023-02-17T02:47:54Z) - Test Set Sizing Via Random Matrix Theory [91.3755431537592]
This paper uses techniques from Random Matrix Theory to find the ideal training-testing data split for a simple linear regression.
It defines "ideal" as satisfying the integrity metric, i.e., the empirical model error equals the actual measurement noise.
This paper is the first to solve for the training and test size for any model in a way that is truly optimal.
arXiv Detail & Related papers (2021-12-11T13:18:33Z) - Auto-weighted Multi-view Feature Selection with Graph Optimization [90.26124046530319]
We propose a novel unsupervised multi-view feature selection model based on graph learning.
The contributions are threefold: (1) during the feature selection procedure, the consensus similarity graph shared by different views is learned.
Experiments on various datasets demonstrate the superiority of the proposed method compared with the state-of-the-art methods.
arXiv Detail & Related papers (2021-04-11T03:25:25Z) - Selection of Summary Statistics for Network Model Choice with
Approximate Bayesian Computation [1.8884278918443564]
We study the utility of cost-based filter selection methods to account for different summary costs during the selection process.
Our findings show that computationally inexpensive summary statistics can be efficiently selected with minimal impact on classification accuracy.
arXiv Detail & Related papers (2021-01-19T18:21:06Z) - Self-Representation Based Unsupervised Exemplar Selection in a Union of
Subspaces [27.22427926657327]
We present a new exemplar selection model that searches for a subset that best reconstructs all data points as measured by the $\ell_1$ norm of the representation coefficients.
When the dataset is drawn from a union of independent subspaces, our method is able to select sufficiently many representatives from each subspace.
We also develop an exemplar based subspace clustering method that is robust to imbalanced data and efficient for large scale data.
arXiv Detail & Related papers (2020-06-07T19:43:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.