Estimating Graph Dimension with Cross-validated Eigenvalues
- URL: http://arxiv.org/abs/2108.03336v1
- Date: Fri, 6 Aug 2021 23:52:30 GMT
- Title: Estimating Graph Dimension with Cross-validated Eigenvalues
- Authors: Fan Chen, Sebastien Roch, Karl Rohe, Shuqi Yu
- Abstract summary: In applied statistics, estimating the number of latent dimensions or the number of clusters is a fundamental and recurring problem.
We provide a cross-validated eigenvalues approach to this problem.
We prove that our procedure consistently estimates $k$ in scenarios where all $k$ dimensions can be estimated.
- Score: 5.0013150536632995
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In applied multivariate statistics, estimating the number of latent
dimensions or the number of clusters is a fundamental and recurring problem.
One common diagnostic is the scree plot, which shows the largest eigenvalues of
the data matrix; the user searches for a "gap" or "elbow" in the decreasing
eigenvalues; unfortunately, these patterns can hide beneath the bias of the
sample eigenvalues. This methodological problem is conceptually difficult
because, in many situations, there is only enough signal to detect a subset of
the $k$ population dimensions/eigenvectors. In this situation, one could argue
that the correct choice of $k$ is the number of detectable dimensions. We
alleviate these problems with cross-validated eigenvalues. Under a large class
of random graph models, without any parametric assumptions, we provide a
p-value for each sample eigenvector. It tests the null hypothesis that this
sample eigenvector is orthogonal to (i.e., uncorrelated with) the true latent
dimensions. This approach naturally adapts to problems where some dimensions
are not statistically detectable. In scenarios where all $k$ dimensions can be
estimated, we prove that our procedure consistently estimates $k$. In
simulations and a data example, the proposed estimator compares favorably to
alternative approaches in both computational and statistical performance.
Related papers
- Insufficient Statistics Perturbation: Stable Estimators for Private Least Squares [38.478776450327125]
We present a sample- and time-efficient differentially private algorithm for ordinary least squares.
Our near-optimal accuracy guarantee holds for any dataset, whereas prior private algorithms require error growing with the condition number or exponential running time.
arXiv Detail & Related papers (2024-04-23T18:00:38Z)
- Computational-Statistical Gaps in Gaussian Single-Index Models [77.1473134227844]
Single-Index Models are high-dimensional regression problems with planted structure.
We show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) frameworks, necessarily require $\Omega(d^{k^\star/2})$ samples.
arXiv Detail & Related papers (2024-03-08T18:50:19Z)
- Sparse PCA with Oracle Property [115.72363972222622]
We propose a family of estimators based on the semidefinite relaxation of sparse PCA with novel regularizations.
We prove that another estimator within the family achieves a sharper statistical rate of convergence than the standard semidefinite relaxation of sparse PCA.
arXiv Detail & Related papers (2023-12-28T02:52:54Z)
- Mean Estimation with User-level Privacy under Data Heterogeneity [54.07947274508013]
Different users may possess vastly different numbers of data points.
It cannot be assumed that all users sample from the same underlying distribution.
We propose a simple model of heterogeneous user data that allows user data to differ in both distribution and quantity.
arXiv Detail & Related papers (2023-07-28T23:02:39Z)
- Conformalization of Sparse Generalized Linear Models [2.1485350418225244]
The conformal prediction method estimates a confidence set for $y_{n+1}$ that is valid for any finite sample size.
Although attractive, computing such a set is computationally infeasible in most regression problems.
We show how our path-following algorithm accurately approximates conformal prediction sets.
arXiv Detail & Related papers (2023-07-11T08:36:12Z)
- Distributed Semi-Supervised Sparse Statistical Inference [6.685997976921953]
A debiased estimator is a crucial tool in statistical inference for high-dimensional model parameters.
Traditional methods require computing a debiased estimator on every machine.
An efficient multi-round distributed debiased estimator, which integrates both labeled and unlabeled data, is developed.
arXiv Detail & Related papers (2023-06-17T17:30:43Z)
- Test Set Sizing Via Random Matrix Theory [91.3755431537592]
This paper uses techniques from Random Matrix Theory to find the ideal training-testing data split for a simple linear regression.
It defines "ideal" as satisfying the integrity metric, i.e., the empirical model error equals the actual measurement noise.
This paper is the first to solve for the training and test size for any model in a way that is truly optimal.
arXiv Detail & Related papers (2021-12-11T13:18:33Z)
- Analysis of Truncated Orthogonal Iteration for Sparse Eigenvector Problems [78.95866278697777]
We propose two variants of the Truncated Orthogonal Iteration to compute multiple leading eigenvectors with sparsity constraints simultaneously.
We then apply our algorithms to solve the sparse principal component analysis problem for a wide range of test datasets.
arXiv Detail & Related papers (2021-03-24T23:11:32Z)
- Dimension-agnostic inference using cross U-statistics [33.17951971728784]
We introduce an approach that uses variational representations of existing test statistics along with sample splitting and self-normalization.
The resulting statistic can be viewed as a careful modification of degenerate U-statistics, dropping diagonal blocks and retaining off-diagonal blocks.
arXiv Detail & Related papers (2020-11-10T12:21:34Z)
- Asymptotic Analysis of an Ensemble of Randomly Projected Linear Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
- Error bounds in estimating the out-of-sample prediction error using leave-one-out cross validation in high-dimensions [19.439945058410203]
We study the problem of out-of-sample risk estimation in the high dimensional regime.
Extensive empirical evidence confirms the accuracy of leave-one-out cross validation.
One technical advantage of the theory is that it can be used to clarify and connect some results from the recent literature on scalable approximate LO.
arXiv Detail & Related papers (2020-03-03T20:07:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.