Stacked SVD or SVD stacked? A Random Matrix Theory perspective on data integration
- URL: http://arxiv.org/abs/2507.22170v1
- Date: Tue, 29 Jul 2025 19:03:01 GMT
- Title: Stacked SVD or SVD stacked? A Random Matrix Theory perspective on data integration
- Authors: Tavor Z. Baharav, Phillip B. Nicol, Rafael A. Irizarry, Rong Ma
- Abstract summary: Modern data analysis increasingly requires identifying shared latent structure across multiple high-dimensional datasets. Two primary methods have emerged for estimating this shared structure, which vary in how they integrate information across datasets. We derive exact expressions for the performance and phase transitions of these two methods and develop optimal weighting schemes to further improve both methods.
- Score: 7.304283080560899
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern data analysis increasingly requires identifying shared latent structure across multiple high-dimensional datasets. A commonly used model assumes that the data matrices are noisy observations of low-rank matrices with a shared singular subspace. In this case, two primary methods have emerged for estimating this shared structure, which vary in how they integrate information across datasets. The first approach, termed Stack-SVD, concatenates all the datasets, and then performs a singular value decomposition (SVD). The second approach, termed SVD-Stack, first performs an SVD separately for each dataset, then aggregates the top singular vectors across these datasets, and finally computes a consensus amongst them. While these methods are widely used, they have not been rigorously studied in the proportional asymptotic regime, which is of great practical relevance in today's world of increasing data size and dimensionality. This lack of theoretical understanding has led to uncertainty about which method to choose and limited the ability to fully exploit their potential. To address these challenges, we derive exact expressions for the asymptotic performance and phase transitions of these two methods and develop optimal weighting schemes to further improve both methods. Our analysis reveals that while neither method uniformly dominates the other in the unweighted case, optimally weighted Stack-SVD dominates optimally weighted SVD-Stack. We extend our analysis to accommodate multiple shared components, and provide practical algorithms for estimating optimal weights from data, offering theoretical guidance for method selection in practical data integration problems. Extensive numerical simulations and semi-synthetic experiments on genomic data corroborate our theoretical findings.
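The two pipelines are simple enough to state directly in code. Below is a minimal numpy sketch of both estimators as the abstract describes them, using plain aggregation for the consensus step; the paper's optimal weighting schemes are not reproduced here.

```python
import numpy as np

def stack_svd(Xs, r=1):
    """Stack-SVD: concatenate the datasets along rows, then one SVD.
    Xs: list of (n_i x p) matrices sharing a right singular subspace.
    Returns the top-r right singular vectors (p x r)."""
    X = np.vstack(Xs)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:r].T

def svd_stack(Xs, r=1):
    """SVD-Stack: per-dataset SVDs first, then a consensus among the
    collected top singular vectors (here: their dominant subspace,
    which is insensitive to per-dataset sign flips)."""
    V = np.column_stack([np.linalg.svd(X, full_matrices=False)[2][:r].T
                         for X in Xs])          # p x (r * num_datasets)
    U, _, _ = np.linalg.svd(V, full_matrices=False)
    return U[:, :r]

# Toy check: two noisy views of a rank-1 signal with a shared v.
rng = np.random.default_rng(0)
p = 200
v = rng.standard_normal(p)
v /= np.linalg.norm(v)
Xs = [3.0 * rng.standard_normal((n, 1)) * v[None, :]
      + rng.standard_normal((n, p)) for n in (150, 300)]
for name, est in [("Stack-SVD", stack_svd(Xs)), ("SVD-Stack", svd_stack(Xs))]:
    print(name, "alignment |<v_hat, v>| =", round(abs(float(est[:, 0] @ v)), 3))
```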
Related papers
- Optimal Estimation of Shared Singular Subspaces across Multiple Noisy Matrices [3.3373545585860596]
This study focuses on estimating shared (left) singular subspaces across multiple matrices within a low-rank matrix denoising framework.
We establish that Stack-SVD achieves minimax rate-optimality when the true singular subspaces of the signal matrices are identical.
For various cases of partial sharing, we rigorously characterize the conditions under which Stack-SVD remains effective, achieves minimax optimality, or fails to deliver consistent estimates.
arXiv Detail & Related papers (2024-11-26T02:49:30Z)
- Entropic Optimal Transport Eigenmaps for Nonlinear Alignment and Joint Embedding of High-Dimensional Datasets [11.105392318582677]
We propose a principled approach, with theoretical guarantees, for aligning and jointly embedding a pair of datasets.
Our approach leverages the leading singular vectors of the entropic optimal transport (EOT) plan matrix between two datasets to extract their shared underlying structure.
We show that in a high-dimensional regime, the EOT plan recovers the shared manifold structure by approximating a kernel function evaluated at the locations of the latent variables.
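A minimal sketch of the pipeline this summary describes: compute an entropic OT plan between two point clouds with standard Sinkhorn iterations, then read joint embeddings off the plan's leading singular vectors. The toy data, the regularization level, and the choice to skip the trivial top singular pair are our illustrative assumptions, not the paper's prescriptions.

```python
import numpy as np

def sinkhorn_plan(X, Y, eps=0.1, n_iter=300):
    """Entropic OT plan between the empirical measures on the rows of X
    and Y (uniform marginals), via Sinkhorn iterations on the Gibbs kernel."""
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared distances
    K = np.exp(-C / eps)
    n, m = X.shape[0], Y.shape[0]
    a, b = np.ones(n) / n, np.ones(m) / m
    v = np.ones(m)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]                   # n x m transport plan

# Two noisy views of the same latent circle; embed them jointly.
rng = np.random.default_rng(1)
t = rng.uniform(0, 2 * np.pi, 300)
X = np.c_[np.cos(t), np.sin(t)] + 0.05 * rng.standard_normal((300, 2))
Y = np.c_[np.cos(t), np.sin(t)] + 0.05 * rng.standard_normal((300, 2))
P = sinkhorn_plan(X, Y)
U, s, Vt = np.linalg.svd(P)
emb_X, emb_Y = U[:, 1:3], Vt[1:3].T   # skip the trivial top pair
print("top singular values of the plan:", np.round(s[:4], 4))
```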
arXiv Detail & Related papers (2024-07-01T18:48:55Z) - Robust SVD Made Easy: A fast and reliable algorithm for large-scale data
analysis [0.0]
Existing robust SVD algorithms often sacrifice speed for robustness or fail in the presence of only a few outliers.
This study introduces an efficient algorithm, called Spherically Normalized SVD, for robust SVD approximation.
The proposed algorithm achieves remarkable speed by utilizing only two applications of a standard reduced-rank SVD algorithm.
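The summary gives only the outline, so the sketch below is a hypothetical reading rather than the paper's algorithm: one reduced-rank SVD of the row-normalized (spherically projected) matrix to get an outlier-resistant subspace, then a second small SVD to restore the scale.

```python
import numpy as np

def spherical_svd(X, r, eps=1e-12):
    """Hypothetical two-SVD scheme: (1) SVD of the row-normalized matrix
    for a robust right singular subspace, (2) SVD of the data projected
    onto that subspace to recover singular values and left factors."""
    norms = np.maximum(np.linalg.norm(X, axis=1, keepdims=True), eps)
    Xs = X / norms                          # rows on the unit sphere
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    V = Vt[:r].T                            # robust right subspace, p x r
    U, s, Wt = np.linalg.svd(X @ V, full_matrices=False)
    return U, s, V @ Wt.T                   # rank-r factors of X

# Rank-2 signal plus a few gross outliers.
rng = np.random.default_rng(2)
X = rng.standard_normal((500, 2)) @ rng.standard_normal((2, 50))
X += 0.1 * rng.standard_normal(X.shape)
X[:5] += 100.0 * rng.standard_normal((5, 50))   # heavy outliers
U, s, V = spherical_svd(X, r=2)
print("recovered singular values:", np.round(s, 2))
```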
arXiv Detail & Related papers (2024-02-15T07:08:11Z)
- Synergistic eigenanalysis of covariance and Hessian matrices for enhanced binary classification [72.77513633290056]
We present a novel approach that combines the eigenanalysis of a covariance matrix evaluated on a training set with that of a Hessian matrix evaluated on a deep learning model.
Our method captures intricate patterns and relationships, enhancing classification performance.
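As a stand-in for the deep-learning setting (our simplification, not the paper's construction), the sketch below uses logistic regression, whose Hessian has the closed form $X^T D X$ with $D = \mathrm{diag}(p(1-p))$, and concatenates leading eigenvectors of the data covariance and of that Hessian into one projection.

```python
import numpy as np

def top_eigvecs(M, k):
    w, V = np.linalg.eigh(M)                 # eigenvalues in ascending order
    return V[:, -k:]

rng = np.random.default_rng(3)
n, d = 400, 20
X = rng.standard_normal((n, d))
y = (X @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n) > 0).astype(float)

# Covariance eigenvectors of the training set.
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / n

# Hessian of the logistic loss at a rough linear fit (closed form for a
# linear model; the paper evaluates a Hessian for a deep model instead).
w = np.linalg.lstsq(X, 2 * y - 1, rcond=None)[0]
p = 1.0 / (1.0 + np.exp(-X @ w))
H = X.T @ (X * (p * (1 - p))[:, None]) / n

# Combine leading eigen-directions from both matrices into one projection.
P = np.column_stack([top_eigvecs(C, 2), top_eigvecs(H, 2)])
Z = X @ P                                    # 4-D combined features
print("projected feature shape:", Z.shape)
```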
arXiv Detail & Related papers (2024-02-14T16:10:42Z)
- Joint Distributional Learning via Cramer-Wold Distance [0.7614628596146602]
We introduce the Cramer-Wold distance regularization, which can be computed in closed form, to facilitate joint distributional learning for high-dimensional datasets.
We also introduce a two-step learning method to enable flexible prior modeling and improve the alignment between the aggregated posterior and the prior distribution.
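The closed form itself is not restated in the summary, so the sketch below only illustrates the underlying Cramer-Wold idea by Monte Carlo: average, over random directions, a smoothed squared-$L_2$ distance between the 1-D projections of the two samples. Treat the kernel and smoothing level as our assumptions, not the paper's formula.

```python
import numpy as np

def cw_distance(X, Y, n_dirs=128, gamma=0.5, rng=None):
    """Monte-Carlo Cramer-Wold distance: average, over random unit
    directions, of the squared L2 distance between Gaussian-smoothed
    1-D projections of the two samples (illustrative estimator only)."""
    rng = rng or np.random.default_rng(0)
    thetas = rng.standard_normal((n_dirs, X.shape[1]))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    Px, Py = X @ thetas.T, Y @ thetas.T       # (n_samples, n_dirs)

    def avg_kernel(A, B):
        diff = A[:, None, :] - B[None, :, :]  # pairwise, per direction
        return (np.exp(-diff ** 2 / (4 * gamma))
                / np.sqrt(4 * np.pi * gamma)).mean()

    return avg_kernel(Px, Px) + avg_kernel(Py, Py) - 2 * avg_kernel(Px, Py)

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 10))
Y = rng.standard_normal((100, 10)) + 0.5      # shifted distribution
print("CW(X, X):", round(cw_distance(X, X), 6))
print("CW(X, Y):", round(cw_distance(X, Y), 6))
```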
arXiv Detail & Related papers (2023-10-25T05:24:23Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
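A toy rendering of distribution matching, with several liberties on our part (a fixed random-ReLU featurizer instead of a trained network, a single class, plain gradient descent): learn a small synthetic set whose mean feature embedding matches that of the real data.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, m, n_syn = 1000, 16, 64, 10
T = rng.standard_normal((n, d)) + 2.0            # "real" data (one class)
W = rng.standard_normal((m, d)) / np.sqrt(d)     # fixed random featurizer

S = rng.standard_normal((n_syn, d))              # synthetic set to learn
mu_T = np.maximum(T @ W.T, 0.0).mean(axis=0)     # target mean embedding
lr = 0.1
for step in range(301):
    H = S @ W.T
    F = np.maximum(H, 0.0)                       # random ReLU features
    diff = F.mean(axis=0) - mu_T                 # embedding-mean mismatch
    # Gradient of ||diff||^2 w.r.t. S: backprop through ReLU and the mean.
    G = ((H > 0) * diff[None, :]) @ W * (2.0 / n_syn)
    S -= lr * G
    if step % 100 == 0:
        print("step", step, "matching loss:", round(float(diff @ diff), 6))
```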
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
- Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
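Leverage scores are the squared row norms of an orthonormal basis for the column space, and randomization enters by sketching before factorizing. The sketch below shows the exact computation and a simplified Gaussian-sketch estimate; it assumes full column rank, whereas the paper handles arbitrary rank via rank-revealing factorizations.

```python
import numpy as np

def leverage_scores_exact(A):
    """Row leverage scores: squared row norms of an orthonormal basis
    for the column space."""
    Q, _ = np.linalg.qr(A)
    return (Q ** 2).sum(axis=1)

def leverage_scores_sketched(A, k, rng=None):
    """Randomized estimate for tall, full-column-rank A: sketch the rows
    with a Gaussian map, QR-factor the small sketch, and read scores off
    A @ inv(R) (a simplified version of the sketch-then-factor idea)."""
    rng = rng or np.random.default_rng(0)
    S = rng.standard_normal((k, A.shape[0])) / np.sqrt(k)
    _, R = np.linalg.qr(S @ A)                   # R is d x d
    B = np.linalg.solve(R.T, A.T).T              # B = A @ inv(R)
    return (B ** 2).sum(axis=1)

rng = np.random.default_rng(6)
A = rng.standard_normal((2000, 20))
A[0] *= 50.0                                     # plant one high-leverage row
exact = leverage_scores_exact(A)
approx = leverage_scores_sketched(A, k=200, rng=rng)
print("max relative error:", float(np.max(np.abs(approx - exact) / exact)))
```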
arXiv Detail & Related papers (2021-05-23T19:21:55Z)
- Why Approximate Matrix Square Root Outperforms Accurate SVD in Global Covariance Pooling? [59.820507600960745]
We propose a new global covariance pooling (GCP) meta-layer that uses SVD in the forward pass, and Padé approximants in the backward propagation to compute the gradients.
The proposed meta-layer has been integrated into different CNN models and achieves state-of-the-art performances on both large-scale and fine-grained datasets.
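The backdrop here is that global covariance pooling needs a differentiable matrix square root, and iterative approximations tend to beat exact SVD on GPUs. The classic example is the Newton-Schulz iteration sketched below; note the paper itself pairs an SVD forward pass with a Padé-approximant backward pass, which this sketch does not implement.

```python
import numpy as np

def sqrtm_newton_schulz(A, n_iter=20):
    """Approximate square root of an SPD matrix via the coupled
    Newton-Schulz iteration: only matrix multiplies, hence fast on GPUs
    and easy to backpropagate through."""
    d = A.shape[0]
    norm = np.linalg.norm(A)                 # Frobenius norm pre-scaling
    Y, Z, I = A / norm, np.eye(d), np.eye(d)
    for _ in range(n_iter):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y, Z = Y @ T, T @ Z
    return Y * np.sqrt(norm)                 # undo the pre-scaling

rng = np.random.default_rng(7)
B = rng.standard_normal((50, 50))
A = B @ B.T / 50.0 + 0.5 * np.eye(50)        # well-conditioned SPD matrix
S = sqrtm_newton_schulz(A)
print("relative residual ||S@S - A|| / ||A|| =",
      float(np.linalg.norm(S @ S - A) / np.linalg.norm(A)))
```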
arXiv Detail & Related papers (2021-05-06T08:03:45Z)
- Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method by combining reconstruction error with $l_{2,p}$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
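For orientation, the objective in question combines a projection-reconstruction error with a row-wise $l_{2,p}$ penalty that drives whole rows of the loading matrix, i.e. whole features, to zero. The sketch below only evaluates that objective at a plain PCA solution; the paper's optimization algorithm is not reproduced.

```python
import numpy as np

def sparse_pca_objective(X, W, lam, p=0.5):
    """Reconstruction error of projecting X onto span(W), plus an
    l_{2,p} penalty on the rows of W; with p < 1 the penalty zeroes out
    whole rows (features), which is the selection effect."""
    recon = np.linalg.norm(X - X @ W @ W.T, "fro") ** 2
    row_norms = np.linalg.norm(W, axis=1)
    return recon + lam * (row_norms ** p).sum()

rng = np.random.default_rng(8)
X = rng.standard_normal((100, 30))
X -= X.mean(axis=0)
W = np.linalg.svd(X, full_matrices=False)[2][:5].T   # dense PCA loadings
print("objective at the PCA solution:",
      round(sparse_pca_objective(X, W, lam=1.0), 2))
```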
arXiv Detail & Related papers (2020-12-29T04:08:38Z)
- Two-Dimensional Semi-Nonnegative Matrix Factorization for Clustering [50.43424130281065]
We propose a new Semi-Nonnegative Matrix Factorization method for 2-dimensional (2D) data, named TS-NMF.
It avoids a drawback of existing methods, which destroy the spatial structure of the data by converting 2D samples to vectors in a preprocessing step.
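TS-NMF itself operates on 2D samples directly, which a short sketch cannot capture; for orientation, below are the classical semi-NMF multiplicative updates (Ding et al.) that the "semi" refers to: $X \approx FG^T$ with $G \ge 0$ and $F$ unconstrained.

```python
import numpy as np

def semi_nmf(X, r, n_iter=200, rng=None):
    """Standard semi-NMF: X ~ F @ G.T with G >= 0 and F free (this is the
    classical vector version, not the paper's 2D extension)."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    G = np.abs(rng.standard_normal((d, r)))
    pos = lambda A: (np.abs(A) + A) / 2
    neg = lambda A: (np.abs(A) - A) / 2
    for _ in range(n_iter):
        F = X @ G @ np.linalg.pinv(G.T @ G)       # unconstrained factor
        XtF, FtF = X.T @ F, F.T @ F
        num = pos(XtF) + G @ neg(FtF)
        den = neg(XtF) + G @ pos(FtF) + 1e-12
        G *= np.sqrt(num / den)                   # multiplicative update
    return F, G

rng = np.random.default_rng(9)
X = rng.standard_normal((80, 40))                 # mixed-sign data
F, G = semi_nmf(X, r=4)
print("relative error:", np.linalg.norm(X - F @ G.T) / np.linalg.norm(X))
```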
arXiv Detail & Related papers (2020-05-19T05:54:14Z)
- Distributed Bayesian Matrix Decomposition for Big Data Mining and Clustering [13.491022200305824]
We propose a distributed matrix decomposition model for big data mining and clustering.
Specifically, we adopt three strategies to implement the distributed computing: 1) accelerated gradient descent, 2) the alternating direction method of multipliers (ADMM), and 3) statistical inference.
Our algorithms scale up well to big data and achieve superior or competitive performance compared to other distributed methods.
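Of the three strategies, ADMM is the easiest to sketch in isolation. Below is generic consensus ADMM for a shared least-squares variable, with workers holding data shards and a coordinator averaging: a schematic of the distribution pattern, not the paper's Bayesian decomposition model.

```python
import numpy as np

def consensus_admm(As, bs, rho=1.0, n_iter=100):
    """Consensus ADMM: each worker solves a local ridge-regularized
    least-squares subproblem; the coordinator averages and updates duals."""
    d = As[0].shape[1]
    k = len(As)
    us = [np.zeros(d) for _ in range(k)]
    z = np.zeros(d)
    # Pre-factor each worker's local system once.
    solves = [np.linalg.inv(A.T @ A + rho * np.eye(d)) for A in As]
    rhs0 = [A.T @ b for A, b in zip(As, bs)]
    for _ in range(n_iter):
        xs = [S @ (r + rho * (z - u)) for S, r, u in zip(solves, rhs0, us)]
        z = np.mean([x + u for x, u in zip(xs, us)], axis=0)   # consensus
        us = [u + x - z for u, x in zip(us, xs)]               # dual ascent
    return z

rng = np.random.default_rng(10)
x_true = rng.standard_normal(8)
As = [rng.standard_normal((100, 8)) for _ in range(4)]         # 4 shards
bs = [A @ x_true + 0.01 * rng.standard_normal(100) for A in As]
z = consensus_admm(As, bs)
print("consensus error:", float(np.linalg.norm(z - x_true)))
```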
arXiv Detail & Related papers (2020-02-10T13:10:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.