ALPCAH: Subspace Learning for Sample-wise Heteroscedastic Data
- URL: http://arxiv.org/abs/2505.07272v1
- Date: Mon, 12 May 2025 06:49:47 GMT
- Title: ALPCAH: Subspace Learning for Sample-wise Heteroscedastic Data
- Authors: Javier Salazar Cavazos, Jeffrey A. Fessler, Laura Balzano
- Abstract summary: This paper develops a subspace learning method, named ALPCAH, that can estimate the sample-wise noise variances. Our method makes no distributional assumptions about the low-rank component and does not assume that the noise variances are known. Additionally, this paper develops a matrix factorized version of ALPCAH, named LR-ALPCAH, that is much faster and more memory efficient at the cost of requiring the subspace dimension to be known or estimated.
- Score: 15.812312064457867
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Principal component analysis (PCA) is a key tool in the field of data dimensionality reduction. However, some applications involve heterogeneous data that vary in quality due to noise characteristics associated with each data sample. Heteroscedastic methods aim to deal with such mixed data quality. This paper develops a subspace learning method, named ALPCAH, that can estimate the sample-wise noise variances and use this information to improve the estimate of the subspace basis associated with the low-rank structure of the data. Our method makes no distributional assumptions about the low-rank component and does not assume that the noise variances are known. Further, this method uses a soft rank constraint that does not require the subspace dimension to be known. Additionally, this paper develops a matrix factorized version of ALPCAH, named LR-ALPCAH, that is much faster and more memory efficient at the cost of requiring the subspace dimension to be known or estimated. Simulations and real data experiments show the effectiveness of accounting for data heteroscedasticity compared to existing algorithms. Code available at https://github.com/javiersc1/ALPCAH.
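As a rough illustration of the alternating idea behind the factorized variant, here is a minimal numpy sketch under a standard heteroscedastic model (downweight each sample by its estimated noise variance, then re-estimate the variances from the residuals). This is an illustrative assumption, not the authors' implementation; see the repository above for that.

```python
# Hypothetical sketch in the spirit of LR-ALPCAH, not the authors' code.
import numpy as np

def heteroscedastic_pca(Y, k, n_iters=50, eps=1e-8):
    """Y: (d, n) data matrix, one sample per column; k: subspace dimension."""
    d, n = Y.shape
    v = np.ones(n)  # per-sample noise variance estimates
    U = np.linalg.svd(Y, full_matrices=False)[0][:, :k]  # SVD warm start
    for _ in range(n_iters):
        # Weighted low-rank fit: scale column i by v_i^{-1/2} so that
        # noisy samples contribute less to the subspace estimate.
        W = Y / np.sqrt(v)
        U = np.linalg.svd(W, full_matrices=False)[0][:, :k]
        # Residual of each sample after projecting onto the subspace.
        R = Y - U @ (U.T @ Y)
        # Re-estimate each sample's noise variance from its residual power.
        v = np.maximum(np.sum(R**2, axis=0) / d, eps)
    return U, v
```

The key design point, as the abstract describes, is that clean samples dominate the basis estimate instead of being drowned out by noisy ones.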
Related papers
- ALPCAHUS: Subspace Clustering for Heteroscedastic Data [15.812312064457867]
This paper develops a heteroscedastic-focused subspace clustering method, named ALPCAHUS. It estimates the sample-wise noise variances and uses this information to improve the estimate of the subspace bases associated with the low-rank structure of the data.
arXiv Detail & Related papers (2025-05-25T00:56:08Z)
- Data value estimation on private gradients [84.966853523107]
For gradient-based machine learning (ML) methods, the de facto differential privacy technique is perturbing the gradients with random noise. Data valuation attributes the ML performance to the training data and is widely used in privacy-aware applications that require enforcing DP. We show that the answer is no with the default approach of injecting i.i.d. random noise into the gradients, because the estimation uncertainty of the data value paradoxically scales linearly with the estimation budget. We propose to instead inject carefully correlated noise to provably remove the linear scaling of the estimation uncertainty w.r.t. the budget.
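As a toy numerical illustration of the lever involved (a hypothetical construction, not the paper's): the covariance structure of the injected noise, not just its marginal variance, determines the uncertainty of aggregate statistics computed from the perturbed gradients.

```python
# Toy comparison: i.i.d. vs. equicorrelated Gaussian noise with identical
# marginal variance, and the resulting variance of an averaged statistic.
import numpy as np

m, sigma2 = 100, 1.0
a = np.ones(m) / m  # averaging weights over m noisy releases

cov_iid = sigma2 * np.eye(m)  # i.i.d.: Var(a^T z) = sigma2 / m
rho = -1.0 / (m - 1)          # most negative valid equicorrelation
cov_corr = sigma2 * ((1 - rho) * np.eye(m) + rho * np.ones((m, m)))

for name, cov in [("i.i.d.", cov_iid), ("correlated", cov_corr)]:
    print(f"{name:>10} noise: Var(mean) = {a @ cov @ a:.3e}")
```

Whether a given correlation structure also preserves the DP guarantee is exactly the technical question the paper addresses; this snippet only shows why correlation matters for estimation uncertainty.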
arXiv Detail & Related papers (2024-12-22T13:15:51Z)
- ALPCAH: Sample-wise Heteroscedastic PCA with Tail Singular Value Regularization [17.771454131646312]
Principal component analysis is a key tool in the field of data dimensionality reduction.
This paper develops a PCA method that can estimate the sample-wise noise variances.
This is done without distributional assumptions about the low-rank component and without assuming that the noise variances are known.
arXiv Detail & Related papers (2023-07-06T03:11:11Z)
- Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
- Capturing the Denoising Effect of PCA via Compression Ratio [3.967854215226183]
Principal component analysis (PCA) is one of the most fundamental tools in machine learning.
In this paper, we propose a novel metric called compression ratio to capture the effect of PCA on high-dimensional noisy data.
Building on this new metric, we design a straightforward algorithm that could be used to detect outliers.
arXiv Detail & Related papers (2022-04-22T18:43:47Z)
- The Optimal Noise in Noise-Contrastive Learning Is Not What You Think [80.07065346699005]
We show that deviating from the common assumption that the noise distribution should match the data distribution can actually lead to better statistical estimators.
In particular, the optimal noise distribution is different from the data's, and can even come from a different family.
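For context, a minimal noise-contrastive estimation sketch (illustrative assumptions: a 1-D Gaussian model with known unit variance, and noise deliberately chosen as a zero-mean Gaussian with standard deviation 3 rather than something resembling the data); the point is that the noise distribution is a free design choice.

```python
# Minimal NCE: fit the mean of N(mu, 1) by logistic classification of
# data samples against noise samples drawn from a user-chosen distribution.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=2000)  # data ~ N(2, 1)
y = rng.normal(0.0, 3.0, size=2000)  # noise ~ N(0, 3^2)

def nce_objective(mu):
    # G(u) = log p_mu(u) - log p_noise(u), the classifier's logit.
    g_x = norm.logpdf(x, mu, 1.0) - norm.logpdf(x, 0.0, 3.0)
    g_y = norm.logpdf(y, mu, 1.0) - norm.logpdf(y, 0.0, 3.0)
    # Negative log-likelihood of the data-vs-noise logistic classifier.
    return np.mean(np.logaddexp(0.0, -g_x)) + np.mean(np.logaddexp(0.0, g_y))

mu_hat = minimize_scalar(nce_objective, bounds=(-5, 5), method="bounded").x
print(f"NCE estimate of the mean: {mu_hat:.3f}")  # close to 2.0
```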
arXiv Detail & Related papers (2022-03-02T13:59:20Z)
- A Robust and Flexible EM Algorithm for Mixtures of Elliptical Distributions with Missing Data [71.9573352891936]
This paper tackles the problem of missing data imputation for noisy and non-Gaussian data.
A new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data.
Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data.
arXiv Detail & Related papers (2022-01-28T10:01:37Z)
- Stochastic tensor space feature theory with applications to robust machine learning [3.6891975755608355]
We develop a Multilevel Orthogonal Subspace (MOS) Karhunen-Loève feature theory based on tensor spaces. Our key observation is that separate machine learning classes can reside predominantly in mostly distinct subspaces. Tests on the blood plasma dataset (Alzheimer's Disease Neuroimaging Initiative) show dramatic increases in accuracy.
arXiv Detail & Related papers (2021-10-04T22:01:01Z)
- Tensor Laplacian Regularized Low-Rank Representation for Non-uniformly Distributed Data Subspace Clustering [2.578242050187029]
Low-Rank Representation (LRR) suffers from discarding the locality information of data points in subspace clustering.
We propose a hypergraph model that allows a variable number of adjacent nodes and incorporates the locality information of the data.
Experiments on artificial and real datasets demonstrate the higher accuracy and precision of the proposed method.
arXiv Detail & Related papers (2021-03-06T08:22:24Z)
- Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method by combining reconstruction error with $l_{2,p}$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
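The summary does not state the objective; one common form for this reconstruction-plus-row-sparsity family, written here as an assumption rather than the paper's exact formulation, is

$$\min_{\mathbf{W}} \;\|\mathbf{X}-\mathbf{X}\mathbf{W}\mathbf{W}^{\top}\|_F^2+\lambda\|\mathbf{W}\|_{2,p}^p,\qquad \|\mathbf{W}\|_{2,p}^p=\sum_{i}\Big(\sum_{j}w_{ij}^2\Big)^{p/2},$$

where $0 < p \le 1$ promotes row sparsity in $\mathbf{W}$, and the rows driven to zero correspond to discarded features.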
arXiv Detail & Related papers (2020-12-29T04:08:38Z)
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
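A minimal PyTorch sketch of that weighting idea (one plausible reading of the summary; the paper's exact scheme and its theory may differ):

```python
# Hypothetical ABSGD-style step: per-sample weights from a softmax over
# scaled per-sample losses, applied inside an ordinary momentum-SGD step.
import torch

def absgd_step(model, opt, xb, yb, lam=1.0):
    """opt would be e.g. torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)."""
    losses = torch.nn.functional.cross_entropy(
        model(xb), yb, reduction="none")                   # per-sample losses
    weights = torch.softmax(losses.detach() / lam, dim=0)  # batch weights
    loss = (weights * losses).sum()                        # weighted objective
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

With this reading, a positive temperature lam upweights hard (high-loss) samples, which suits the imbalance setting; flipping its sign would instead downweight them, which is presumably how the label-noise case is handled.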
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
- A Method for Handling Multi-class Imbalanced Data by Geometry based Information Sampling and Class Prioritized Synthetic Data Generation (GICaPS) [15.433936272310952]
This paper addresses the problem of handling imbalanced data in multi-label classification.
Two novel methods are proposed that exploit the geometric relationship between the feature vectors.
The efficacy of the proposed methods is analyzed by solving a generic multi-class recognition problem.
arXiv Detail & Related papers (2020-10-11T04:04:26Z)