Cellwise and Casewise Robust Covariance in High Dimensions
- URL: http://arxiv.org/abs/2505.19925v1
- Date: Mon, 26 May 2025 12:46:44 GMT
- Title: Cellwise and Casewise Robust Covariance in High Dimensions
- Authors: Fabio Centofanti, Mia Hubert, Peter J. Rousseeuw
- Abstract summary: The cellRCov method simultaneously handles casewise outliers, cellwise outliers, and missing data. A simulation study demonstrates the superior performance of cellRCov in contaminated and missing data scenarios. We also construct and illustrate the cellRCCA method for robust and regularized canonical correlation analysis.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The sample covariance matrix is a cornerstone of multivariate statistics, but it is highly sensitive to outliers. These can be casewise outliers, such as cases belonging to a different population, or cellwise outliers, which are deviating cells (entries) of the data matrix. Recently some robust covariance estimators have been developed that can handle both types of outliers, but their computation is only feasible up to at most 20 dimensions. To remedy this we propose the cellRCov method, a robust covariance estimator that simultaneously handles casewise outliers, cellwise outliers, and missing data. It relies on a decomposition of the covariance on principal and orthogonal subspaces, leveraging recent work on robust PCA. It also employs a ridge-type regularization to stabilize the estimated covariance matrix. We establish some theoretical properties of cellRCov, including its casewise and cellwise influence functions as well as consistency and asymptotic normality. A simulation study demonstrates the superior performance of cellRCov in contaminated and missing data scenarios. Furthermore, its practical utility is illustrated in a real-world application to anomaly detection. We also construct and illustrate the cellRCCA method for robust and regularized canonical correlation analysis.
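The abstract sketches a three-part recipe: estimate a (robust) principal subspace, decompose the covariance into a principal part plus an orthogonal remainder, and stabilize the result with ridge-type shrinkage. The Python sketch below illustrates only that structure; a plain SVD stands in for the robust PCA step and column medians fill missing cells, so it is not the authors' cellRCov algorithm.

```python
import numpy as np

def cellrcov_sketch(X, k=2, rho=0.1):
    """Structure-only illustration: a principal-subspace part plus an
    orthogonal remainder, followed by ridge-type shrinkage. A plain SVD
    replaces the robust PCA step of the real method."""
    X = np.asarray(X, dtype=float)
    med = np.nanmedian(X, axis=0)
    X = np.where(np.isnan(X), med, X)            # crude missing-cell fill
    Xc = X - np.median(X, axis=0)                # median centering
    n, p = Xc.shape
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    lam = s ** 2 / (n - 1)                       # sample eigenvalues
    Vk = Vt[:k].T                                # principal subspace (p x k)
    P_orth = np.eye(p) - Vk @ Vk.T               # orthogonal projector
    sigma2 = lam[k:].mean() if lam[k:].size else 0.0
    S = (Vk * lam[:k]) @ Vk.T + sigma2 * P_orth  # principal + orthogonal parts
    return (1 - rho) * S + rho * (np.trace(S) / p) * np.eye(p)  # shrinkage
```

On clean data this reduces to a shrunken spiked-covariance estimate; the robustness of the actual method comes from the robust subspace and cellwise-handling steps that this sketch omits.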
Related papers
- Unlasting: Unpaired Single-Cell Multi-Perturbation Estimation by Dual Conditional Diffusion Implicit Bridges [68.98973318553983]
We propose a framework based on Dual Diffusion Implicit Bridges (DDIB) to learn the mapping between different data distributions. We integrate gene regulatory network (GRN) information to propagate perturbation signals in a biologically meaningful way. We also incorporate a masking mechanism to predict silent genes, improving the quality of generated profiles.
arXiv Detail & Related papers (2025-06-26T09:05:38Z) - Robust Multilinear Principal Component Analysis [0.0]
Multilinear Principal Component Analysis (MPCA) is an important tool for analyzing tensor data. Standard MPCA is sensitive to outliers. This paper introduces a novel robust MPCA method that can handle both types of outliers simultaneously.
arXiv Detail & Related papers (2025-03-10T13:41:03Z) - Asymptotics of Linear Regression with Linearly Dependent Data [28.005935031887038]
We study the asymptotics of linear regression in settings with non-Gaussian covariates. We show how dependencies influence estimation error and the choice of regularization parameters.
arXiv Detail & Related papers (2024-12-04T20:31:47Z) - Induced Covariance for Causal Discovery in Linear Sparse Structures [55.2480439325792]
Causal models seek to unravel the cause-effect relationships among variables from observed data.
This paper introduces a novel causal discovery algorithm designed for settings in which variables exhibit linearly sparse relationships.
arXiv Detail & Related papers (2024-10-02T04:01:38Z) - High-dimensional logistic regression with missing data: Imputation, regularization, and universality [7.167672851569787]
We study high-dimensional, ridge-regularized logistic regression.
We provide exact characterizations of both the prediction error and the estimation error.
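The pipeline analyzed here, imputation followed by ridge-regularized logistic regression, is straightforward to instantiate. Below is a generic version using scikit-learn (mean imputation, L2 penalty) on synthetic data; the paper's exact model, missingness mechanism, and asymptotic analysis are not reproduced.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n, p = 500, 50
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta))).astype(int)
X[rng.random((n, p)) < 0.2] = np.nan      # 20% of cells missing at random

# mean-impute, then fit an L2 (ridge) penalized logistic regression;
# in scikit-learn, C is the inverse of the ridge penalty strength
model = make_pipeline(SimpleImputer(strategy="mean"),
                      LogisticRegression(penalty="l2", C=1.0, max_iter=1000))
model.fit(X, y)
print("in-sample accuracy:", model.score(X, y))
```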
arXiv Detail & Related papers (2024-10-01T21:41:21Z) - Spectrum-Aware Debiasing: A Modern Inference Framework with Applications to Principal Components Regression [1.342834401139078]
We introduce Spectrum-Aware Debiasing, a novel method for high-dimensional regression.
Our approach applies to problems with structured dependence, heavy tails, and low-rank structure.
We demonstrate our method through simulated and real data experiments.
arXiv Detail & Related papers (2023-09-14T15:58:30Z) - Learning Graphical Factor Models with Riemannian Optimization [70.13748170371889]
This paper proposes a flexible algorithmic framework for graph learning under low-rank structural constraints.
The problem is expressed as penalized maximum likelihood estimation of an elliptical distribution.
We leverage geometries of positive definite matrices and positive semi-definite matrices of fixed rank that are well suited to elliptical models.
arXiv Detail & Related papers (2022-10-21T13:19:45Z) - The Cellwise Minimum Covariance Determinant Estimator [1.90365714903665]
We propose a cellwise robust version of the MCD method, called cellMCD.
It performs well in simulations with cellwise outliers, and has high finite-sample efficiency on clean data.
The method is illustrated on real data, with visualizations of the results.
arXiv Detail & Related papers (2022-07-27T12:33:51Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central in preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD and that afforded by ordinary least squares.
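The basic setting is easy to reproduce: one pass of constant-stepsize SGD over a linear-regression dataset, compared against the ordinary-least-squares solution. This toy sketch mirrors that setting only; the paper's tail-averaging and precise bias-variance conditions are not implemented.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 20
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

# one pass of constant-stepsize SGD on the squared loss
w, eta = np.zeros(d), 0.01
for xi, yi in zip(X, y):
    w -= eta * (xi @ w - yi) * xi             # stochastic gradient step

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]  # ordinary least squares
print("SGD parameter error:", np.linalg.norm(w - w_true))
print("OLS parameter error:", np.linalg.norm(w_ols - w_true))
```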
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method by combining reconstruction error with $l_{2,p}$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
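The generic form of such an objective, reconstruction error plus a row-wise $l_{2,p}$ penalty on the projection matrix, is easy to write down. The sketch below only evaluates that objective; the paper's exact formulation and its optimization algorithm are not reproduced.

```python
import numpy as np

def l2p_norm(W, p=0.5):
    """Row-wise l_{2,p} penalty: sum over rows i of ||W[i]||_2 ** p."""
    return np.sum(np.linalg.norm(W, axis=1) ** p)

def sparse_pca_objective(X, W, lam=1.0, p=0.5):
    """Reconstruction error plus l_{2,p} regularization: the generic
    shape of the objective described above (a sketch, not the paper's
    exact formulation). Rows of W driven to zero correspond to
    features that the selection step discards."""
    residual = X - X @ W @ W.T               # reconstruction residual
    return np.linalg.norm(residual, "fro") ** 2 + lam * l2p_norm(W, p)
```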
arXiv Detail & Related papers (2020-12-29T04:08:38Z) - Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z) - A Robust Test for Elliptical Symmetry [2.030567625639093]
Goodness-of-fit tests for ellipticity are usually hard to analyze, and their statistical power is often not particularly strong.
We develop a novel framework based on the exchangeable random variables calculus introduced by de Finetti.
arXiv Detail & Related papers (2020-06-05T08:51:16Z) - Covariance Estimation for Matrix-valued Data [9.739753590548796]
We propose a class of distribution-free regularized covariance estimation methods for high-dimensional matrix data.
We formulate a unified framework for estimating bandable covariance, and introduce an efficient algorithm based on rank one unconstrained Kronecker product approximation.
We demonstrate the superior finite-sample performance of our methods using simulations and real applications to a gridded temperature anomalies dataset and an S&P 500 stock dataset.
arXiv Detail & Related papers (2020-04-11T02:15:26Z)
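A standard building block for Kronecker-structured covariance estimation is the nearest Kronecker product in Frobenius norm, computed via Van Loan's rearrangement and a rank-one SVD. The sketch below implements that generic step under the assumption that the block sizes m and n are known; it is not the paper's bandable-covariance estimator.

```python
import numpy as np

def nearest_kronecker(S, m, n):
    """Best Frobenius-norm approximation S ~ kron(A, B) with A (m x m)
    and B (n x n), via Van Loan's rearrangement plus a rank-one SVD.
    A generic step, not the paper's full estimator."""
    # row (i, j) of R holds the (i, j) block of S, flattened row-major
    R = np.empty((m * m, n * n))
    for i in range(m):
        for j in range(m):
            R[i * m + j] = S[i * n:(i + 1) * n, j * n:(j + 1) * n].ravel()
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(s[0]) * U[:, 0].reshape(m, m)
    B = np.sqrt(s[0]) * Vt[0].reshape(n, n)
    return A, B   # np.kron(A, B) is the nearest Kronecker product
```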