Related papers: $k$-PCA for (non-squared) Euclidean Distances: Polynomial Time Approximation

URL: http://arxiv.org/abs/2507.14631v1
Date: Sat, 19 Jul 2025 14:00:50 GMT
Title: $k$-PCA for (non-squared) Euclidean Distances: Polynomial Time Approximation
Authors: Daniel Greenhut, Dan Feldman
Abstract summary: Given an integer $k\geq1$ and a set $P$ of $n$ points in $\mathbb{R}^d$, the classic $k$-PCA approximates the affine $k$-subspace mean of $P$, which minimizes the sum of squared Euclidean distances to the points of $P$. Open code and experimental results on real-world datasets are also provided.
Score: 16.942733472657622
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Given an integer $k\geq1$ and a set $P$ of $n$ points in $\mathbb{R}^d$, the classic $k$-PCA (Principal Component Analysis) approximates the affine \emph{$k$-subspace mean} of $P$, which is the $k$-dimensional affine linear subspace that minimizes the sum of squared Euclidean distances ($\ell_{2,2}$-norm) to the points of $P$, i.e., the mean of these distances. The \emph{$k$-subspace median} is the subspace that minimizes the sum of (non-squared) Euclidean distances ($\ell_{2,1}$-mixed norm), i.e., their median. The median subspace is usually sparser and more robust to noise/outliers than the mean, but also much harder to approximate since, unlike the $\ell_{z,z}$ (non-mixed) norms, it is non-convex for $k<d-1$. We provide the first polynomial-time deterministic algorithm whose running time and approximation factor are both not exponential in $k$. More precisely, the multiplicative approximation factor is $\sqrt{d}$, and the running time is polynomial in the size of the input. We expect that our technique will be useful for many other related problems, such as the $\ell_{2,z}$ norm of distances for $z\notin\{1,2\}$, e.g., $z=\infty$, and handling outliers/sparsity. Open code and experimental results on real-world datasets are also provided.
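
The difference between the two objectives named in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's algorithm: it only computes the classic PCA subspace via an SVD and evaluates the $\ell_{2,2}$ (sum of squared distances) and $\ell_{2,1}$ (sum of distances) costs of candidate subspaces. All function names and the toy data are assumptions made for illustration.

```python
import numpy as np

def affine_pca_subspace(P, k):
    """Classic k-PCA: affine k-subspace minimizing the sum of SQUARED distances.

    Returns (center, basis), where the rows of basis are orthonormal.
    """
    center = P.mean(axis=0)
    # Rows of Vt are the right singular vectors of the centered data.
    _, _, Vt = np.linalg.svd(P - center, full_matrices=False)
    return center, Vt[:k]

def distances_to_subspace(P, center, basis):
    """Euclidean distance from each row of P to the affine subspace."""
    centered = P - center
    residual = centered - centered @ basis.T @ basis
    return np.linalg.norm(residual, axis=1)

def mean_cost(P, center, basis):
    """l_{2,2} objective: sum of squared Euclidean distances."""
    return float(np.sum(distances_to_subspace(P, center, basis) ** 2))

def median_cost(P, center, basis):
    """l_{2,1} objective: sum of (non-squared) Euclidean distances."""
    return float(np.sum(distances_to_subspace(P, center, basis)))

# Toy data (illustrative assumption): 50 points on the x-axis plus one outlier.
inliers = np.column_stack([np.linspace(-1.0, 1.0, 50), np.zeros(50)])
outlier = np.array([[0.0, 10.0]])
P = np.vstack([inliers, outlier])

# Candidate 1: the line returned by classic PCA (k = 1).
center, basis = affine_pca_subspace(P, k=1)

# Candidate 2: the x-axis itself, which contains every inlier.
axis_center = np.zeros(2)
axis_basis = np.array([[1.0, 0.0]])

print("PCA line:  l22 =", mean_cost(P, center, basis),
      " l21 =", median_cost(P, center, basis))
print("x-axis:    l22 =", mean_cost(P, axis_center, axis_basis),
      " l21 =", median_cost(P, axis_center, axis_basis))
```

On this toy input the single outlier pulls the PCA line toward the vertical direction, so the PCA line has the lower squared cost while the x-axis has the lower non-squared cost. This is one small instance of the robustness to outliers that the abstract attributes to the median subspace.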

Related papers

Guessing Efficiently for Constrained Subspace Approximation [49.83981776254246]
We introduce a general framework for constrained subspace approximation. We show it provides new algorithms for partition-constrained subspace approximation with applications to $k$-means clustering, and projected non-negative matrix factorization.
arXiv Detail & Related papers (2025-04-29T15:56:48Z)
Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch. We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work. We also show an $\Omega(k^{2/p} n^{1-2/p})$ lower bound for the sparse recovery problem, which is tight up to a $\mathrm{poly}(\log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z)
Parameterized Approximation for Robust Clustering in Discrete Geometric Spaces [2.687607197645453]
We show that even the special case of $k$-Center in dimension $\Theta(\log n)$ is $(\sqrt{3/2} - o(1))$-hard to approximate for FPT algorithms.
arXiv Detail & Related papers (2023-05-12T08:43:28Z)
TURF: A Two-factor, Universal, Robust, Fast Distribution Learning Algorithm [64.13217062232874]
One of its most powerful and successful modalities approximates every distribution to an $\ell_1$ distance essentially at most a constant times larger than its closest $t$-piece degree-$d$ polynomial. We provide a method that estimates this number near-optimally, and hence helps approach the best possible approximation.
arXiv Detail & Related papers (2022-02-15T03:49:28Z)
Low-Rank Approximation with $1/\epsilon^{1/3}$ Matrix-Vector Products [58.05771390012827]
We study iterative methods based on Krylov subspaces for low-rank approximation under any Schatten-$p$ norm. Our main result is an algorithm that uses only $\tilde{O}(k/\sqrt{\epsilon})$ matrix-vector products.
arXiv Detail & Related papers (2022-02-10T16:10:41Z)
Active Sampling for Linear Regression Beyond the $\ell_2$ Norm [70.49273459706546]
We study active sampling algorithms for linear regression, which aim to query only a small number of entries of a target vector. We show that this dependence on $d$ is optimal, up to logarithmic factors. We also provide the first total sensitivity upper bound $O(d^{\max\{1,p/2\}} \log^2 n)$ for loss functions with at most degree $p$ growth.
arXiv Detail & Related papers (2021-11-09T00:20:01Z)
Spectral properties of sample covariance matrices arising from random matrices with independent non identically distributed columns [50.053491972003656]
It was previously shown that the functionals $\mathrm{tr}(AR(z))$, for $R(z) = (\frac{1}{n}XX^T - zI_p)^{-1}$ and $A \in \mathcal{M}_p$ deterministic, have a standard deviation of order $O(\|A\|_* / \sqrt{n})$. Here, we show that $\|\mathbb{E}[R(z)] - \tilde{R}(z)\|_F$ ...
arXiv Detail & Related papers (2021-09-06T14:21:43Z)
Locally Private $k$-Means Clustering with Constant Multiplicative Approximation and Near-Optimal Additive Error [10.632986841188]
We bridge the gap between the exponents of $n$ in the upper and lower bounds on the additive error with two new algorithms. It is possible to solve the locally private $k$-means problem in a constant number of rounds with constant factor multiplicative approximation.
arXiv Detail & Related papers (2021-05-31T14:41:40Z)
Sparse sketches with small inversion bias [79.77110958547695]
Inversion bias arises when averaging estimates of quantities that depend on the inverse covariance. We develop a framework for analyzing inversion bias, based on our proposed concept of an $(\epsilon,\delta)$-unbiased estimator for random matrices. We show that when the sketching matrix $S$ is dense and has i.i.d. sub-gaussian entries, the estimator is $(\epsilon,\delta)$-unbiased for $(A^\top A)^{-1}$ with a sketch of size $m = O(d + \sqrt{d}/\ldots)$
arXiv Detail & Related papers (2020-11-21T01:33:15Z)
Subspace approximation with outliers [6.186553186139257]
We show how to extend dimension reduction techniques and bi-criteria approximations based on sampling to the problem of subspace approximation with outliers. Our results hold even when the fraction of outliers $\alpha$ is large, as long as the obvious condition $0 < \delta \leq 1 - \alpha$ is satisfied.
arXiv Detail & Related papers (2020-06-30T07:22:33Z)
Sets Clustering [25.358415142404752]
We prove that a core-set of $O(\log n)$ sets always exists, and can be computed in $O(n \log n)$ time. Applying an inefficient but optimal algorithm on this coreset allows us to obtain the first PTAS ($(1+\varepsilon)$-approximation) for the sets-$k$-means problem. Open source code and experimental results for document classification and facility locations are also provided.
arXiv Detail & Related papers (2020-03-09T13:30:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.