$k$-PCA for (non-squared) Euclidean Distances: Polynomial Time Approximation
- URL: http://arxiv.org/abs/2507.14631v1
- Date: Sat, 19 Jul 2025 14:00:50 GMT
- Title: $k$-PCA for (non-squared) Euclidean Distances: Polynomial Time Approximation
- Authors: Daniel Greenhut, Dan Feldman,
- Abstract summary: Given an integer $kgeq1$ and a set $P$ of $n$ points in $REALd$, the classic approximation $k$-PCA approximates affinemph$fty distances.<n>Open code and experimental results on real-world datasets are also provided.
- Score: 16.942733472657622
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Given an integer $k\geq1$ and a set $P$ of $n$ points in $\REAL^d$, the classic $k$-PCA (Principle Component Analysis) approximates the affine \emph{$k$-subspace mean} of $P$, which is the $k$-dimensional affine linear subspace that minimizes its sum of squared Euclidean distances ($\ell_{2,2}$-norm) over the points of $P$, i.e., the mean of these distances. The \emph{$k$-subspace median} is the subspace that minimizes its sum of (non-squared) Euclidean distances ($\ell_{2,1}$-mixed norm), i.e., their median. The median subspace is usually more sparse and robust to noise/outliers than the mean, but also much harder to approximate since, unlike the $\ell_{z,z}$ (non-mixed) norms, it is non-convex for $k<d-1$. We provide the first polynomial-time deterministic algorithm whose both running time and approximation factor are not exponential in $k$. More precisely, the multiplicative approximation factor is $\sqrt{d}$, and the running time is polynomial in the size of the input. We expect that our technique would be useful for many other related problems, such as $\ell_{2,z}$ norm of distances for $z\not \in \br{1,2}$, e.g., $z=\infty$, and handling outliers/sparsity. Open code and experimental results on real-world datasets are also provided.
Related papers
- Guessing Efficiently for Constrained Subspace Approximation [49.83981776254246]
We introduce a general framework for constrained subspace approximation.<n>We show it provides new algorithms for partition-constrained subspace approximation with applications to $k$-means clustering, and projected non-negative matrix factorization.
arXiv Detail & Related papers (2025-04-29T15:56:48Z) - Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch.
We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work.
We also show an $Omega(k2/pn1-2/p)$ lower bound for the sparse recovery problem, which is tight up to a $mathrmpoly(log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z) - Parameterized Approximation for Robust Clustering in Discrete Geometric Spaces [2.687607197645453]
We show that even the special case of $k$-Center in dimension $Theta(log n)$ is $(sqrt3/2- o(1))$hard to approximate for FPT algorithms.
We also show that even the special case of $k$-Center in dimension $Theta(log n)$ is $(sqrt3/2- o(1))$hard to approximate for FPT algorithms.
arXiv Detail & Related papers (2023-05-12T08:43:28Z) - TURF: A Two-factor, Universal, Robust, Fast Distribution Learning
Algorithm [64.13217062232874]
One of its most powerful and successful modalities approximates every distribution to an $ell$ distance essentially at most a constant times larger than its closest $t$-piece degree-$d_$.
We provide a method that estimates this number near-optimally, hence helps approach the best possible approximation.
arXiv Detail & Related papers (2022-02-15T03:49:28Z) - Low-Rank Approximation with $1/\epsilon^{1/3}$ Matrix-Vector Products [58.05771390012827]
We study iterative methods based on Krylov subspaces for low-rank approximation under any Schatten-$p$ norm.
Our main result is an algorithm that uses only $tildeO(k/sqrtepsilon)$ matrix-vector products.
arXiv Detail & Related papers (2022-02-10T16:10:41Z) - Active Sampling for Linear Regression Beyond the $\ell_2$ Norm [70.49273459706546]
We study active sampling algorithms for linear regression, which aim to query only a small number of entries of a target vector.
We show that this dependence on $d$ is optimal, up to logarithmic factors.
We also provide the first total sensitivity upper bound $O(dmax1,p/2log2 n)$ for loss functions with at most degree $p$ growth.
arXiv Detail & Related papers (2021-11-09T00:20:01Z) - Spectral properties of sample covariance matrices arising from random
matrices with independent non identically distributed columns [50.053491972003656]
It was previously shown that the functionals $texttr(AR(z))$, for $R(z) = (frac1nXXT- zI_p)-1$ and $Ain mathcal M_p$ deterministic, have a standard deviation of order $O(|A|_* / sqrt n)$.
Here, we show that $|mathbb E[R(z)] - tilde R(z)|_F
arXiv Detail & Related papers (2021-09-06T14:21:43Z) - Locally Private $k$-Means Clustering with Constant Multiplicative
Approximation and Near-Optimal Additive Error [10.632986841188]
We bridge the gap between the exponents of $n$ in the upper and lower bounds on the additive error with two new algorithms.
It is possible to solve the locally private $k$-means problem in a constant number of rounds with constant factor multiplicative approximation.
arXiv Detail & Related papers (2021-05-31T14:41:40Z) - Sparse sketches with small inversion bias [79.77110958547695]
Inversion bias arises when averaging estimates of quantities that depend on the inverse covariance.
We develop a framework for analyzing inversion bias, based on our proposed concept of an $(epsilon,delta)$-unbiased estimator for random matrices.
We show that when the sketching matrix $S$ is dense and has i.i.d. sub-gaussian entries, the estimator $(epsilon,delta)$-unbiased for $(Atop A)-1$ with a sketch of size $m=O(d+sqrt d/
arXiv Detail & Related papers (2020-11-21T01:33:15Z) - Subspace approximation with outliers [6.186553186139257]
We show how to extend dimension reduction techniques and bi-criteria approximations based on sampling to the problem of subspace approximation with outliers.
Our results hold even when the fraction of outliers $alpha$ is large, as long as the obvious condition $0 delta leq 1 - alpha$ is satisfied.
arXiv Detail & Related papers (2020-06-30T07:22:33Z) - Sets Clustering [25.358415142404752]
We prove that a core-set of $O(logn)$ sets always exists, and can be computed in $O(nlogn)$ time.
Applying an inefficient but optimal algorithm on this coreset allows us to obtain the first PTAS ($1+varepsilon$ approximation) for the sets-$k$-means problem.
Open source code and experimental results for document classification and facility locations are also provided.
arXiv Detail & Related papers (2020-03-09T13:30:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.