Coreset for Robust Geometric Median: Eliminating Size Dependency on Outliers
- URL: http://arxiv.org/abs/2510.24621v1
- Date: Tue, 28 Oct 2025 16:49:03 GMT
- Title: Coreset for Robust Geometric Median: Eliminating Size Dependency on Outliers
- Authors: Ziyi Fang, Lingxiao Huang, Runkai Yang
- Abstract summary: We construct a coreset of size $\tilde{O}(\varepsilon^{-2} \cdot \min\{\varepsilon^{-2}, d\})$ when $n \geq 4m$. For the special case of $d = 1$, we achieve an optimal coreset size of $\tilde{\Theta}(\varepsilon^{-1/2} + \frac{m}{n} \varepsilon^{-1})$. Our results extend to robust $(k,z)$-clustering.
- Score: 9.19267739465056
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the robust geometric median problem in Euclidean space $\mathbb{R}^d$, with a focus on coreset construction. A coreset is a compact summary of a dataset $P$ of size $n$ that approximates the robust cost for all centers $c$ within a multiplicative error $\varepsilon$. Given an outlier count $m$, we construct a coreset of size $\tilde{O}(\varepsilon^{-2} \cdot \min\{\varepsilon^{-2}, d\})$ when $n \geq 4m$, eliminating the $O(m)$ dependency present in prior work [Huang et al., 2022 & 2023]. For the special case of $d = 1$, we achieve an optimal coreset size of $\tilde{\Theta}(\varepsilon^{-1/2} + \frac{m}{n} \varepsilon^{-1})$, revealing a clear separation from the vanilla case studied in [Huang et al., 2023; Afshani and Chris, 2024]. Our results further extend to robust $(k,z)$-clustering in various metric spaces, eliminating the $m$-dependence under mild data assumptions. The key technical contribution is a novel non-component-wise error analysis, enabling substantial reduction of outlier influence, unlike prior methods that retain them. Empirically, our algorithms consistently outperform existing baselines in terms of size-accuracy tradeoffs and runtime, even when data assumptions are violated across a wide range of datasets.
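To make the objective in the abstract concrete (this illustrates the robust cost itself, not the paper's coreset construction): with outlier count $m$, the robust cost of a center $c$ sums only the $n - m$ smallest distances, discarding the $m$ farthest points as outliers. A minimal NumPy sketch:

```python
import numpy as np

def robust_cost(P, c, m):
    """Robust geometric median cost of center c for point set P:
    the sum of the n - m smallest Euclidean distances, i.e. the m
    farthest points are treated as outliers and discarded."""
    dists = np.linalg.norm(P - c, axis=1)
    kept = np.sort(dists)[: len(P) - m]
    return float(kept.sum())

# Toy data: a tight cluster around the origin plus one far outlier.
P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [-1.0, 0.0], [0.0, -1.0], [100.0, 0.0]])
c = np.zeros(2)
print(robust_cost(P, c, m=1))  # outlier excluded: 0+1+1+1+1 = 4.0
print(robust_cost(P, c, m=0))  # outlier included: 104.0
```

A coreset in this setting is a small weighted subset $Q$ whose weighted robust cost stays within a $(1 \pm \varepsilon)$ factor of `robust_cost(P, c, m)` for every center $c$ simultaneously; the paper's contribution is achieving this with a size independent of $m$ when $n \geq 4m$.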
Related papers
- Computational Hardness of Private Coreset [84.99100741615423]
For a given input set of points, a coreset is another set of points such that the $k$-means objective for any candidate solution is preserved up to a multiplicative $(1 \pm 1/n^{O(1)})$ factor (and some additive factor). We show that no polynomial-time $(\varepsilon, 1/n^{O(1)})$-DP algorithm can compute a coreset for $k$-means in the $\ell_\infty$-metric for some constant $\varepsilon > 0$ (and some constant additive factor).
arXiv Detail & Related papers (2026-02-19T15:58:49Z) - Linear time small coresets for k-mean clustering of segments with applications [4.759823735082844]
We study the $k$-means problem for a set $\mathcal{S} \subseteq \mathbb{R}^d$ of $n$ segments. For any $\varepsilon > 0$, an $\varepsilon$-coreset is a weighted subset $C \subseteq \mathbb{R}^d$ that approximates $D(\mathcal{S}, X)$ within a factor of $1 \pm \varepsilon$ for any set of $k$ centers.
arXiv Detail & Related papers (2025-11-16T11:48:55Z) - Sublinear Time Quantum Sensitivity Sampling [57.356528942341534]
We present a unified framework for quantum sensitivity sampling, extending the advantages of quantum computing to a broad class of classical approximation problems. Our framework provides a streamlined approach for constructing coresets and offers significant runtime improvements in applications such as clustering, regression, and low-rank approximation.
arXiv Detail & Related papers (2025-09-20T20:18:49Z) - Guessing Efficiently for Constrained Subspace Approximation [49.83981776254246]
We introduce a general framework for constrained subspace approximation. We show it provides new algorithms for partition-constrained subspace approximation with applications to $k$-means clustering, and projected non-negative matrix factorization.
arXiv Detail & Related papers (2025-04-29T15:56:48Z) - Ridge Leverage Score Sampling for $\ell_p$ Subspace Approximation [47.790126028106734]
A popular approach to cope with the NP-hardness is to compute a strong coreset. We obtain an algorithm for constructing a strong coreset for $\ell_p$ subspace approximation of size $\tilde{O}(k\epsilon^{-4/p})$ for $p < 2$ and $\tilde{O}(k^{p/2}\epsilon^{-p})$ for $p > 2$.
arXiv Detail & Related papers (2024-07-03T16:49:28Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - Data Structures for Density Estimation [66.36971978162461]
Given a sublinear (in $n$) number of samples from $p$, our main result is the first data structure that identifies $v_i$ in time sublinear in $k$.
We also give an improved version of the algorithm of Acharya et al. that reports $v_i$ in time linear in $k$.
arXiv Detail & Related papers (2023-06-20T06:13:56Z) - Parameterized Approximation for Robust Clustering in Discrete Geometric Spaces [2.687607197645453]
We show that even the special case of $k$-Center in dimension $\Theta(\log n)$ is $(\sqrt{3}/2 - o(1))$-hard to approximate for FPT algorithms.
arXiv Detail & Related papers (2023-05-12T08:43:28Z) - New Coresets for Projective Clustering and Applications [34.82221047030618]
Given a set of points $P$ in $\mathbb{R}^d$, the goal is to find $k$ flats of dimension $j$, i.e., affine subspaces, that best fit $P$ under a given distance measure.
We show that our construction provides efficient coreset constructions for Cauchy, Welsch, Huber, Geman-McClure, Tukey, $L_\infty$, and Fair regression.
arXiv Detail & Related papers (2022-03-08T19:50:27Z) - Towards Optimal Lower Bounds for k-median and k-means Coresets [25.713987341159918]
Given a set of points in a metric space, the $(k,z)$-clustering problem consists of finding a set of $k$ points called centers.
We show that any coreset for $(k,z)$-clustering must consist of at least $\Omega(k \varepsilon^{-2} \log n)$ and $\Omega(k \varepsilon^{-2} D)$ points.
arXiv Detail & Related papers (2022-02-25T16:13:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.