Related papers: Improved Coresets for Euclidean $k$-Means

Improved Coresets for Euclidean $k$-Means

URL: http://arxiv.org/abs/2211.08184v2
Date: Wed, 16 Nov 2022 06:42:40 GMT
Title: Improved Coresets for Euclidean $k$-Means
Authors: Vincent Cohen-Addad and Kasper Green Larsen and David Saulpic and Chris Schwiegelshohn and Omar Ali Sheikh-Omar
Abstract summary: Given a set of $n$ points in $d$ dimensions, the Euclidean $k$-means problem (resp. the Euclidean $k$-median problem) consists of finding $k$ centers. In this paper, we improve the upper bounds $tilde O(min(k2 cdot varepsilon-2,kcdot varepsilon-4)$ for $k$-means and $tilde O(min(k4/3 cdot varepsilon
Score: 24.850829728643923
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Given a set of $n$ points in $d$ dimensions, the Euclidean $k$-means problem (resp. the Euclidean $k$-median problem) consists of finding $k$ centers such that the sum of squared distances (resp. sum of distances) from every point to its closest center is minimized. The arguably most popular way of dealing with this problem in the big data setting is to first compress the data by computing a weighted subset known as a coreset and then run any algorithm on this subset. The guarantee of the coreset is that for any candidate solution, the ratio between coreset cost and the cost of the original instance is less than a $(1\pm \varepsilon)$ factor. The current state of the art coreset size is $\tilde O(\min(k^{2} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-4}))$ for Euclidean $k$-means and $\tilde O(\min(k^{2} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-3}))$ for Euclidean $k$-median. The best known lower bound for both problems is $\Omega(k \varepsilon^{-2})$. In this paper, we improve the upper bounds $\tilde O(\min(k^{3/2} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-4}))$ for $k$-means and $\tilde O(\min(k^{4/3} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-3}))$ for $k$-median. In particular, ours is the first provable bound that breaks through the $k^2$ barrier while retaining an optimal dependency on $\varepsilon$.

Related papers

Guessing Efficiently for Constrained Subspace Approximation [49.83981776254246]
We introduce a general framework for constrained subspace approximation. We show it provides new algorithms for partition-constrained subspace approximation with applications to $k$-means clustering, and projected non-negative matrix factorization.
arXiv Detail & Related papers (2025-04-29T15:56:48Z)
A Tight VC-Dimension Analysis of Clustering Coresets with Applications [19.213536807823836]
We consider coresets for $k$-clustering problems, where the goal is to assign points to centers minimizing powers of distances. Given a point set $P$, a coreset $Omega$ is a small weighted subset that approximates the cost of $P$ for all candidate solutions. We obtain improved $k$-median coreset bounds for the following metrics.
arXiv Detail & Related papers (2025-01-11T17:00:57Z)
Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch. We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work. We also show an $Omega(k2/pn1-2/p)$ lower bound for the sparse recovery problem, which is tight up to a $mathrmpoly(log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z)
Nearly Linear Sparsification of $\ell_p$ Subspace Approximation [47.790126028106734]
A popular approach to cope with the NP-hardness of the $ell_p$ subspace approximation problem is to compute a strong coreset. We obtain the first algorithm for constructing a strong coreset for $ell_p$ subspace approximation with a nearly optimal dependence on the rank parameter $k$. Our techniques also lead to the first nearly optimal online strong coresets for $ell_p$ subspace approximation with similar bounds as the offline setting.
arXiv Detail & Related papers (2024-07-03T16:49:28Z)
Efficient Continual Finite-Sum Minimization [52.5238287567572]
We propose a key twist into the finite-sum minimization, dubbed as continual finite-sum minimization. Our approach significantly improves upon the $mathcalO(n/epsilon)$ FOs that $mathrmStochasticGradientDescent$ requires. We also prove that there is no natural first-order method with $mathcalOleft(n/epsilonalpharight)$ complexity gradient for $alpha 1/4$, establishing that the first-order complexity of our method is nearly tight.
arXiv Detail & Related papers (2024-06-07T08:26:31Z)
Coresets for Multiple $\ell_p$ Regression [47.790126028106734]
We construct coresets of size $tilde O(varepsilon-2d)$ for $p2$ and $tilde O(varepsilon-pdp/2)$ for $p>2$. For $1p2$, every matrix has a subset of $tilde O(varepsilon-1k)$ rows which spans a $(varepsilon-1k)$-approximately optimal $k$-dimensional subspace for $ell_p$ subspace approximation
arXiv Detail & Related papers (2024-06-04T15:50:42Z)
$\ell_p$-Regression in the Arbitrary Partition Model of Communication [59.89387020011663]
We consider the randomized communication complexity of the distributed $ell_p$-regression problem in the coordinator model. For $p = 2$, i.e., least squares regression, we give the first optimal bound of $tildeTheta(sd2 + sd/epsilon)$ bits. For $p in (1,2)$,we obtain an $tildeO(sd2/epsilon + sd/mathrmpoly(epsilon)$ upper bound.
arXiv Detail & Related papers (2023-07-11T08:51:53Z)
Towards Optimal Lower Bounds for k-median and k-means Coresets [25.713987341159918]
Given a set of points in a metric space, the $(k,z)$-clustering problem consists of finding a set of $k$ points called centers. We show that any coreset for $(k,z)$ clustering must consist of at least $Omega(k varepsilon-2 log n)$ and $Omega(k varepsilon-2 D)$ points.
arXiv Detail & Related papers (2022-02-25T16:13:28Z)
Low-Rank Approximation with $1/\epsilon^{1/3}$ Matrix-Vector Products [58.05771390012827]
We study iterative methods based on Krylov subspaces for low-rank approximation under any Schatten-$p$ norm. Our main result is an algorithm that uses only $tildeO(k/sqrtepsilon)$ matrix-vector products.
arXiv Detail & Related papers (2022-02-10T16:10:41Z)
Approximate Maximum Halfspace Discrepancy [6.35821487778241]
We consider the range space $(X, mathcalH_d)$ where $X subset mathbbRd$ and $mathcalH_d$ is the set of ranges defined by $d$ halfspaces. For each halfspace $h in mathcalH_d$ define a function $Phi(h)$ that measures the "difference" between the fraction of red and fraction of blue points which fall in the range $h$.
arXiv Detail & Related papers (2021-06-25T19:14:45Z)
Sets Clustering [25.358415142404752]
We prove that a core-set of $O(logn)$ sets always exists, and can be computed in $O(nlogn)$ time. Applying an inefficient but optimal algorithm on this coreset allows us to obtain the first PTAS ($1+varepsilon$ approximation) for the sets-$k$-means problem. Open source code and experimental results for document classification and facility locations are also provided.
arXiv Detail & Related papers (2020-03-09T13:30:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.