Improved Coresets for Euclidean $k$-Means
- URL: http://arxiv.org/abs/2211.08184v2
- Date: Wed, 16 Nov 2022 06:42:40 GMT
- Title: Improved Coresets for Euclidean $k$-Means
- Authors: Vincent Cohen-Addad and Kasper Green Larsen and David Saulpic and
Chris Schwiegelshohn and Omar Ali Sheikh-Omar
- Abstract summary: Given a set of $n$ points in $d$ dimensions, the Euclidean $k$-means problem (resp. the Euclidean $k$-median problem) consists of finding $k$ centers.
In this paper, we improve the upper bounds $tilde O(min(k2 cdot varepsilon-2,kcdot varepsilon-4)$ for $k$-means and $tilde O(min(k4/3 cdot varepsilon
- Score: 24.850829728643923
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given a set of $n$ points in $d$ dimensions, the Euclidean $k$-means problem
(resp. the Euclidean $k$-median problem) consists of finding $k$ centers such
that the sum of squared distances (resp. sum of distances) from every point to
its closest center is minimized. The arguably most popular way of dealing with
this problem in the big data setting is to first compress the data by computing
a weighted subset known as a coreset and then run any algorithm on this subset.
The guarantee of the coreset is that for any candidate solution, the ratio
between coreset cost and the cost of the original instance is less than a
$(1\pm \varepsilon)$ factor. The current state of the art coreset size is
$\tilde O(\min(k^{2} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-4}))$ for
Euclidean $k$-means and $\tilde O(\min(k^{2} \cdot \varepsilon^{-2},k\cdot
\varepsilon^{-3}))$ for Euclidean $k$-median. The best known lower bound for
both problems is $\Omega(k \varepsilon^{-2})$. In this paper, we improve the
upper bounds $\tilde O(\min(k^{3/2} \cdot \varepsilon^{-2},k\cdot
\varepsilon^{-4}))$ for $k$-means and $\tilde O(\min(k^{4/3} \cdot
\varepsilon^{-2},k\cdot \varepsilon^{-3}))$ for $k$-median. In particular, ours
is the first provable bound that breaks through the $k^2$ barrier while
retaining an optimal dependency on $\varepsilon$.
Related papers
- Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch.
We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work.
We also show an $Omega(k2/pn1-2/p)$ lower bound for the sparse recovery problem, which is tight up to a $mathrmpoly(log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z) - Nearly Linear Sparsification of $\ell_p$ Subspace Approximation [47.790126028106734]
A popular approach to cope with the NP-hardness of the $ell_p$ subspace approximation problem is to compute a strong coreset.
We obtain the first algorithm for constructing a strong coreset for $ell_p$ subspace approximation with a nearly optimal dependence on the rank parameter $k$.
Our techniques also lead to the first nearly optimal online strong coresets for $ell_p$ subspace approximation with similar bounds as the offline setting.
arXiv Detail & Related papers (2024-07-03T16:49:28Z) - Efficient Continual Finite-Sum Minimization [52.5238287567572]
We propose a key twist into the finite-sum minimization, dubbed as continual finite-sum minimization.
Our approach significantly improves upon the $mathcalO(n/epsilon)$ FOs that $mathrmStochasticGradientDescent$ requires.
We also prove that there is no natural first-order method with $mathcalOleft(n/epsilonalpharight)$ complexity gradient for $alpha 1/4$, establishing that the first-order complexity of our method is nearly tight.
arXiv Detail & Related papers (2024-06-07T08:26:31Z) - Coresets for Multiple $\ell_p$ Regression [47.790126028106734]
We construct coresets of size $tilde O(varepsilon-2d)$ for $p2$ and $tilde O(varepsilon-pdp/2)$ for $p>2$.
For $1p2$, every matrix has a subset of $tilde O(varepsilon-1k)$ rows which spans a $(varepsilon-1k)$-approximately optimal $k$-dimensional subspace for $ell_p$ subspace approximation
arXiv Detail & Related papers (2024-06-04T15:50:42Z) - $\ell_p$-Regression in the Arbitrary Partition Model of Communication [59.89387020011663]
We consider the randomized communication complexity of the distributed $ell_p$-regression problem in the coordinator model.
For $p = 2$, i.e., least squares regression, we give the first optimal bound of $tildeTheta(sd2 + sd/epsilon)$ bits.
For $p in (1,2)$,we obtain an $tildeO(sd2/epsilon + sd/mathrmpoly(epsilon)$ upper bound.
arXiv Detail & Related papers (2023-07-11T08:51:53Z) - Towards Optimal Lower Bounds for k-median and k-means Coresets [25.713987341159918]
Given a set of points in a metric space, the $(k,z)$-clustering problem consists of finding a set of $k$ points called centers.
We show that any coreset for $(k,z)$ clustering must consist of at least $Omega(k varepsilon-2 log n)$ and $Omega(k varepsilon-2 D)$ points.
arXiv Detail & Related papers (2022-02-25T16:13:28Z) - Low-Rank Approximation with $1/\epsilon^{1/3}$ Matrix-Vector Products [58.05771390012827]
We study iterative methods based on Krylov subspaces for low-rank approximation under any Schatten-$p$ norm.
Our main result is an algorithm that uses only $tildeO(k/sqrtepsilon)$ matrix-vector products.
arXiv Detail & Related papers (2022-02-10T16:10:41Z) - Approximate Maximum Halfspace Discrepancy [6.35821487778241]
We consider the range space $(X, mathcalH_d)$ where $X subset mathbbRd$ and $mathcalH_d$ is the set of ranges defined by $d$ halfspaces.
For each halfspace $h in mathcalH_d$ define a function $Phi(h)$ that measures the "difference" between the fraction of red and fraction of blue points which fall in the range $h$.
arXiv Detail & Related papers (2021-06-25T19:14:45Z) - Sets Clustering [25.358415142404752]
We prove that a core-set of $O(logn)$ sets always exists, and can be computed in $O(nlogn)$ time.
Applying an inefficient but optimal algorithm on this coreset allows us to obtain the first PTAS ($1+varepsilon$ approximation) for the sets-$k$-means problem.
Open source code and experimental results for document classification and facility locations are also provided.
arXiv Detail & Related papers (2020-03-09T13:30:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.