A Tight VC-Dimension Analysis of Clustering Coresets with Applications
- URL: http://arxiv.org/abs/2501.06588v1
- Date: Sat, 11 Jan 2025 17:00:57 GMT
- Title: A Tight VC-Dimension Analysis of Clustering Coresets with Applications
- Authors: Vincent Cohen-Addad, Andrew Draganov, Matteo Russo, David Saulpic, Chris Schwiegelshohn,
- Abstract summary: We consider coresets for $k$-clustering problems, where the goal is to assign points to centers minimizing powers of distances.
Given a point set $P$, a coreset $Omega$ is a small weighted subset that approximates the cost of $P$ for all candidate solutions.
We obtain improved $k$-median coreset bounds for the following metrics.
- Score: 19.213536807823836
- License:
- Abstract: We consider coresets for $k$-clustering problems, where the goal is to assign points to centers minimizing powers of distances. A popular example is the $k$-median objective $\sum_{p}\min_{c\in C}dist(p,C)$. Given a point set $P$, a coreset $\Omega$ is a small weighted subset that approximates the cost of $P$ for all candidate solutions $C$ up to a $(1\pm\varepsilon )$ multiplicative factor. In this paper, we give a sharp VC-dimension based analysis for coreset construction. As a consequence, we obtain improved $k$-median coreset bounds for the following metrics: Coresets of size $\tilde{O}\left(k\varepsilon^{-2}\right)$ for shortest path metrics in planar graphs, improving over the bounds $\tilde{O}\left(k\varepsilon^{-6}\right)$ by [Cohen-Addad, Saulpic, Schwiegelshohn, STOC'21] and $\tilde{O}\left(k^2\varepsilon^{-4}\right)$ by [Braverman, Jiang, Krauthgamer, Wu, SODA'21]. Coresets of size $\tilde{O}\left(kd\ell\varepsilon^{-2}\log m\right)$ for clustering $d$-dimensional polygonal curves of length at most $m$ with curves of length at most $\ell$ with respect to Frechet metrics, improving over the bounds $\tilde{O}\left(k^3d\ell\varepsilon^{-3}\log m\right)$ by [Braverman, Cohen-Addad, Jiang, Krauthgamer, Schwiegelshohn, Toftrup, and Wu, FOCS'22] and $\tilde{O}\left(k^2d\ell\varepsilon^{-2}\log m \log |P|\right)$ by [Conradi, Kolbe, Psarros, Rohde, SoCG'24].
Related papers
- Nearly Linear Sparsification of $\ell_p$ Subspace Approximation [47.790126028106734]
A popular approach to cope with the NP-hardness of the $ell_p$ subspace approximation problem is to compute a strong coreset.
We obtain the first algorithm for constructing a strong coreset for $ell_p$ subspace approximation with a nearly optimal dependence on the rank parameter $k$.
Our techniques also lead to the first nearly optimal online strong coresets for $ell_p$ subspace approximation with similar bounds as the offline setting.
arXiv Detail & Related papers (2024-07-03T16:49:28Z) - Coresets for Multiple $\ell_p$ Regression [47.790126028106734]
We construct coresets of size $tilde O(varepsilon-2d)$ for $p2$ and $tilde O(varepsilon-pdp/2)$ for $p>2$.
For $1p2$, every matrix has a subset of $tilde O(varepsilon-1k)$ rows which spans a $(varepsilon-1k)$-approximately optimal $k$-dimensional subspace for $ell_p$ subspace approximation
arXiv Detail & Related papers (2024-06-04T15:50:42Z) - $\ell_p$-Regression in the Arbitrary Partition Model of Communication [59.89387020011663]
We consider the randomized communication complexity of the distributed $ell_p$-regression problem in the coordinator model.
For $p = 2$, i.e., least squares regression, we give the first optimal bound of $tildeTheta(sd2 + sd/epsilon)$ bits.
For $p in (1,2)$,we obtain an $tildeO(sd2/epsilon + sd/mathrmpoly(epsilon)$ upper bound.
arXiv Detail & Related papers (2023-07-11T08:51:53Z) - Improved Coresets for Euclidean $k$-Means [24.850829728643923]
Given a set of $n$ points in $d$ dimensions, the Euclidean $k$-means problem (resp. the Euclidean $k$-median problem) consists of finding $k$ centers.
In this paper, we improve the upper bounds $tilde O(min(k2 cdot varepsilon-2,kcdot varepsilon-4)$ for $k$-means and $tilde O(min(k4/3 cdot varepsilon
arXiv Detail & Related papers (2022-11-15T14:47:24Z) - Towards Optimal Lower Bounds for k-median and k-means Coresets [25.713987341159918]
Given a set of points in a metric space, the $(k,z)$-clustering problem consists of finding a set of $k$ points called centers.
We show that any coreset for $(k,z)$ clustering must consist of at least $Omega(k varepsilon-2 log n)$ and $Omega(k varepsilon-2 D)$ points.
arXiv Detail & Related papers (2022-02-25T16:13:28Z) - Low-Rank Approximation with $1/\epsilon^{1/3}$ Matrix-Vector Products [58.05771390012827]
We study iterative methods based on Krylov subspaces for low-rank approximation under any Schatten-$p$ norm.
Our main result is an algorithm that uses only $tildeO(k/sqrtepsilon)$ matrix-vector products.
arXiv Detail & Related papers (2022-02-10T16:10:41Z) - Nearly Horizon-Free Offline Reinforcement Learning [97.36751930393245]
We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes with $S$ states, $A$ actions and planning horizon $H$.
We obtain the first set of nearly $H$-free sample complexity bounds for evaluation and planning using the empirical MDPs.
arXiv Detail & Related papers (2021-03-25T18:52:17Z) - Optimal Coreset for Gaussian Kernel Density Estimation [0.8376091455761259]
Given a point set $Psubset mathbbRd$, the kernel density estimate of $P$ is defined as [ overlinemathcalG_P(x) = frac1left|Pright|sum_pin Pe-leftlVert x-p rightrVert2 ] for any $xinmathbbRd$.
We study how to construct a small subset $Q$ of $P
arXiv Detail & Related papers (2020-07-15T22:58:50Z) - Sets Clustering [25.358415142404752]
We prove that a core-set of $O(logn)$ sets always exists, and can be computed in $O(nlogn)$ time.
Applying an inefficient but optimal algorithm on this coreset allows us to obtain the first PTAS ($1+varepsilon$ approximation) for the sets-$k$-means problem.
Open source code and experimental results for document classification and facility locations are also provided.
arXiv Detail & Related papers (2020-03-09T13:30:30Z) - Fixed-Support Wasserstein Barycenters: Computational Hardness and Fast
Algorithm [100.11971836788437]
We study the fixed-support Wasserstein barycenter problem (FS-WBP)
We develop a provably fast textitdeterministic variant of the celebrated iterative Bregman projection (IBP) algorithm, named textscFastIBP.
arXiv Detail & Related papers (2020-02-12T03:40:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.