Universal Weak Coreset
- URL: http://arxiv.org/abs/2305.16890v1
- Date: Fri, 26 May 2023 12:51:16 GMT
- Title: Universal Weak Coreset
- Authors: Ragesh Jaiswal and Amit Kumar
- Abstract summary: A weak coreset is a pair $(J,S)$ of subsets of points, where $S$ acts as a summary of the point set and $J$ as a set of potential centers.
We develop this framework, which we call universal weak coresets, for constrained clustering settings.
- Score: 3.1509756165776635
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Coresets for $k$-means and $k$-median problems yield a small summary of the
data, which preserve the clustering cost with respect to any set of $k$
centers. Recently coresets have also been constructed for constrained $k$-means
and $k$-median problems. However, the notion of coresets has the drawback that
(i) they can only be applied in settings where the input points are allowed to
have weights, and (ii) in general metric spaces, the size of the coresets can
depend logarithmically on the number of points. The notion of weak coresets,
which have less stringent requirements than coresets, has been studied in the
context of classical $k$-means and $k$-median problems. A weak coreset is a
pair $(J,S)$ of subsets of points, where $S$ acts as a summary of the point set
and $J$ as a set of potential centers. This pair satisfies the properties that
(i) $S$ is a good summary of the data as long as the $k$ centers are chosen
from $J$ only, and (ii) there is a good choice of $k$ centers in $J$ with cost
close to the optimal cost. We develop this framework, which we call universal
weak coresets, for constrained clustering settings. In conjunction with recent
coreset constructions for constrained settings, our designs give greater data
compression, are conceptually simpler, and apply to a wide range of constrained
$k$-median and $k$-means problems.
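The $(J,S)$ interface described in the abstract can be sketched in code. The construction below uses plain uniform sampling as a stand-in for the paper's actual (unspecified here) construction, and all function names are hypothetical; it only illustrates how a candidate-center set $J$ and a weighted summary $S$ would be consumed:

```python
import random

def toy_weak_coreset(points, j_size, s_size, seed=0):
    """Toy stand-in for a weak-coreset construction: draw the candidate
    centers J and the summary S by uniform sampling (the paper's actual
    sampling scheme is more careful; this only shows the (J, S) interface)."""
    rng = random.Random(seed)
    J = rng.sample(points, j_size)   # potential centers
    S = rng.sample(points, s_size)   # summary points
    # each summary point stands in for n / |S| original points
    weights = {p: len(points) / s_size for p in S}
    return J, S, weights

def kmedian_cost(centers, weighted_points):
    """Weighted k-median cost: each summary point pays its weight times
    the distance to the nearest chosen center."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return sum(w * min(dist(p, c) for c in centers)
               for p, w in weighted_points.items())
```

Under the weak-coreset guarantee, evaluating `kmedian_cost` on $S$ for any $k$ centers chosen from $J$ would approximate the true cost on the full data.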
Related papers
- Relax and Merge: A Simple Yet Effective Framework for Solving Fair $k$-Means and $k$-sparse Wasserstein Barycenter Problems [8.74967598360817]
Given a dataset comprising several groups, the fairness constraint requires that each cluster should contain a proportion of points from each group.
We propose a novel ``Relax and Merge'' framework, where $\rho$ is the approximation ratio of an off-the-shelf vanilla $k$-means algorithm.
If equipped with a PTAS of $k$-means, our solution can achieve an approximation ratio of $(5+O(\epsilon))$ with only a slight violation of the fairness constraints.
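The group-fairness condition described above is easy to state mechanically: a clustering is fair when each cluster's group proportions match the global ones. The helper below is a hypothetical illustration of that check, not part of the paper's method:

```python
from collections import Counter

def fairness_violation(clusters, groups):
    """Maximum deviation between any cluster's group proportions and the
    global group proportions.  `clusters` maps cluster id -> list of point
    ids (assumed nonempty); `groups` maps point id -> group label.
    Returns 0.0 for a perfectly proportional clustering."""
    n = len(groups)
    global_frac = {g: c / n for g, c in Counter(groups.values()).items()}
    worst = 0.0
    for members in clusters.values():
        local = Counter(groups[p] for p in members)
        for g, target in global_frac.items():
            worst = max(worst, abs(local.get(g, 0) / len(members) - target))
    return worst
```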
arXiv Detail & Related papers (2024-11-02T02:50:12Z)
- Clustering to Minimize Cluster-Aware Norm Objectives [0.3481985817302898]
We seek to partition a given set $P$ of data points into $k$ clusters by finding a set $X$ of $k$ centers.
The cost of a cluster, represented by a center $x \in X$, is a monotone, symmetric norm $f$ (inner norm) of the vector of distances of points assigned to $x$.
The goal is to minimize a norm $g$ (outer norm) of the vector of cluster costs.
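This two-level objective can be sketched directly, assuming Euclidean points and Python callables standing in for the inner norm $f$ and outer norm $g$ (the function name is hypothetical):

```python
def norm_clustering_cost(points, centers, inner, outer):
    """Cluster-aware norm objective: assign each point to its nearest
    center, apply the inner norm to each cluster's distance vector, then
    the outer norm to the vector of cluster costs.  Assumes every center
    receives at least one point."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    clusters = {c: [] for c in range(len(centers))}
    for p in points:
        i = min(range(len(centers)), key=lambda c: dist(p, centers[c]))
        clusters[i].append(dist(p, centers[i]))
    return outer([inner(d) for d in clusters.values()])
```

Choosing `inner=sum, outer=sum` recovers $k$-median, while `inner=max, outer=max` recovers $k$-center, which is how this framework unifies the classical objectives.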
arXiv Detail & Related papers (2024-10-31T16:33:40Z)
- Optimal level set estimation for non-parametric tournament and crowdsourcing problems [49.75262185577198]
Motivated by crowdsourcing, we consider a problem where we partially observe the correctness of the answers of $n$ experts on $d$ questions.
In this paper, we assume that the matrix $M$, containing the probability that expert $i$ answers question $j$ correctly, is bi-isotonic up to a permutation of its rows and columns.
We construct an efficient-time algorithm that turns out to be minimax optimal for this classification problem.
arXiv Detail & Related papers (2024-08-27T18:28:31Z)
- Nearly Linear Sparsification of $\ell_p$ Subspace Approximation [47.790126028106734]
A popular approach to cope with the NP-hardness of the $\ell_p$ subspace approximation problem is to compute a strong coreset.
We obtain the first algorithm for constructing a strong coreset for $\ell_p$ subspace approximation with a nearly optimal dependence on the rank parameter $k$.
Our techniques also lead to the first nearly optimal online strong coresets for $\ell_p$ subspace approximation, with bounds similar to the offline setting.
arXiv Detail & Related papers (2024-07-03T16:49:28Z)
- A Unified Framework for Gradient-based Clustering of Distributed Data [51.904327888475606]
We develop a family of distributed clustering algorithms that work over networks of users.
DGC-$\mathcal{F}_\rho$ is specialized to popular clustering losses like $K$-means and Huber loss.
We show that consensus fixed points of DGC-$\mathcal{F}_\rho$ are equivalent to fixed points of gradient clustering over the full data.
arXiv Detail & Related papers (2024-02-02T10:44:42Z)
- New Coresets for Projective Clustering and Applications [34.82221047030618]
Given a set of points $P$ in $\mathbb{R}^d$, the goal is to find $k$ flats of dimension $j$, i.e., affine subspaces, that best fit $P$ under a given distance measure.
We show that our construction provides efficient coreset constructions for Cauchy, Welsch, Huber, Geman-McClure, Tukey, $L_\infty$, and Fair regression.
arXiv Detail & Related papers (2022-03-08T19:50:27Z)
- Improved Search of Relevant Points for Nearest-Neighbor Classification [91.3755431537592]
We say a training point is relevant if its omission from the training set could induce the misclassification of some query point in $\mathbb{R}^d$.
An output-sensitive algorithm was proposed to find the set of border points of $P$ in $O(n^2 + nk^2)$ time, where $k$ is the size of that set.
In this paper, we improve this algorithm to have time complexity equal to $O(nk^2)$ by proving that the first steps of their algorithm, which require $O(n^2)$ …
arXiv Detail & Related papers (2022-03-07T18:10:27Z)
- Towards Optimal Lower Bounds for k-median and k-means Coresets [25.713987341159918]
Given a set of points in a metric space, the $(k,z)$-clustering problem consists of finding a set of $k$ points called centers.
We show that any coreset for $(k,z)$-clustering must consist of at least $\Omega(k \varepsilon^{-2} \log n)$ and $\Omega(k \varepsilon^{-2} D)$ points.
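The $(k,z)$-clustering cost these lower bounds refer to has a compact definition: each point pays its distance to the nearest center, raised to the power $z$. A minimal sketch for Euclidean points (the function name is hypothetical):

```python
def kz_cost(points, centers, z):
    """(k, z)-clustering cost: sum over all points of the distance to the
    nearest center raised to the power z (z=1: k-median, z=2: k-means)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return sum(min(dist(p, c) for c in centers) ** z for p in points)
```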
arXiv Detail & Related papers (2022-02-25T16:13:28Z)
- FriendlyCore: Practical Differentially Private Aggregation [67.04951703461657]
We propose a simple and practical tool $\mathsf{FriendlyCore}$ that takes a set of points $\mathcal{D}$ from an unrestricted (pseudo) metric space as input.
When $\mathcal{D}$ has effective diameter $r$, $\mathsf{FriendlyCore}$ returns a "stable" subset $\mathcal{D}_G \subseteq \mathcal{D}$ that includes all points.
$\mathsf{FriendlyCore}$ can be used to preprocess the input before privately aggregating it, potentially simplifying the aggregation or boosting its accuracy.
arXiv Detail & Related papers (2021-10-19T17:43:50Z)
- Coreset-based Strategies for Robust Center-type Problems [0.6875312133832077]
We devise coreset-based strategies for the two problems which yield efficient sequential, MapReduce, and Streaming algorithms.
For wide ranges of the parameters $k, z, \epsilon, D$, we obtain a sequential algorithm with running time linear in $|V|$, and MapReduce/Streaming algorithms with few rounds/passes and substantially sublinear local/working memory.
arXiv Detail & Related papers (2020-02-18T10:04:08Z)
- Coresets for the Nearest-Neighbor Rule [78.15296214629433]
Nearest-neighbor condensation deals with finding a subset $R \subseteq P$.
This paper introduces the concept of coresets for nearest-neighbor classification.
We propose quadratic-time approximation algorithms with provable upper-bounds on the size of their selected subsets.
arXiv Detail & Related papers (2020-02-16T19:00:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.