Distributed k-Means with Outliers in General Metrics
- URL: http://arxiv.org/abs/2202.08173v2
- Date: Fri, 18 Feb 2022 16:56:37 GMT
- Title: Distributed k-Means with Outliers in General Metrics
- Authors: Enrico Dandolo, Andrea Pietracaprina, Geppino Pucci
- Abstract summary: We present a distributed coreset-based 3-round approximation algorithm for k-means with $z$ outliers for general metric spaces.
An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by the doubling dimension $D$ of the metric space.
- Score: 0.6117371161379208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Center-based clustering is a pivotal primitive for unsupervised learning and
data analysis. A popular variant is undoubtedly the k-means problem, which,
given a set $P$ of points from a metric space and a parameter $k<|P|$, requires
determining a subset $S$ of $k$ centers minimizing the sum of all squared
distances of points in $P$ from their closest center. A more general
formulation, known as k-means with $z$ outliers, introduced to deal with noisy
datasets, features a further parameter $z$ and allows up to $z$ points of $P$
(outliers) to be disregarded when computing the aforementioned sum. We present
a distributed coreset-based 3-round approximation algorithm for k-means with
$z$ outliers for general metric spaces, using MapReduce as a computational
model. Our distributed algorithm requires sublinear local memory per reducer,
and yields a solution whose approximation ratio is an additive term $O(\gamma)$
away from the one achievable by the best known sequential (possibly bicriteria)
algorithm, where $\gamma$ can be made arbitrarily small. An important feature
of our algorithm is that it obliviously adapts to the intrinsic complexity of
the dataset, captured by the doubling dimension $D$ of the metric space. To the
best of our knowledge, no previous distributed approaches were able to attain
similar quality-performance tradeoffs for general metrics.
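To make the objective concrete, here is a minimal sketch (not the paper's distributed algorithm; Euclidean points and the helper name are assumptions for illustration) of the k-means cost with $z$ outliers: the $z$ points farthest from their closest center are discarded before summing.

```python
import numpy as np

def kmeans_cost_with_outliers(P, S, z):
    """k-means cost of center set S on point set P, ignoring the z
    points farthest from their closest center (the outliers)."""
    # Squared distance from each point to its nearest center.
    d2 = ((P[:, None, :] - S[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    # Drop the z largest contributions and sum the rest.
    return np.sort(d2)[:len(P) - z].sum()

# Toy usage: 100 random points, 3 centers drawn from the data, z = 2.
rng = np.random.default_rng(0)
P = rng.normal(size=(100, 2))
S = P[rng.choice(len(P), size=3, replace=False)]
print(kmeans_cost_with_outliers(P, S, z=2))
```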
Related papers
- Adaptive $k$-nearest neighbor classifier based on the local estimation of the shape operator [49.87315310656657]
We introduce a new adaptive $k$-nearest neighbours ($kK$-NN) algorithm that explores the local curvature at a sample to adaptively define the neighborhood size.
Results on many real-world datasets indicate that the new $kK$-NN algorithm yields superior balanced accuracy compared to the established $k$-NN method.
arXiv Detail & Related papers (2024-09-08T13:08:45Z)
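As an illustration of the adaptive-neighborhood idea only (the paper estimates curvature via the shape operator; the crude distance-spread proxy and all names below are stand-ins), a toy per-query choice of $k$ might look like:

```python
import numpy as np
from collections import Counter

def adaptive_knn_predict(Xtr, ytr, x, k_min=3, k_max=25):
    """Toy adaptive k-NN: the neighborhood size is chosen per query.
    A crude stand-in for local curvature (the relative spread of the
    k_max nearest distances) shrinks k in irregular regions and
    grows it in flat ones."""
    d = np.linalg.norm(Xtr - x, axis=1)
    order = np.argsort(d)
    near = d[order[:k_max]]
    spread = near.std() / (near.mean() + 1e-12)  # low spread ~ "flat"
    k = int(np.clip(k_max * (1.0 - spread), k_min, k_max))
    return Counter(ytr[order[:k]]).most_common(1)[0][0]
```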
- Private Geometric Median [10.359525525715421]
We study differentially private (DP) algorithms for computing the geometric median (GM) of a dataset.
Our main contribution is a pair of DP algorithms for the task of private GM with an excess error guarantee that scales with the effective diameter of the datapoints.
arXiv Detail & Related papers (2024-06-11T16:13:09Z)
- A Unified Framework for Gradient-based Clustering of Distributed Data [51.904327888475606]
We develop a family of distributed clustering algorithms that work over networks of users.
DGC-$\mathcal{F}_\rho$ is specialized to popular clustering losses like $K$-means and Huber loss.
We show that consensus fixed points of DGC-$\mathcal{F}_\rho$ are equivalent to fixed points of gradient clustering over the full data.
arXiv Detail & Related papers (2024-02-02T10:44:42Z)
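To give a flavor of gradient-based clustering (a single-machine sketch, not the DGC-$\mathcal{F}_\rho$ framework itself, which runs such updates over a network of users), one gradient step on the $K$-means loss looks like:

```python
import numpy as np

def kmeans_gradient_step(X, C, lr=0.01):
    """One gradient-descent step on the k-means loss
    L(C) = sum_i min_j ||x_i - c_j||^2 over centers C."""
    # Assign each point to its nearest center.
    assign = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    G = np.zeros_like(C)
    for j in range(len(C)):
        members = X[assign == j]
        if len(members):
            # dL/dc_j = 2 * sum_{i in cluster j} (c_j - x_i)
            G[j] = 2.0 * (len(members) * C[j] - members.sum(axis=0))
    return C - lr * G
```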
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct a distance matrix between data points using a Butterworth filter.
To fully exploit the complementary information embedded in different views, we leverage tensor Schatten $p$-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- Parameterized Approximation Schemes for Clustering with General Norm Objectives [0.6956677722498498]
This paper considers the well-studied regime of designing a $(k,\epsilon)$-approximation algorithm for a $k$-clustering problem.
Our main contribution is a clean and simple EPAS (efficient parameterized approximation scheme) that settles more than ten clustering problems.
arXiv Detail & Related papers (2023-04-06T15:31:37Z)
- Global $k$-means$++$: an effective relaxation of the global $k$-means clustering algorithm [0.20305676256390928]
The $k$-means algorithm is a prevalent clustering method due to its simplicity, effectiveness, and speed.
We propose the global $k$-means$++$ clustering algorithm, which is an effective way of acquiring quality clustering solutions.
arXiv Detail & Related papers (2022-11-22T13:42:53Z)
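For reference, the classic $k$-means++ seeding rule that this method builds on samples each new center with probability proportional to its squared distance from the centers chosen so far; a minimal sketch:

```python
import numpy as np

def kmeans_pp_seeding(X, k, seed=0):
    """Classic k-means++ seeding: after a uniform first pick, each new
    center is sampled with probability proportional to the squared
    distance from the nearest already-chosen center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```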
- Improved Learning-augmented Algorithms for k-means and k-medians Clustering [8.04779839951237]
We consider the problem of clustering in the learning-augmented setting, where we are given a data set in $d$-dimensional Euclidean space.
We propose a deterministic $k$-means algorithm that produces centers with improved bound on clustering cost.
Our algorithm works even when the predictions are not very accurate, i.e., our bound holds for $\alpha$ up to $1/2$, an improvement over $\alpha$ being at most $1/7$ in the previous work.
arXiv Detail & Related papers (2022-10-31T03:00:11Z)
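A toy version of the learning-augmented idea (the helper name and the simple trimming rule are assumptions; the paper's actual algorithm and its $\alpha \le 1/2$ guarantee are more involved) estimates each center robustly from the possibly erroneous predicted labels:

```python
import numpy as np

def centers_from_predictions(X, pred_labels, k, trim=0.1):
    """Toy learning-augmented k-means step: given predicted cluster
    labels that may be wrong on a fraction of points, estimate each
    center with a coordinate-wise trimmed mean so that a few
    mislabeled points cannot drag the center far."""
    centers = np.zeros((k, X.shape[1]))
    for j in range(k):
        pts = X[pred_labels == j]
        if len(pts) == 0:
            continue
        lo, hi = np.quantile(pts, [trim, 1 - trim], axis=0)
        mask = (pts >= lo) & (pts <= hi)  # keep the central band per coordinate
        centers[j] = (pts * mask).sum(axis=0) / np.maximum(mask.sum(axis=0), 1)
    return centers
```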
- FriendlyCore: Practical Differentially Private Aggregation [67.04951703461657]
We propose a simple and practical tool $\mathsf{FriendlyCore}$ that takes a set of points $\mathcal{D}$ from an unrestricted (pseudo) metric space as input.
When $\mathcal{D}$ has effective diameter $r$, $\mathsf{FriendlyCore}$ returns a "stable" subset $\mathcal{D}_G \subseteq \mathcal{D}$ that includes all points.
$\mathsf{FriendlyCore}$ can be used to preprocess the input before privately aggregating it, potentially simplifying the aggregation or boosting its accuracy.
arXiv Detail & Related papers (2021-10-19T17:43:50Z)
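A non-private caricature of the "stable subset" idea (this sketch deliberately ignores the differential-privacy machinery that is FriendlyCore's actual point): keep only points that are close to a majority of the dataset, which forces the survivors to have small diameter.

```python
import numpy as np

def stable_subset(D, r):
    """Keep a point only if more than half of the dataset lies within
    distance r of it; any two survivors then share a common neighbor,
    so their distance is at most 2r by the triangle inequality."""
    dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    close = (dist <= r).sum(axis=1)
    return D[close > len(D) / 2]
```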
- Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z)
- List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $\widetilde{O}(ndk)$ for all $k = O(\sqrt{d}) \cup \Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $\widetilde{O}(ndk)$ for all $k$, at the expense of an $O(\sqrt{\log k})$ factor in the recovery guarantee.
arXiv Detail & Related papers (2020-11-19T17:21:37Z)
- List-Decodable Mean Estimation via Iterative Multi-Filtering [44.805549762166926]
We are given a set $T$ of points in $\mathbb{R}^d$ with the promise that an unknown $\alpha$-fraction of points in $T$ are drawn from a distribution $D$ with unknown mean and bounded covariance.
We output a small list of hypothesis vectors such that at least one of them is close to the mean of $D$.
In more detail, our algorithm is sample and computationally efficient, and achieves information-theoretically near-optimal error.
arXiv Detail & Related papers (2020-06-18T17:47:37Z)
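As a baseline for intuition only (the papers above obtain provable guarantees via filtering and PCA-style techniques; the helper below is a hypothetical stand-in, not their algorithm), one can always produce an $O(1/\alpha)$-sized hypothesis list by clustering the corrupted data and returning every cluster mean:

```python
import numpy as np
from sklearn.cluster import KMeans

def naive_hypothesis_list(T, alpha, seed=0):
    """Cluster the corrupted dataset into O(1/alpha) groups and return
    every group mean as a candidate; if the alpha-fraction of inliers
    is well separated, one candidate lands near their true mean."""
    m = max(1, int(np.ceil(1.0 / alpha)))
    km = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(T)
    return km.cluster_centers_
```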
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.